The tools we built, in the open.
The eval and agent infrastructure behind Studio, released for the teams doing the same work. MIT-licensed, used in production.
The measurement spine for AI-native engineering teams — harness, fixtures, CI reporters, and regression gates. The same eval discipline we install on every engagement, in a library.
A spine for engineering-team eval suites — harness, fixtures, CI reporters.
Orchestration primitives for long-running agentic ops fleets.
Self-healing post-deploy fix loop — detect, patch, eval, merge.
Turn eval runs into a readable scorecard — trends, regressions.
The convention templates we install during Advisory.
Worked examples wiring evalkit into common CI systems.
Built something on these?
Issues, PRs, and questions are read directly. The same people who ship the Studio builds maintain the repos.