Active · Started May 2026

Agent DOE Engine

Design-of-experiments engine for tuning AI agents — finds which settings actually move your results, and the best combo when goals compete.

TypeScript · Python · NumPy · Claude Code Plugin

The Problem

Tuning an agent the usual way fails twice. One-at-a-time testing is slow and blind — it misses settings that only matter in combination (a bigger batch size that only helps once you also add workers). And a single run can fool you: every run carries random variation, so a number that looks better might just be noise.
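
To make the batch-size and workers example concrete, here is a minimal sketch. The `throughput` function and its numbers are hypothetical, invented only to illustrate an interaction; they are not measurements from the engine. One-at-a-time changes from the baseline each show no gain, while the four-run factorial exposes the combined effect.

```python
from itertools import product

def throughput(batch_big, more_workers):
    """Hypothetical response: the bigger batch only pays off with more workers."""
    return 100 + (40 if batch_big and more_workers else 0)

# One-at-a-time: change a single setting from the baseline and compare.
baseline = throughput(False, False)
oat_batch = throughput(True, False) - baseline      # gain from batch alone
oat_workers = throughput(False, True) - baseline    # gain from workers alone

# 2x2 full factorial: every combination, coded -1 (low) / +1 (high).
runs = [(b, w, throughput(b == 1, w == 1)) for b, w in product((-1, 1), repeat=2)]

def effect(contrast):
    """Average response where the contrast is +1 minus average where it is -1."""
    return sum(contrast(b, w) * y for b, w, y in runs) / (len(runs) / 2)

batch_effect = effect(lambda b, w: b)
workers_effect = effect(lambda b, w: w)
interaction = effect(lambda b, w: b * w)

print("one-at-a-time gains:", oat_batch, oat_workers)
print("factorial effects:", batch_effect, workers_effect, interaction)
```

In this toy case both one-at-a-time gains are zero, so either setting would be discarded, yet the factorial shows a nonzero interaction.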

What I Built

A design-of-experiments engine that varies several settings together in one planned batch, then tells you which changes are real. You list the settings to test and the results you care about, such as speed, cost, quality, or accuracy. For each result it reports which settings moved it and by how much (including interaction-only effects), whether that movement is a real effect or a fluke, and which effects are mathematically tangled in this design. When goals compete, it also reports the single best configuration.

Real Effect vs Fluke

Run the same setting twice and the result won’t be identical — that spread is the noise. A real effect is a change bigger than the noise; a fluke fits inside it. Every effect gets a p-value and a confidence interval, with a low-power warning when there aren’t enough runs to tell the two apart. The point: you don’t ship a change that was never real.
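
One way to ask "is this change bigger than the noise?" is an exact permutation test, sketched below. This is a swapped-in illustration: the source does not say how the engine computes its p-values, and the latency numbers are made up. The idea is to compare the observed difference in means with the differences obtained from every possible relabeling of the same runs.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in means."""
    pooled = group_a + group_b
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    total = 0
    # Try every way of relabeling n_a of the pooled runs as "group a".
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point rounding.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total

# Hypothetical latencies in seconds, four repeated runs per setting.
before = [10.2, 9.8, 10.5, 10.1]
after_real = [8.9, 9.1, 8.7, 9.3]      # clearly separated from "before"
after_noise = [10.0, 10.4, 9.9, 10.3]  # overlaps with "before"

p_real = permutation_p_value(before, after_real)
p_noise = permutation_p_value(before, after_noise)
print(p_real, p_noise)
```

With only four runs per group, the smallest two-sided p-value this test can produce is 2/70, which is one way to see why too few runs cannot separate a real effect from a fluke.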

The Designs It Uses

It picks the smallest design that still answers the question:

| Settings | Design | What it does |
|---|---|---|
| 2–3 | Full factorial (4–8 runs) | Tests every combination; most accurate, nothing tangled |
| 4–7 | Fractional factorial (8 runs) | A carefully chosen subset; reports which effects are tangled |
| 8–11 | Plackett-Burman (12 runs) | Screening: finds the few settings that matter out of many |
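
The three designs in the table can be sketched with textbook constructions. The function names here are hypothetical and this is not the engine's actual code: `pick_design` simply mirrors the table, the half fraction uses the standard generator D = ABC, and the 12-run Plackett-Burman design is built from cyclic shifts of the usual generator row. The error for more than 11 settings is an assumption, since the source does not say what happens there.

```python
from itertools import product

def pick_design(k):
    """Map a number of two-level settings to the design named in the table."""
    if k < 1:
        raise ValueError("need at least one setting")
    if k == 1:
        return "keep-if-better loop"
    if k <= 3:
        return "full factorial"
    if k <= 7:
        return "fractional factorial"
    if k <= 11:
        return "Plackett-Burman"
    raise ValueError("more than 11 settings is outside the table")  # assumption

def full_factorial(k):
    """All 2**k combinations of k settings, coded -1 (low) / +1 (high)."""
    return [list(run) for run in product((-1, 1), repeat=k)]

def half_fraction_4():
    """2^(4-1) design: 8 runs for 4 settings, with D set to A*B*C."""
    return [[a, b, c, a * b * c] for a, b, c in product((-1, 1), repeat=3)]

def plackett_burman_12():
    """12 runs x 11 columns: cyclic shifts of one row plus an all-low run."""
    first = [1, 1, -1, 1, 1, 1, -1, -1, -1, 1, -1]
    rows = [first[-i:] + first[:-i] for i in range(11)]
    rows.append([-1] * 11)
    return rows

def column(design, j):
    return [run[j] for run in design]

def times(u, v):
    """Elementwise product: the column an interaction effect is estimated from."""
    return [a * b for a, b in zip(u, v)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# In the half fraction, the AB and CD interaction columns are identical,
# so their effects cannot be told apart ("tangled", or aliased).
half = half_fraction_4()
A, B, C, D = (column(half, j) for j in range(4))
ab_tangled_with_cd = times(A, B) == times(C, D)

# In the Plackett-Burman design, every pair of setting columns is orthogonal.
pb = plackett_burman_12()
pb_orthogonal = all(
    dot(column(pb, i), column(pb, j)) == 0
    for i in range(11) for j in range(i + 1, 11)
)
print(pick_design(5), ab_tangled_with_cd, pb_orthogonal)
```

The fraction saves runs at the price of this tangling, which is why reporting the alias structure matters: an apparent AB effect in the 8-run design could equally be a CD effect.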

For a single setting it falls back to a “try a change, measure, keep it if better” loop — cheaper to set up, but blind to interactions.
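
The single-setting fallback can be pictured as the loop below. This is a minimal sketch under assumptions: `keep_if_better` and `measure` are hypothetical names, higher scores are taken to be better, and each candidate is measured once, so this version has no protection against noise.

```python
def keep_if_better(measure, start, candidates):
    """Try each candidate value in turn; keep it only if it beats the best so far."""
    best, best_score = start, measure(start)
    for value in candidates:
        score = measure(value)
        if score > best_score:
            best, best_score = value, score
    return best, best_score

# Hypothetical, noise-free quality score that peaks at a setting value of 3.
def measure(value):
    return -(value - 3) ** 2

best_value, best_score = keep_if_better(measure, start=1, candidates=[2, 3, 4])
print(best_value, best_score)
```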

Competing Goals

When the numbers you care about pull against each other (faster, cheaper, and more accurate rarely arrive together), it offers three ways to choose: scalarize (pick the best weighted blend of the goals), desirability (every goal must clear a minimum bar), and pareto (show all the best trade-offs before committing).
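
The three selection modes can be illustrated on a toy set of configurations. Everything in this sketch is assumed for illustration: the scores are invented and already scaled to 0..1 with higher meaning better, the function names are hypothetical, and the desirability formula (a geometric mean that drops to zero when any goal misses its floor) is one common formulation rather than the engine's confirmed method.

```python
# Hypothetical scores per configuration, each scaled to 0..1, higher is better.
configs = {
    "A": {"speed": 0.9, "cheapness": 0.2, "accuracy": 0.8},
    "B": {"speed": 0.6, "cheapness": 0.7, "accuracy": 0.7},
    "C": {"speed": 0.5, "cheapness": 0.6, "accuracy": 0.6},
    "D": {"speed": 0.3, "cheapness": 0.9, "accuracy": 0.9},
}

def scalarize(configs, weights):
    """Pick the configuration with the best weighted sum of its scores."""
    def blend(scores):
        return sum(weights[g] * scores[g] for g in weights)
    return max(configs, key=lambda name: blend(configs[name]))

def desirability(configs, floors):
    """Geometric mean of per-goal desirabilities; zero if any goal misses its floor."""
    def overall(scores):
        result = 1.0
        for goal, floor in floors.items():
            if scores[goal] < floor:
                return 0.0
            result *= (scores[goal] - floor) / (1.0 - floor)
        return result ** (1.0 / len(floors))
    return max(configs, key=lambda name: overall(configs[name]))

def pareto(configs):
    """Keep every configuration that no other one matches or beats on all goals."""
    def dominates(p, q):
        return all(p[g] >= q[g] for g in p) and any(p[g] > q[g] for g in p)
    return {
        name for name, scores in configs.items()
        if not any(dominates(other, scores)
                   for other_name, other in configs.items() if other_name != name)
    }

best_blend = scalarize(configs, {"speed": 0.5, "cheapness": 0.25, "accuracy": 0.25})
best_floor = desirability(configs, {"speed": 0.4, "cheapness": 0.4, "accuracy": 0.4})
front = pareto(configs)
print(best_blend, best_floor, sorted(front))
```

On this toy data the modes disagree: the speed-heavy blend favors A, the minimum bar rules out A and D and leaves B, and the Pareto view keeps A, B, and D while dropping C because B beats it on every goal.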

Stack

TypeScript + Python (NumPy), packaged as @tyroneross/agent-doe-engine with a Claude Code plugin surface. Licensed under Apache-2.0.