feat(tuner): build monkey-tuner binary for log-driven recipe tuning (MK-10) #42

Merged

David merged 1 commit from feat/monkey-tuner-MK-10 into main

2026-05-25 02:55:07 +02:00

David commented

2026-05-25 02:49:34 +02:00

Owner

MK-10: build monkey-tuner binary

Ships the second half of MK-5 Layer 3: a standalone `monkey-tuner` binary that consumes the feedback log written by `monkey image auto --record` / `monkey image rate` and closes the training loop. `monkey` itself is unchanged.

What it does

- `monkey-tuner tune --log <runs.jsonl> [--class <name>] [--seed N] [--recipe-dir DIR]` reads the NDJSON log, joins each run to its most recent rating by the input blake3 hash, groups the pairs by class, and for every class with at least 20 rated runs grid-searches a per-class recipe parameter delta over a seeded 80/20 split. It writes `recipes/<class>.next.toml` with a provenance header (live-recipe sha, seed, sample sizes, training and held-out mean scores) and flags non-positive deltas with `# REGRESSION`. Classes below the minimum are skipped with a one-line message.
- `monkey-tuner replay --log <runs.jsonl> [--class <name>]` reports the current-recipe mean score per class and writes nothing.
- `monkey-tuner promote <class> [--recipe-dir DIR] [--force]` atomically renames `<class>.next.toml` to `<class>.toml` via `rename(2)`, refuses when the proposal is missing, and refuses (unless `--force`) when the live recipe changed since the proposal was generated (its sha is pinned in the proposal header).
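The join and grouping step that `tune` performs can be sketched roughly as follows. This is a minimal illustration, not the code in `src/bin/monkey-tuner.rs`: the `Run` and `Rating` structs, their field names, the `join_by_class` and `tunable_classes` helpers, and the assumption that records arrive in log order (so a later rating overwrites an earlier one) are all hypothetical. Only the 20-run minimum and the join key (the input blake3 hash) come from the description above.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical shape of a run record; the real log schema may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct Run {
    pub input_blake3: String,
    pub class: String,
    pub args: Vec<f64>,
}

/// Hypothetical shape of a rating record.
#[derive(Debug, Clone, PartialEq)]
pub struct Rating {
    pub input_blake3: String,
    pub score: f64,
}

/// Minimum number of rated runs a class needs before it is tuned.
pub const MIN_RATED_RUNS: usize = 20;

/// Join each run to its most recent rating (keyed by input hash) and group by class.
/// Assumes `ratings` is in log order, so the last rating for a hash wins.
/// Runs without any rating are dropped.
pub fn join_by_class(runs: &[Run], ratings: &[Rating]) -> BTreeMap<String, Vec<(Run, f64)>> {
    let mut latest: HashMap<&str, f64> = HashMap::new();
    for rating in ratings {
        // Later entries overwrite earlier ones for the same hash.
        latest.insert(rating.input_blake3.as_str(), rating.score);
    }

    let mut by_class: BTreeMap<String, Vec<(Run, f64)>> = BTreeMap::new();
    for run in runs {
        if let Some(&score) = latest.get(run.input_blake3.as_str()) {
            by_class
                .entry(run.class.clone())
                .or_default()
                .push((run.clone(), score));
        }
    }
    by_class
}

/// Names of the classes that have enough rated runs to be tuned.
pub fn tunable_classes(by_class: &BTreeMap<String, Vec<(Run, f64)>>) -> Vec<&str> {
    by_class
        .iter()
        .filter(|(_, pairs)| pairs.len() >= MIN_RATED_RUNS)
        .map(|(class, _)| class.as_str())
        .collect()
}
```

A `BTreeMap` is used here only so that classes come out in a stable order, which keeps any per-class output deterministic for a given log.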

Scoring model: deliberate deviation from the spec wording

MK-10 step 4 describes re-running the recipe against each input to compute a new `output_blake3` and looking up its rating. The feedback log (`src/image/feedback.rs`) records only blake3 hashes, never input paths or bytes, so the original inputs cannot be replayed. Even if they could, a never-before-produced output would have no human rating to look up. This PR implements the equivalent that the log schema actually supports: each rated run is one `(args, score)` observation, and a candidate argument set is scored by inverse-distance-weighted (IDW) regression over those observations in normalised argument space. Grid search picks the candidate that scores best on the training split, and the held-out split validates that it generalises. The deviation is documented in the module header, the README, and the commit body. True image replay is a follow-up that first needs the log to carry input paths.
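For readers unfamiliar with inverse-distance weighting, the scoring idea can be sketched as below. This is an illustrative version under stated assumptions, not the tuner's actual implementation: the `Observation` struct, the `idw_score` function, the Euclidean distance metric, and the `power` parameter are all hypothetical choices, and arguments are assumed to be already normalised to a common range per axis.

```rust
/// One rated run: its (already normalised) arguments and its human score.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone)]
pub struct Observation {
    pub args: Vec<f64>,
    pub score: f64,
}

/// Predict a score for `candidate` as the inverse-distance-weighted mean of
/// the observed scores. `power` controls how quickly an observation's
/// influence decays with distance (2.0 is a common choice).
/// Returns `None` when there are no observations.
/// Assumes `candidate` and every `obs.args` have the same length.
pub fn idw_score(candidate: &[f64], observations: &[Observation], power: f64) -> Option<f64> {
    if observations.is_empty() {
        return None;
    }

    let mut weighted_sum = 0.0;
    let mut weight_total = 0.0;

    for obs in observations {
        // Euclidean distance in normalised argument space.
        let dist = candidate
            .iter()
            .zip(&obs.args)
            .map(|(a, b)| (a - b).powi(2))
            .sum::<f64>()
            .sqrt();

        // A candidate that coincides with an observation takes that
        // observation's score directly; this also avoids dividing by zero.
        if dist < 1e-12 {
            return Some(obs.score);
        }

        let weight = 1.0 / dist.powf(power);
        weighted_sum += weight * obs.score;
        weight_total += weight;
    }

    Some(weighted_sum / weight_total)
}
```

With observations at `0.0` (score 1) and `1.0` (score 5), a candidate at `0.5` is equidistant from both and is predicted as their mean, while a candidate closer to `1.0` is pulled towards the higher score. A grid search would call such a function on each grid point and keep the best-scoring one.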

Wiring

- `src/bin/monkey-tuner.rs` is auto-discovered as a second binary (no `Cargo.toml` change, no new dependency: the seeded split uses `StdRng` already in the tree via `rand`).
- The OCI runtime image and the static-musl `binary` export stage both copy `monkey-tuner` next to `monkey`; the build-binary workflow publishes `monkey-tuner-linux-x86_64` to the generic package registry and Dufs alongside `monkey-linux-x86_64`.
- README gains a `## Tuning` subsection right after `## Workflow`.

Verification

- In-tree unit tests (18) cover NDJSON parsing including malformed-line and wrong-typed-record skipping, the run/rating join and class grouping, the grid generator, 80/20 split determinism under a fixed seed, tunable discovery, IDW scoring direction, recipe sha stability, recipe rendering round-trip, and the four promotion paths.
- `just check` passes locally: fmt, `clippy --all-targets -D warnings`, build, tests, and the builder-stage Docker compile (which produces both musl binaries; the `binary` export stage was confirmed to emit `monkey` and `monkey-tuner`).
- A synthetic 25-run log was exercised end to end (replay, tune, promote) to confirm the output format matches the spec example and that the malformed trailing line is skipped with a warning rather than aborting.

Closes MK-10.

🤖 Generated with Claude Code


David added 1 commit

2026-05-25 02:49:34 +02:00

feat(tuner): build monkey-tuner binary for log-driven recipe tuning

Check / fmt + clippy + build + tests (pull_request) Successful in 18s


Create release / Create release from merged PR (pull_request) Has been skipped


6b75dc3eb2

Ships the second half of MK-5 Layer 3: a standalone `monkey-tuner` binary that consumes the `monkey image auto --record` / `monkey image rate` feedback log and closes the training loop. It reads the NDJSON log, joins each run to its most recent rating by the input blake3 hash, groups the pairs by class, grid-searches a per-class recipe parameter delta over a seeded 80/20 split, writes `recipes/<class>.next.toml` with a provenance header (live-recipe sha, seed, sample sizes, train/held-out scores, REGRESSION flag when the delta is non-positive), and exposes `promote` (atomic `rename(2)` with a sha-pin guard and `--force`) and `replay` (report-only). The binary is standalone: it shares no code with the `monkey` hot path, adds no new dependency (seeded split uses StdRng, already in the tree via `rand`), and mirrors only the stable wire schema of the log records and recipe TOML.

Scoring deviates from the literal MK-10 wording on purpose, and the deviation is documented in the module header and the README. The spec describes re-running the recipe against each input to compute a new `output_blake3` and looking up its rating, but the feedback log records only blake3 hashes (never input paths or bytes), so the original inputs cannot be replayed. Even if they could, a never-before-produced output has no human rating to look up. The implementable equivalent treats each rated run as one `(args, score)` observation and scores a candidate argument set by inverse-distance-weighted regression over those observations in normalised argument space. Grid search picks the candidate that scores best on the training split, and the held-out split validates that it generalises. True image replay is a follow-up that first needs the log to carry input paths.

Wiring: `src/bin/monkey-tuner.rs` is auto-discovered as a second binary (no Cargo.toml change). The OCI runtime image and the static-musl `binary` export stage both copy `monkey-tuner` alongside `monkey`, and the build-binary workflow publishes `monkey-tuner-linux-x86_64` to the generic package registry and Dufs next to `monkey-linux-x86_64`. README gains a `## Tuning` subsection right after `## Workflow`.

In-tree unit tests cover NDJSON parsing (including malformed-line and wrong-typed-record skipping), the run/rating join and class grouping, the grid generator (single-point, integer rounding, float span), the 80/20 split determinism under a fixed seed, tunable discovery, IDW scoring direction, recipe sha stability, recipe rendering round-trip, and the four promotion paths (rename, missing-next refusal, sha-mismatch refusal, matching-pin accept). `just check` passes (fmt, clippy --all-targets -D warnings, build, tests) and the builder-stage Docker compile produces both musl binaries.

#MK-10 State Done

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>