feat(tuner): carry input_path, replace IDW with replay-then-rate validation (MK-25) #43

Merged

David merged 1 commit from feat/tuner-input-path-replay-validation-MK-25 into main

2026-05-25 03:56:43 +02:00

David commented

2026-05-25 03:48:57 +02:00

Owner

What

Implements MK-25: the feedback log now carries input_path on every run line, and monkey-tuner replaces the inverse-distance-weighted (IDW) score guess with a real replay-then-rate validation pass.

Follow-up to MK-10 (which shipped the tuner with IDW scoring and flagged this exact gap in its module header). Parent: MK-5.

Changes

monkey image auto --record writes input_path (the path the user passed, unnormalised) on each run record. The rating record is unchanged.
New src/lib.rs library target so the monkey-tuner binary can call the recipe runner in-process instead of shelling out. src/main.rs becomes a thin dispatcher over monkey::*. New recipe::replay(input, output, steps) applies an explicit recipe with the exact per-op dispatch auto uses, so a replayed output is byte-identical to a real run. No new dependency.
monkey-tuner tune: keeps the seeded 80/20 split + IDW grid search to rank candidates, then replays the top candidate (K=1 MVP) on the training inputs. Writes proposed outputs under $XDG_CACHE_HOME/monkey/tuner/<class>/<timestamp>/ and current-recipe outputs into a sibling current/ dir for A/B review, appends a run record per replayed output (carrying the proposed recipe_sha + original input_path), prints the monkey image rate commands, and writes <class>.next.toml with the IDW-predicted delta plus the replay-output-dir.
Legacy records without input_path are skipped with a rate-limited warning (first 5 named, then a summary count). Clean break, no mixed-mode fallback.
Missing input paths are skipped with a warning; if more than 50% of a class's inputs are gone the class aborts with a corpus-has-moved message.
monkey-tuner promote <class> --log <log> is the validation gate: re-reads the log, computes the measured proposed-vs-current delta over inputs rated under both recipe shas (matched by (input_path, recipe_sha)), refuses a non-positive delta unless --force, and stamps the measured delta onto the promoted recipe's leading comment. Without --log it falls back to the prior sha-pin-only behaviour.
New read-only monkey-tuner validate --class <name> --log <log> reports the measured delta without writing.
README ## Workflow gains the rate-the-replayed-outputs round-trip; ## Tuning rewritten for replay-then-rate.

Implementation notes / assumptions

The local corpus tree is gitignored and never reaches CI, so (per the issue) this PR provides the schema change and the replay scaffolding; the maintainer validates the empirical end-to-end loop locally. Two decisions were made where the spec left room, both documented in code:

The measured delta is computed over the input_paths rated under both the proposed and current recipe shas (the paired re-rated set), rather than a separate seeded held-out split. This realises "matched by input_path and recipe_sha" deterministically and needs no seed at promote time.
tune appends a run record for each replayed proposed output (input = output hash, input_path = original path, recipe_sha = proposed). This is what lets a later monkey image rate <output> join back to the proposal so promote/validate can measure it. The shortlist is fixed at K=1 (no --shortlist flag) per the MVP scope.

Testing

just check passes locally: fmt, clippy --all-targets -D warnings, build --all-targets, tests, and the builder-stage Docker (musl) compile.

New unit tests: input_path NDJSON round-trip, legacy-record skipping with rate-limited warning, missing-path replay branch + >50% abort threshold, measured-delta join (paired and unpaired), and promote non-positive-delta refusal.

Closes MK-25.

🤖 Generated with Claude Code

## What Implements MK-25: the feedback log now carries `input_path` on every `run` line, and `monkey-tuner` replaces the inverse-distance-weighted (IDW) score guess with a real replay-then-rate validation pass. Follow-up to MK-10 (which shipped the tuner with IDW scoring and flagged this exact gap in its module header). Parent: MK-5. ## Changes - `monkey image auto --record` writes `input_path` (the path the user passed, unnormalised) on each run record. The rating record is unchanged. - New `src/lib.rs` library target so the `monkey-tuner` binary can call the recipe runner in-process instead of shelling out. `src/main.rs` becomes a thin dispatcher over `monkey::*`. New `recipe::replay(input, output, steps)` applies an explicit recipe with the exact per-op dispatch `auto` uses, so a replayed output is byte-identical to a real run. No new dependency. - `monkey-tuner tune`: keeps the seeded 80/20 split + IDW grid search to rank candidates, then replays the top candidate (K=1 MVP) on the training inputs. Writes proposed outputs under `$XDG_CACHE_HOME/monkey/tuner/<class>/<timestamp>/` and current-recipe outputs into a sibling `current/` dir for A/B review, appends a `run` record per replayed output (carrying the proposed `recipe_sha` + original `input_path`), prints the `monkey image rate` commands, and writes `<class>.next.toml` with the IDW-predicted delta plus the replay-output-dir. - Legacy records without `input_path` are skipped with a rate-limited warning (first 5 named, then a summary count). Clean break, no mixed-mode fallback. - Missing input paths are skipped with a warning; if more than 50% of a class's inputs are gone the class aborts with a corpus-has-moved message. - `monkey-tuner promote <class> --log <log>` is the validation gate: re-reads the log, computes the measured proposed-vs-current delta over inputs rated under both recipe shas (matched by `(input_path, recipe_sha)`), refuses a non-positive delta unless `--force`, and stamps the measured delta onto the promoted recipe's leading comment. Without `--log` it falls back to the prior sha-pin-only behaviour. - New read-only `monkey-tuner validate --class <name> --log <log>` reports the measured delta without writing. - README `## Workflow` gains the rate-the-replayed-outputs round-trip; `## Tuning` rewritten for replay-then-rate. ## Implementation notes / assumptions The local corpus tree is gitignored and never reaches CI, so (per the issue) this PR provides the schema change and the replay scaffolding; the maintainer validates the empirical end-to-end loop locally. Two decisions were made where the spec left room, both documented in code: - The measured delta is computed over the `input_path`s rated under both the proposed and current recipe shas (the paired re-rated set), rather than a separate seeded held-out split. This realises "matched by input_path and recipe_sha" deterministically and needs no seed at promote time. - `tune` appends a `run` record for each replayed proposed output (`input` = output hash, `input_path` = original path, `recipe_sha` = proposed). This is what lets a later `monkey image rate <output>` join back to the proposal so `promote`/`validate` can measure it. The shortlist is fixed at K=1 (no `--shortlist` flag) per the MVP scope. ## Testing `just check` passes locally: fmt, clippy `--all-targets -D warnings`, build `--all-targets`, tests, and the builder-stage Docker (musl) compile. New unit tests: `input_path` NDJSON round-trip, legacy-record skipping with rate-limited warning, missing-path replay branch + >50% abort threshold, measured-delta join (paired and unpaired), and `promote` non-positive-delta refusal. Closes MK-25. 🤖 Generated with [Claude Code](https://claude.com/claude-code)

David added 1 commit

2026-05-25 03:48:58 +02:00

feat(tuner): carry input_path, replace IDW guess with replay-then-rate

Check / fmt + clippy + build + tests (pull_request) Successful in 40s

Details

Create release / Create release from merged PR (pull_request) Has been skipped

Details

a33adf0dce

Closes the gap the MK-10 tuner header flagged: the feedback log now records the input file path on every `run` line, so `monkey-tuner` can re-run the pixel pipeline on the real corpus and report a measured score delta instead of an inverse-distance-weighted extrapolation. `monkey image auto --record` writes `input_path` (the path the user passed, unnormalised) on each run record; the rating record is unchanged.

To replay in-process without shelling out, the crate gains a `src/lib.rs` library target so the `monkey-tuner` binary can call the recipe runner directly: `monkey image auto` (`src/main.rs`) is now a thin dispatcher over `monkey::*`, and `recipe::replay(input, output, steps)` applies an explicit recipe with the exact per-op dispatch `auto` uses, so a replayed output is byte-identical to a real run of the same recipe. No new dependency is added.

`monkey-tuner tune` keeps the seeded 80/20 split and the IDW grid search to rank candidates, then replays the top candidate (K=1 for the MVP) on the training inputs: it writes the proposed outputs under `$XDG_CACHE_HOME/monkey/tuner/<class>/<timestamp>/` and the current-recipe outputs into a sibling `current/` dir for A/B review, appends a `run` record per replayed output (carrying the proposed `recipe_sha` and the original `input_path`, so a later rating joins by output hash), prints the `monkey image rate` commands to score them, and writes `<class>.next.toml` with the IDW-predicted delta plus the replay-output-dir. Run records that predate the schema (no `input_path`) are skipped with a rate-limited warning (first 5 named, then a summary count); a clean break, no mixed-mode fallback. If a training input's path is gone the tuner skips it with a warning, and if more than half a class's inputs are missing it aborts that class with a corpus-has-moved message.

`monkey-tuner promote <class> --log <log>` becomes the validation gate: it re-reads the log, computes the measured proposed-vs-current delta over the inputs rated under both recipe shas (matched by `(input_path, recipe_sha)` so one input rated under several variants does not cross-pollinate), refuses a non-positive delta unless `--force`, and stamps the measured delta onto the promoted recipe's leading comment. Without `--log` it falls back to the prior sha-pin-only behaviour. New read-only `monkey-tuner validate --class <name> --log <log>` reports the same measured delta without writing or promoting.

In-tree unit tests cover the `input_path` round-trip in the NDJSON record, legacy-record skipping with the rate-limited warning, the missing-path replay branch and the >50% abort threshold, the measured-delta join (paired and unpaired), and the `promote` non-positive-delta refusal (plus the existing sha-pin paths). README `## Workflow` gains the rate-the-replayed-outputs round-trip and `## Tuning` is rewritten for replay-then-rate. `just check` passes locally for fmt, clippy --all-targets -D warnings, build --all-targets, and tests; the corpus tree is gitignored so the empirical end-to-end loop is validated by the maintainer locally (per the issue).

#MK-25 State Done

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>