feat(tuner): carry input_path, replace IDW with replay-then-rate validation (MK-25) #43

Merged
David merged 1 commit from feat/tuner-input-path-replay-validation-MK-25 into main 2026-05-25 03:56:43 +02:00
Owner

What

Implements MK-25: the feedback log now carries input_path on every run line, and monkey-tuner replaces the inverse-distance-weighted (IDW) score guess with a real replay-then-rate validation pass.

Follow-up to MK-10 (which shipped the tuner with IDW scoring and flagged this exact gap in its module header). Parent: MK-5.

Changes

  • monkey image auto --record writes input_path (the path the user passed, unnormalised) on each run record. The rating record is unchanged.
  • New src/lib.rs library target so the monkey-tuner binary can call the recipe runner in-process instead of shelling out. src/main.rs becomes a thin dispatcher over monkey::*. New recipe::replay(input, output, steps) applies an explicit recipe with the exact per-op dispatch auto uses, so a replayed output is byte-identical to a real run. No new dependency.
  • monkey-tuner tune: keeps the seeded 80/20 split + IDW grid search to rank candidates, then replays the top candidate (K=1 MVP) on the training inputs. Writes proposed outputs under $XDG_CACHE_HOME/monkey/tuner/<class>/<timestamp>/ and current-recipe outputs into a sibling current/ dir for A/B review, appends a run record per replayed output (carrying the proposed recipe_sha + original input_path), prints the monkey image rate commands, and writes <class>.next.toml with the IDW-predicted delta plus the replay-output-dir.
  • Legacy records without input_path are skipped with a rate-limited warning (first 5 named, then a summary count). Clean break, no mixed-mode fallback.
  • Missing input paths are skipped with a warning; if more than 50% of a class's inputs are gone the class aborts with a corpus-has-moved message.
  • monkey-tuner promote <class> --log <log> is the validation gate: re-reads the log, computes the measured proposed-vs-current delta over inputs rated under both recipe shas (matched by (input_path, recipe_sha)), refuses a non-positive delta unless --force, and stamps the measured delta onto the promoted recipe's leading comment. Without --log it falls back to the prior sha-pin-only behaviour.
  • New read-only monkey-tuner validate --class <name> --log <log> reports the measured delta without writing.
  • README ## Workflow gains the rate-the-replayed-outputs round-trip; ## Tuning rewritten for replay-then-rate.

Implementation notes / assumptions

The local corpus tree is gitignored and never reaches CI, so (per the issue) this PR provides the schema change and the replay scaffolding; the maintainer validates the empirical end-to-end loop locally. Two decisions were made where the spec left room, both documented in code:

  • The measured delta is computed over the input_paths rated under both the proposed and current recipe shas (the paired re-rated set), rather than a separate seeded held-out split. This realises "matched by input_path and recipe_sha" deterministically and needs no seed at promote time.
  • tune appends a run record for each replayed proposed output (input = output hash, input_path = original path, recipe_sha = proposed). This is what lets a later monkey image rate <output> join back to the proposal so promote/validate can measure it. The shortlist is fixed at K=1 (no --shortlist flag) per the MVP scope.

Testing

just check passes locally: fmt, clippy --all-targets -D warnings, build --all-targets, tests, and the builder-stage Docker (musl) compile.

New unit tests: input_path NDJSON round-trip, legacy-record skipping with rate-limited warning, missing-path replay branch + >50% abort threshold, measured-delta join (paired and unpaired), and promote non-positive-delta refusal.

Closes MK-25.

🤖 Generated with Claude Code

## What Implements MK-25: the feedback log now carries `input_path` on every `run` line, and `monkey-tuner` replaces the inverse-distance-weighted (IDW) score guess with a real replay-then-rate validation pass. Follow-up to MK-10 (which shipped the tuner with IDW scoring and flagged this exact gap in its module header). Parent: MK-5. ## Changes - `monkey image auto --record` writes `input_path` (the path the user passed, unnormalised) on each run record. The rating record is unchanged. - New `src/lib.rs` library target so the `monkey-tuner` binary can call the recipe runner in-process instead of shelling out. `src/main.rs` becomes a thin dispatcher over `monkey::*`. New `recipe::replay(input, output, steps)` applies an explicit recipe with the exact per-op dispatch `auto` uses, so a replayed output is byte-identical to a real run. No new dependency. - `monkey-tuner tune`: keeps the seeded 80/20 split + IDW grid search to rank candidates, then replays the top candidate (K=1 MVP) on the training inputs. Writes proposed outputs under `$XDG_CACHE_HOME/monkey/tuner/<class>/<timestamp>/` and current-recipe outputs into a sibling `current/` dir for A/B review, appends a `run` record per replayed output (carrying the proposed `recipe_sha` + original `input_path`), prints the `monkey image rate` commands, and writes `<class>.next.toml` with the IDW-predicted delta plus the replay-output-dir. - Legacy records without `input_path` are skipped with a rate-limited warning (first 5 named, then a summary count). Clean break, no mixed-mode fallback. - Missing input paths are skipped with a warning; if more than 50% of a class's inputs are gone the class aborts with a corpus-has-moved message. - `monkey-tuner promote <class> --log <log>` is the validation gate: re-reads the log, computes the measured proposed-vs-current delta over inputs rated under both recipe shas (matched by `(input_path, recipe_sha)`), refuses a non-positive delta unless `--force`, and stamps the measured delta onto the promoted recipe's leading comment. Without `--log` it falls back to the prior sha-pin-only behaviour. - New read-only `monkey-tuner validate --class <name> --log <log>` reports the measured delta without writing. - README `## Workflow` gains the rate-the-replayed-outputs round-trip; `## Tuning` rewritten for replay-then-rate. ## Implementation notes / assumptions The local corpus tree is gitignored and never reaches CI, so (per the issue) this PR provides the schema change and the replay scaffolding; the maintainer validates the empirical end-to-end loop locally. Two decisions were made where the spec left room, both documented in code: - The measured delta is computed over the `input_path`s rated under both the proposed and current recipe shas (the paired re-rated set), rather than a separate seeded held-out split. This realises "matched by input_path and recipe_sha" deterministically and needs no seed at promote time. - `tune` appends a `run` record for each replayed proposed output (`input` = output hash, `input_path` = original path, `recipe_sha` = proposed). This is what lets a later `monkey image rate <output>` join back to the proposal so `promote`/`validate` can measure it. The shortlist is fixed at K=1 (no `--shortlist` flag) per the MVP scope. ## Testing `just check` passes locally: fmt, clippy `--all-targets -D warnings`, build `--all-targets`, tests, and the builder-stage Docker (musl) compile. New unit tests: `input_path` NDJSON round-trip, legacy-record skipping with rate-limited warning, missing-path replay branch + >50% abort threshold, measured-delta join (paired and unpaired), and `promote` non-positive-delta refusal. Closes MK-25. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
feat(tuner): carry input_path, replace IDW guess with replay-then-rate
All checks were successful
Check / fmt + clippy + build + tests (pull_request) Successful in 40s
Create release / Create release from merged PR (pull_request) Has been skipped
a33adf0dce
Closes the gap the MK-10 tuner header flagged: the feedback log now records the input file path on every `run` line, so `monkey-tuner` can re-run the pixel pipeline on the real corpus and report a measured score delta instead of an inverse-distance-weighted extrapolation. `monkey image auto --record` writes `input_path` (the path the user passed, unnormalised) on each run record; the rating record is unchanged.

To replay in-process without shelling out, the crate gains a `src/lib.rs` library target so the `monkey-tuner` binary can call the recipe runner directly: `monkey image auto` (`src/main.rs`) is now a thin dispatcher over `monkey::*`, and `recipe::replay(input, output, steps)` applies an explicit recipe with the exact per-op dispatch `auto` uses, so a replayed output is byte-identical to a real run of the same recipe. No new dependency is added.

`monkey-tuner tune` keeps the seeded 80/20 split and the IDW grid search to rank candidates, then replays the top candidate (K=1 for the MVP) on the training inputs: it writes the proposed outputs under `$XDG_CACHE_HOME/monkey/tuner/<class>/<timestamp>/` and the current-recipe outputs into a sibling `current/` dir for A/B review, appends a `run` record per replayed output (carrying the proposed `recipe_sha` and the original `input_path`, so a later rating joins by output hash), prints the `monkey image rate` commands to score them, and writes `<class>.next.toml` with the IDW-predicted delta plus the replay-output-dir. Run records that predate the schema (no `input_path`) are skipped with a rate-limited warning (first 5 named, then a summary count); a clean break, no mixed-mode fallback. If a training input's path is gone the tuner skips it with a warning, and if more than half a class's inputs are missing it aborts that class with a corpus-has-moved message.

`monkey-tuner promote <class> --log <log>` becomes the validation gate: it re-reads the log, computes the measured proposed-vs-current delta over the inputs rated under both recipe shas (matched by `(input_path, recipe_sha)` so one input rated under several variants does not cross-pollinate), refuses a non-positive delta unless `--force`, and stamps the measured delta onto the promoted recipe's leading comment. Without `--log` it falls back to the prior sha-pin-only behaviour. New read-only `monkey-tuner validate --class <name> --log <log>` reports the same measured delta without writing or promoting.

In-tree unit tests cover the `input_path` round-trip in the NDJSON record, legacy-record skipping with the rate-limited warning, the missing-path replay branch and the >50% abort threshold, the measured-delta join (paired and unpaired), and the `promote` non-positive-delta refusal (plus the existing sha-pin paths). README `## Workflow` gains the rate-the-replayed-outputs round-trip and `## Tuning` is rewritten for replay-then-rate. `just check` passes locally for fmt, clippy --all-targets -D warnings, build --all-targets, and tests; the corpus tree is gitignored so the empirical end-to-end loop is validated by the maintainer locally (per the issue).

#MK-25 State Done

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
David merged commit 7a360ccc2c into main 2026-05-25 03:56:43 +02:00
David deleted branch feat/tuner-input-path-replay-validation-MK-25 2026-05-25 03:56:44 +02:00
Sign in to join this conversation.
No reviewers
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
pandoras-box/monkey!43
No description provided.