feat(pdf): add from-images and to-images subcommands (MK-4) #23

Merged
David merged 1 commit from feat/pdf-from-to-images into main 2026-05-18 12:02:20 +02:00
Owner

Summary

Add two new pdf verbs that round-trip the image -> PDF -> image flow.

monkey pdf from-images <INPUT...> <OUTPUT.pdf> bundles N images into one multi-page PDF using lopdf + flate2. Each input becomes a DeviceRGB XObject with a one-page content stream sized to the page MediaBox. Resurrects the assemble_pdf path removed by PR #17. Flags: --dpi <N> overrides input DPI metadata; --page-size <auto|letter|a4> either derives the page size from pixel count + DPI (auto) or forces a fixed page size. Clap's num_args = 2.. requires at least one input plus the output path, so "no inputs" is rejected at parse time.
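For reviewers unfamiliar with the unit conversion, here is a minimal sketch of how a MediaBox could be derived from pixel dimensions and DPI. It is not the code in this PR: `PageSize` and `media_box` are hypothetical names, the A4 size is rounded to whole points, and `dpi` is assumed to be positive.

```rust
/// PDF user-space units per inch (1 pt = 1/72 in).
const POINTS_PER_INCH: f64 = 72.0;

/// Hypothetical mirror of the `--page-size <auto|letter|a4>` values.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PageSize {
    Auto,
    Letter,
    A4,
}

/// Returns the MediaBox (width, height) in points.
/// Assumes `dpi > 0`; a real implementation would validate that first.
pub fn media_box(size: PageSize, width_px: u32, height_px: u32, dpi: f64) -> (f64, f64) {
    match size {
        // pixels / (pixels per inch) = inches; inches * 72 = points
        PageSize::Auto => (
            f64::from(width_px) / dpi * POINTS_PER_INCH,
            f64::from(height_px) / dpi * POINTS_PER_INCH,
        ),
        // US Letter: 8.5 x 11 in
        PageSize::Letter => (612.0, 792.0),
        // A4: 210 x 297 mm, rounded to whole points
        PageSize::A4 => (595.0, 842.0),
    }
}
```

Under these assumptions a 2550 x 3300 px scan at 300 DPI maps to a 612 x 792 pt page, i.e. US Letter.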

monkey pdf to-images <INPUT.pdf> <OUTPUT_DIR> rasterizes each page by shelling out to pdftoppm from poppler-utils, mirroring how monkey video convert delegates to ffmpeg. Flags: --format <png|jpeg|tiff>, --dpi <N> (default 300), --first-page / --last-page, --basename <STEM> (default: PDF stem). Output filenames are <basename>-<N>.<ext> with pdftoppm's natural zero-padding. Missing pdftoppm produces an actionable error with the install hint, not a panic.
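To illustrate the delegation, a rough sketch of how the flags might map onto a pdftoppm invocation, and how a missing binary can be turned into a readable error. The function names, the `Format` enum, and the error wording are assumptions for illustration, not the PR's code; the pdftoppm options used (`-png`, `-jpeg`, `-tiff`, `-r`, `-f`, `-l`) are the standard poppler-utils ones.

```rust
use std::ffi::OsString;
use std::io::ErrorKind;
use std::path::Path;
use std::process::Command;

/// Hypothetical mirror of `--format <png|jpeg|tiff>`.
#[derive(Debug, Clone, Copy)]
pub enum Format {
    Png,
    Jpeg,
    Tiff,
}

impl Format {
    fn flag(self) -> &'static str {
        match self {
            Format::Png => "-png",
            Format::Jpeg => "-jpeg",
            Format::Tiff => "-tiff",
        }
    }
}

/// Builds the argument list: options first, then the PDF, then the output root
/// (`<OUTPUT_DIR>/<basename>`), to which pdftoppm appends `-<N>.<ext>`.
pub fn pdftoppm_args(
    input: &Path,
    out_dir: &Path,
    basename: &str,
    format: Format,
    dpi: u32,
    first_page: Option<u32>,
    last_page: Option<u32>,
) -> Vec<OsString> {
    let mut args: Vec<OsString> = vec![format.flag().into(), "-r".into(), dpi.to_string().into()];
    if let Some(first) = first_page {
        args.push("-f".into());
        args.push(first.to_string().into());
    }
    if let Some(last) = last_page {
        args.push("-l".into());
        args.push(last.to_string().into());
    }
    args.push(input.as_os_str().to_os_string());
    args.push(out_dir.join(basename).into_os_string());
    args
}

/// Runs pdftoppm, mapping "binary not found" to an install hint instead of a panic.
/// The message text is illustrative only.
pub fn run_pdftoppm(args: &[OsString]) -> Result<(), String> {
    match Command::new("pdftoppm").args(args).status() {
        Ok(status) if status.success() => Ok(()),
        Ok(status) => Err(format!("pdftoppm exited with {status}")),
        Err(e) if e.kind() == ErrorKind::NotFound => {
            Err("pdftoppm not found on PATH; install poppler-utils".to_string())
        }
        Err(e) => Err(format!("failed to launch pdftoppm: {e}")),
    }
}
```

Keeping argument construction separate from process spawning means the flag mapping can be unit-tested without pdftoppm installed.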

read_dpi is extracted from noteshrink to src/dpi.rs so both modules share it.

Runtime Dockerfile picks up poppler-utils alongside ffmpeg. CLAUDE.md notes pdftoppm as a runtime dep. pdf extract-images is unchanged: it solves the different problem of pulling embedded XObject streams without rasterization.

No new top-level Rust deps. 52 unit tests pass (added 8). Manual round-trip succeeds end-to-end.

#MK-4

Test plan

  • [x] just check (fmt, clippy, build, tests, docker compile check)
  • [x] monkey pdf from-images a.png b.png c.png out.pdf produces a 3-page PDF
  • [x] monkey pdf to-images in.pdf out/ writes <stem>-N.png per page (pdftoppm present)
  • [x] monkey pdf to-images --format jpeg --dpi 150 in.pdf out/ writes 150 DPI JPEGs
  • [x] monkey pdf from-images out.pdf (no inputs) errors at clap parse with exit 2
  • [x] Existing monkey pdf extract-images behaviour unchanged
  • [x] Round-trip: from-images sample.png round.pdf && to-images round.pdf out/ && image diff sample.png out/round-1.png check.png --threshold 0.05 succeeds
  • [ ] Missing pdftoppm error path (manual: requires unsetting PATH; left to follow-up smoke test)
feat(pdf): add from-images and to-images subcommands
All checks were successful
Check / fmt + clippy + build + tests (pull_request) Successful in 17s
Create release / Create release from merged PR (pull_request) Has been skipped
fd13f6330a
Add two new `pdf` verbs that round-trip the image -> PDF -> image flow without leaving the monkey CLI.

`monkey pdf from-images <INPUT...> <OUTPUT.pdf>` bundles one or more images into a multi-page PDF using `lopdf` + `flate2`. Each input becomes a `DeviceRGB` XObject with a one-page content stream sized to the page MediaBox. Resurrects the `assemble_pdf` path that was removed by PR #17. Flags: `--dpi <N>` overrides input DPI metadata; `--page-size <auto|letter|a4>` either derives the page size from pixel count + DPI (`auto`) or forces a fixed page size. Clap's `num_args = 2..` makes "no inputs" a parse error rather than a runtime one.

`monkey pdf to-images <INPUT.pdf> <OUTPUT_DIR>` rasterizes each page to an image file by shelling out to `pdftoppm` from `poppler-utils`, mirroring how `monkey video convert` delegates to `ffmpeg`. Flags: `--format <png|jpeg|tiff>` (default png), `--dpi <N>` (default 300), `--first-page` / `--last-page`, and `--basename <STEM>` (default: PDF stem). Output filenames are `<basename>-<N>.<ext>` with pdftoppm's natural zero-padding (matched to PDF page count). Missing `pdftoppm` produces an actionable error with the install hint, not a panic.
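As a sketch of the naming rule described above, the helper below predicts an output filename by padding the page number to the digit width of the page count. `expected_name` is a hypothetical helper, not part of this commit, and the padding rule is taken from the description here; the exact width may differ between poppler versions, and the extension is left to the caller.

```rust
/// Predicts `<basename>-<N>.<ext>` with `N` zero-padded to the number of
/// decimal digits in `page_count` (assumed to match pdftoppm's behaviour).
pub fn expected_name(basename: &str, page: u32, page_count: u32, ext: &str) -> String {
    // A document always has at least one page; guard against a zero count.
    let width = page_count.max(1).to_string().len();
    format!("{}-{:0width$}.{}", basename, page, ext, width = width)
}
```

For a single-page PDF this yields `round-1.png`, matching the round-trip check; for a 12-page PDF, page 3 would be `scan-03.png`.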

The `read_dpi` helper that lived in `src/noteshrink/mod.rs` is extracted to `src/dpi.rs` so both noteshrink (which preserves output DPI) and the new `pdf from-images` (which uses input DPI to derive page size) can share it.

Runtime Dockerfile installs `poppler-utils` alongside the existing `ffmpeg`. CLAUDE.md notes `pdftoppm` as a runtime dep of `pdf to-images`. The static-musl binary bundles neither; users install them themselves, same precedent as ffmpeg.

`pdf extract-images` is unchanged: it still pulls embedded XObject streams without rasterization, which is a different operation from `to-images`.

No new top-level Rust deps: `lopdf`, `flate2`, and `image` were already in `Cargo.toml`. 52 unit tests pass (added 8 for the new modules); manual round-trip `from-images sample.png round.pdf && to-images round.pdf out/ && image diff sample.png out/round-1.png` succeeds.

#MK-4 State Done
David merged commit 88a32752b4 into main 2026-05-18 12:02:20 +02:00
David deleted branch feat/pdf-from-to-images 2026-05-18 12:02:20 +02:00
Reference
pandoras-box/monkey!23