perf(image): parallelize per-pixel kernels with rayon (MK-3) #22

Merged

David merged 1 commit from perf/parallelize-image-kernels into main

2026-05-18 03:31:15 +02:00

David commented

2026-05-18 03:26:14 +02:00

Owner

Summary

Bring rayon-based row parallelism to every per-pixel kernel in the image module. On a 32-core dev box, monkey image smooth-median --radius 7 --iterations 4 on a 4000x3000 plasma input drops from 67.5 s to 5.0 s (a 13.5x speedup), and the output is byte-identical to pre-change main.

Shared dispatch helper in src/image/kernel.rs:

par_row_map(w, h, f) and par_row_map_chunked(w, h, chunk_rows, f) route a Fn(x, y) -> [f32; 3] + Sync closure across rayon-owned row strips. chunk_rows is configurable (defaults to 1) so callers can amortise scheduling overhead on small images.
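The dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/image/kernel.rs: the return type (`Vec<[f32; 3]>` in row-major order) is an assumption, and to stay dependency-free the sketch swaps rayon's work-stealing pool for `std::thread::scope`, spawning one thread per strip.

```rust
use std::thread;

/// Hypothetical stand-in for the PR's `par_row_map_chunked`: evaluates
/// `f(x, y)` for every pixel and returns the results in row-major order.
/// Assumption: the real helper's return type and scheduling may differ.
pub fn par_row_map_chunked<F>(w: usize, h: usize, chunk_rows: usize, f: F) -> Vec<[f32; 3]>
where
    F: Fn(usize, usize) -> [f32; 3] + Sync,
{
    let mut out = vec![[0.0f32; 3]; w * h];
    if w == 0 || h == 0 {
        return out;
    }
    let rows_per_strip = chunk_rows.max(1);
    let f = &f; // shared by reference; `F: Sync` makes `&F` sendable
    thread::scope(|scope| {
        // Each strip is a disjoint `&mut` slice of whole rows, so no locking is needed.
        for (strip_idx, strip) in out.chunks_mut(w * rows_per_strip).enumerate() {
            scope.spawn(move || {
                let y0 = strip_idx * rows_per_strip;
                for (i, px) in strip.iter_mut().enumerate() {
                    *px = f(i % w, y0 + i / w);
                }
            });
        }
    });
    out
}

/// Single-row strips, matching the "defaults to 1" behaviour described above.
pub fn par_row_map<F>(w: usize, h: usize, f: F) -> Vec<[f32; 3]>
where
    F: Fn(usize, usize) -> [f32; 3] + Sync,
{
    par_row_map_chunked(w, h, 1, f)
}
```

Because every output pixel depends only on `(x, y)` and immutable captured state, the result does not depend on strip size or scheduling order, which is what makes a byte-identical comparison against the sequential version meaningful.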

Per-pixel kernels converted to par_row_map:

kernel.rs: gaussian_blur (both passes), gaussian_blur_xy (both passes), grayscale.
filters.rs: smooth_mean_curvature, anti_alias, despeckle, sharpen_tones, stamp.
magick_ops.rs: shave, downsample_for_detection, rotate_bilinear, colors (NeuQuant lookup).

Per-pixel kernels that stream over out.data directly use par_iter_mut() zipped with input slices: local_contrast, sharp_abstract unsharp pass, constrained_sharpen, blend::boost_screen, blend::vivid_screen_blend.
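For kernels with no x/y dependency, the zipped-slice shape might look like the sketch below. It uses the textbook unsharp formula `src + amount * (src - blurred)` purely as an example; the function name, its parameters, and the formula are assumptions rather than the project's `sharp_abstract` code, and `std::thread::scope` over matching chunks stands in for rayon's `par_iter_mut()` zip.

```rust
use std::thread;

/// Hypothetical element-wise unsharp pass over flat sample buffers.
/// `strips` controls how many chunks (and threads) the work is split into.
pub fn unsharp(src: &[f32], blurred: &[f32], amount: f32, strips: usize) -> Vec<f32> {
    assert_eq!(src.len(), blurred.len());
    let mut out = vec![0.0f32; src.len()];
    let strips = strips.max(1);
    // Ceiling division; at least 1 so `chunks` never receives a zero length.
    let chunk = ((src.len() + strips - 1) / strips).max(1);
    thread::scope(|scope| {
        // The three buffers are chunked identically, so each task sees aligned slices.
        let zipped = out
            .chunks_mut(chunk)
            .zip(src.chunks(chunk))
            .zip(blurred.chunks(chunk));
        for ((dst, s_chunk), b_chunk) in zipped {
            scope.spawn(move || {
                for ((o, &s), &b) in dst.iter_mut().zip(s_chunk).zip(b_chunk) {
                    *o = s + amount * (s - b);
                }
            });
        }
    });
    out
}
```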

median_filter uses par_chunks_mut over output rows so each rayon task allocates its own Vec<f32> window scratch buffer once per row rather than once per pixel.
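A rough sketch of the per-row scratch-buffer pattern, under several assumptions: a single-channel `f32` image, a window clipped at the borders, and the upper median for even-sized windows. None of these details are confirmed by the PR text. `std::thread::scope` with one thread per row stands in for rayon's `par_chunks_mut`, which would schedule the rows on a fixed pool instead.

```rust
use std::thread;

/// Hypothetical single-channel median filter over a row-major buffer.
pub fn median_filter(src: &[f32], w: usize, h: usize, radius: usize) -> Vec<f32> {
    assert_eq!(src.len(), w * h);
    let mut out = vec![0.0f32; w * h];
    if w == 0 || h == 0 {
        return out;
    }
    let side = 2 * radius + 1;
    thread::scope(|scope| {
        for (y, row) in out.chunks_mut(w).enumerate() {
            scope.spawn(move || {
                // One scratch buffer per row task, reused for every pixel in the row.
                let mut window: Vec<f32> = Vec::with_capacity(side * side);
                for (x, px) in row.iter_mut().enumerate() {
                    window.clear();
                    // Assumption: the window is clipped to the image bounds.
                    for yy in y.saturating_sub(radius)..=(y + radius).min(h - 1) {
                        for xx in x.saturating_sub(radius)..=(x + radius).min(w - 1) {
                            window.push(src[yy * w + xx]);
                        }
                    }
                    window.sort_by(|a, b| a.total_cmp(b));
                    // Upper median when the clipped window has an even length.
                    *px = window[window.len() / 2];
                }
            });
        }
    });
    out
}
```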

contrast_stretch builds a parallel histogram via par_chunks(...).fold(...).reduce(...) then runs the channel remap with par_iter_mut().enumerate().
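The fold-then-reduce histogram can be illustrated as follows. The sketch assumes interleaved RGB `f32` samples in `[0, 1]` quantised into 256 bins, which the PR does not state; the function name is hypothetical, and scoped threads returning partial histograms stand in for rayon's `fold(...).reduce(...)`.

```rust
use std::thread;

/// Hypothetical per-channel histogram over interleaved RGB samples.
pub fn channel_histogram(data: &[f32], ch: usize) -> [u64; 256] {
    assert!(ch < 3);
    // "fold": each chunk builds a private histogram, so tasks share no mutable state.
    // The chunk length is a multiple of 3, so pixels never straddle two chunks.
    let partials: Vec<[u64; 256]> = thread::scope(|scope| {
        let handles: Vec<_> = data
            .chunks(3 * 4096)
            .map(|chunk| {
                scope.spawn(move || {
                    let mut local = [0u64; 256];
                    for px in chunk.chunks_exact(3) {
                        let v = px[ch].clamp(0.0, 1.0);
                        local[(v * 255.0).round() as usize] += 1;
                    }
                    local
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    // "reduce": merge the partial histograms bin by bin.
    let mut total = [0u64; 256];
    for part in &partials {
        for (t, p) in total.iter_mut().zip(part.iter()) {
            *t += *p;
        }
    }
    total
}
```

Integer bin counts add associatively, so the merged histogram is the same regardless of how the chunks are grouped or in what order they finish.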

detect_skew_angle evaluates the 0.5° candidate angles in parallel with par_iter().map(projection_variance).reduce(...). The reducer matches the sequential strictly-greater tie-break (prefers the lower angle on equal scores).
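The tie-break rule is what keeps a parallel reduction deterministic. A small sketch, with hypothetical names and with candidates identified by index (a lower index meaning a lower angle), compares a sequential strictly-greater scan against a pairwise reducer applied in a different order:

```rust
/// Hypothetical reducer over (candidate index, score) pairs: the higher score
/// wins, and on equal scores the lower index (the lower angle) wins.
pub fn better(a: (usize, f64), b: (usize, f64)) -> (usize, f64) {
    if b.1 > a.1 || (b.1 == a.1 && b.0 < a.0) {
        b
    } else {
        a
    }
}

/// Sequential reference: replace the best only on a strictly greater score,
/// which keeps the first (lowest-index) candidate among equal scores.
pub fn best_sequential(scores: &[f64]) -> Option<(usize, f64)> {
    let mut best: Option<(usize, f64)> = None;
    for (i, &s) in scores.iter().enumerate() {
        let replace = match best {
            None => true,
            Some((_, best_score)) => s > best_score,
        };
        if replace {
            best = Some((i, s));
        }
    }
    best
}

/// Applies `better` from the far end of the list to mimic a reduction that
/// combines candidates in a different order than the sequential scan.
pub fn best_reversed(scores: &[f64]) -> Option<(usize, f64)> {
    scores.iter().copied().enumerate().rev().reduce(better)
}
```

Without the explicit index comparison, the winner among tied scores would depend on which partial results happened to be combined first.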

diff::diff parallelizes the per-pixel comparison over the raw RGBA byte buffer.
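The shape of that comparison, one output pixel zipped with the two matching input pixels, can be shown with the sequential std equivalents of the chunk iterators. What `diff::diff` actually writes for a changed pixel is not stated in the PR; the opaque red marker, the function name, and the returned change count below are assumptions for illustration only.

```rust
/// Hypothetical RGBA diff: copies unchanged pixels and marks changed ones.
/// Returns the output buffer and the number of differing pixels.
pub fn diff_rgba(a: &[u8], b: &[u8]) -> (Vec<u8>, usize) {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len() % 4, 0);
    let mut out = vec![0u8; a.len()];
    let mut changed = 0;
    // Each step sees exactly one 4-byte pixel from each of the three buffers.
    let pixels = out
        .chunks_exact_mut(4)
        .zip(a.chunks_exact(4))
        .zip(b.chunks_exact(4));
    for ((o, pa), pb) in pixels {
        if pa == pb {
            o.copy_from_slice(pa);
        } else {
            // Assumption: changed pixels are marked opaque red.
            o.copy_from_slice(&[255, 0, 0, 255]);
            changed += 1;
        }
    }
    (out, changed)
}
```

Since every output pixel is written from its own pair of input pixels, the loop body carries over unchanged when the three `chunks_exact` iterators are swapped for their parallel counterparts; only the shared counter would need to become a per-task sum.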

Out of scope (left alone):

bilateral_filter (MK-2) already has bespoke parallel chunked dispatch with LUTs and the interior fast path.
Iteration loops stay sequential because each iteration consumes the previous one's output; only the per-pixel pass inside each iteration is parallel: smooth_bilateral, smooth_median, moire_removal, local_contrast, sharp_abstract, smooth_mean_curvature.
density (pure I/O).

No CLI surface change. No new dependency (rayon already pulled in by MK-2).

#MK-3

Test plan

just check (fmt, clippy, build, tests, docker compile check)
smooth-median --radius 7 --iterations 4 on 4000x3000: 67.5 s -> 5.0 s, byte-identical via cmp
All 44 unit tests pass (no behavioural regressions in shave/colors/colorspace/contrast_stretch/diff/etc.)


David added 1 commit

2026-05-18 03:26:14 +02:00

perf(image): parallelize per-pixel kernels with rayon

Check / fmt + clippy + build + tests (pull_request) Successful in 18s

Create release / Create release from merged PR (pull_request) Has been skipped

1b4fd1fe41

Bring rayon-based row parallelism to every per-pixel kernel in the image module. On a 32-core box, `monkey image smooth-median --radius 7 --iterations 4` on a 4000x3000 input drops from 67.5 s to 5.0 s (a 13.5x speedup), and the output is byte-identical to the pre-change implementation.

Shared dispatch helper in `src/image/kernel.rs`: `par_row_map(w, h, f)` and `par_row_map_chunked(w, h, chunk_rows, f)` route a `Fn(x, y) -> [f32; 3] + Sync` closure across rayon-owned row strips. Closures capture `&ImageBuf` and other immutable state by reference.

Per-pixel kernels converted to `par_row_map`:
- `kernel.rs`: gaussian_blur (both passes), gaussian_blur_xy (both passes), grayscale
- `filters.rs`: smooth_mean_curvature, anti_alias, despeckle, sharpen_tones, stamp
- `magick_ops.rs`: shave, downsample_for_detection, rotate_bilinear, colors (NeuQuant lookup)

Per-pixel kernels that stream over `out.data` directly (no x/y dependency) use `par_iter_mut()` zipped with the input slices: local_contrast, sharp_abstract unsharp pass, constrained_sharpen, blend::boost_screen, blend::vivid_screen_blend.

`median_filter` uses `par_chunks_mut` over output rows directly so each rayon task gets a thread-local `window: Vec<f32>` scratch buffer (per row, not per pixel).

`contrast_stretch` builds a parallel histogram via `par_chunks(3 * 4096).fold(...).reduce(...)` then runs the channel remap with `par_iter_mut().enumerate()` checking `i % 3 == ch`.

`detect_skew_angle` collects the 0.5-degree candidate angles into a Vec and evaluates them with `par_iter().map(projection_variance).reduce(...)`. The reducer matches the sequential strictly-greater tie-break (prefers the lower angle on equal scores) so the chosen rotation is order-independent.

`diff::diff` parallelizes the per-pixel comparison over the raw RGBA byte buffer with `par_chunks_exact_mut(4).zip(par_chunks_exact(4)).zip(par_chunks_exact(4))`.

Iteration loops in smooth_bilateral, smooth_median, moire_removal, local_contrast, sharp_abstract, smooth_mean_curvature remain sequential because each iteration consumes the previous iteration's output. Parallelism lives strictly inside each per-pixel pass.

`bilateral_filter` (MK-2) is left alone since it already has bespoke parallel chunked dispatch with LUTs and the interior fast path.

No CLI surface change. No new dependency (rayon is already pulled in by MK-2).

#MK-3 State Done