perf(bilateral): LUT, interior fast path, rayon, --fast (MK-2) #21

Merged
David merged 1 commit from perf/bilateral-lut into main 2026-05-18 03:13:51 +02:00
Summary

Cut monkey image smooth-bilateral wall time on a 1600x1200 plasma input (--spatial-sigma 10 --value-sigma 7 --iterations 2) from 33.6 s to ~1.0 s for the exact filter, and to ~0.1 s with the new --fast flag. Both are comfortably under the 3 s acceptance bar.

Optimizations land in this order:

  • 2D spatial-weight LUT (built once per call) replaces the inner exp(-spatial/spatial_denom).
  • 16384-bucket range-weight LUT over value_dist in [0, 3] replaces the inner exp(-value/value_denom). Bucket width is 3/16384 ≈ 0.000183, well below the 1-LSB correctness bar.
  • Interior fast path drops get_clamped and reads pixels by direct slice index for the majority of pixels.
  • Rayon row-parallelism over the output buffer (par_chunks_exact_mut).
  • New bilateral_filter_separable: horizontal then vertical 1D passes (r^2 -> 2r samples per pixel). Same LUT and parallelism.
  • New --fast CLI flag on monkey image smooth-bilateral selects the separable approximation; default stays exact. sharp_abstract keeps the exact filter.
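For readers unfamiliar with the two lookup tables, here is a minimal sketch of how they could be built and queried. It is not the code from this PR: the function names, the single `f32` distance argument, and the choice to sample each bucket at its lower edge are assumptions made for illustration. Only the bucket count (16384) and the covered range `[0, 3]` come from the description above.

```rust
/// Number of buckets in the range-weight table (value from the PR).
const RANGE_BUCKETS: usize = 16384;
/// Upper bound of `value_dist` covered by the table (value from the PR).
const RANGE_MAX: f32 = 3.0;

/// Spatial weights for a (2r+1) x (2r+1) window, stored row-major.
/// Hypothetical helper: `spatial_denom` is whatever the filter divides by.
fn build_spatial_lut(radius: usize, spatial_denom: f32) -> Vec<f32> {
    let side = 2 * radius + 1;
    let mut lut = Vec::with_capacity(side * side);
    for dy in 0..side {
        for dx in 0..side {
            let fy = dy as f32 - radius as f32;
            let fx = dx as f32 - radius as f32;
            // Same expression the inner loop used to evaluate per sample.
            lut.push((-(fx * fx + fy * fy) / spatial_denom).exp());
        }
    }
    lut
}

/// Range weights, one per bucket, sampled at the bucket's lower edge.
fn build_range_lut(value_denom: f32) -> Vec<f32> {
    let step = RANGE_MAX / RANGE_BUCKETS as f32;
    (0..RANGE_BUCKETS)
        .map(|i| (-(i as f32 * step) / value_denom).exp())
        .collect()
}

/// Look up the weight for a distance; out-of-range input uses the last bucket.
fn range_weight(lut: &[f32], value_dist: f32) -> f32 {
    let scale = RANGE_BUCKETS as f32 / RANGE_MAX;
    let idx = ((value_dist * scale) as usize).min(RANGE_BUCKETS - 1);
    lut[idx]
}
```

Both tables depend only on the sigmas and the radius, so building them once per call moves every `exp` out of the per-sample loop.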

One new dependency: rayon = "1".

#MK-2

Test plan

  • just check (fmt, clippy, build, tests, docker compile check)
  • Manual: smooth-bilateral exact on 1600x1200 plasma finishes in ~1.0 s
  • Manual: smooth-bilateral --fast on same input finishes in ~0.1 s
  • Manual: monkey image diff exact.png fast.png --threshold 0.005 succeeds
  • Manual: sharp-abstract still produces output (uses exact filter unchanged)
perf(bilateral): LUT, interior fast path, rayon, separable --fast
Some checks failed
Check / fmt + clippy + build + tests (pull_request) Failing after 17s
Create release / Create release from merged PR (pull_request) Has been skipped
78ce92d605
Optimize `bilateral_filter` and add a `bilateral_filter_separable` variant exposed via `monkey image smooth-bilateral --fast`. The exact filter goes from 33.6 s to ~1.0 s on a 1600x1200 plasma image at `--spatial-sigma 10 --value-sigma 7 --iterations 2`; `--fast` finishes in ~0.1 s.

Changes:
- Precompute a 2D spatial-weight LUT once per call so the inner `exp(-spatial/spatial_denom)` is replaced by a table lookup.
- Precompute a 16384-bucket range-weight LUT covering `value_dist` in `[0, 3]`. Bucket width is ~0.000183, comfortably below the 1-LSB-per-channel correctness bar.
- Split the per-pixel loop into interior and edge variants. The interior path drops `get_clamped` entirely and reads pixels by direct slice index.
- Parallelize the outer row loop with rayon's `par_chunks_exact_mut` on the output buffer.
- Add `bilateral_filter_separable`: horizontal then vertical 1D bilateral pass, cutting the inner loop from `r^2` to `2r` samples per pixel. Same LUT and parallelism as the exact version.
- New `--fast` flag on `monkey image smooth-bilateral` selects the separable approximation. Default is the exact filter.
- `sharp_abstract` continues to use the exact filter; no behaviour change there.
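The interior/edge split and the separable variant listed above can be illustrated together. The sketch below is an assumption-laden simplification, not the PR's `bilateral_filter_separable`: it works on a single-channel `f32` image, takes the range weight as a closure instead of a LUT, and implements the vertical pass by transposing and reusing the horizontal pass. All function names here are hypothetical.

```rust
/// One horizontal bilateral pass over a single-channel, row-major image.
/// `spatial` holds 2r+1 precomputed 1D spatial weights; `range_w` maps an
/// absolute value difference to a weight (a table lookup in the PR).
fn bilateral_pass_h(
    src: &[f32],
    width: usize,
    height: usize,
    spatial: &[f32],
    range_w: &impl Fn(f32) -> f32,
) -> Vec<f32> {
    let r = spatial.len() / 2;
    let mut dst = vec![0.0f32; src.len()];
    for y in 0..height {
        let row = &src[y * width..(y + 1) * width];
        for x in 0..width {
            let center = row[x];
            let (mut sum, mut norm) = (0.0f32, 0.0f32);
            // Interior pixels have the whole window in bounds.
            let interior = x >= r && x + r < width;
            for (k, &ws) in spatial.iter().enumerate() {
                let xi = if interior {
                    x - r + k // direct index, no clamping needed
                } else {
                    (x + k).saturating_sub(r).min(width - 1) // clamp to the row
                };
                let v = row[xi];
                let w = ws * range_w((v - center).abs());
                sum += w * v;
                norm += w;
            }
            // The centre tap always contributes, so `norm` is positive as
            // long as the weights themselves are positive at distance zero.
            dst[y * width + x] = sum / norm;
        }
    }
    dst
}

/// Swap rows and columns of a row-major image.
fn transpose(src: &[f32], width: usize, height: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; src.len()];
    for y in 0..height {
        for x in 0..width {
            out[x * height + y] = src[y * width + x];
        }
    }
    out
}

/// Horizontal pass followed by a vertical pass (done on the transpose).
fn bilateral_separable_sketch(
    src: &[f32],
    width: usize,
    height: usize,
    spatial: &[f32],
    range_w: &impl Fn(f32) -> f32,
) -> Vec<f32> {
    let h = bilateral_pass_h(src, width, height, spatial, range_w);
    let t = transpose(&h, width, height);
    let v = bilateral_pass_h(&t, height, width, spatial, range_w);
    transpose(&v, height, width)
}
```

Each output pixel touches `2 * (2r + 1)` samples here instead of the full square window, which is where the speed-up comes from. The result is an approximation of the 2D filter, which is why the default stays exact.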

Single new dependency: `rayon = "1"`.
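To show the shape of the row-parallel loop without pulling in the dependency, the sketch below uses `std::thread::scope` as a stand-in for rayon. This is a deliberate substitution for illustration: the PR itself uses rayon, and the helper name `for_each_row_parallel` and its band-splitting policy are hypothetical.

```rust
use std::thread;

/// Call `row_fn(y, row)` for every row of `dst` (row-major, `width`
/// columns), spreading contiguous bands of rows over `threads` workers.
fn for_each_row_parallel<F>(dst: &mut [f32], width: usize, threads: usize, row_fn: F)
where
    F: Fn(usize, &mut [f32]) + Sync,
{
    assert!(width > 0 && dst.len() % width == 0);
    let height = dst.len() / width;
    let threads = threads.max(1);
    // Ceiling division, with at least one row per band.
    let rows_per_band = ((height + threads - 1) / threads).max(1);
    let row_fn = &row_fn;
    thread::scope(|s| {
        // Each band is a disjoint mutable slice, so no locking is needed.
        for (band, chunk) in dst.chunks_mut(rows_per_band * width).enumerate() {
            s.spawn(move || {
                for (i, row) in chunk.chunks_exact_mut(width).enumerate() {
                    row_fn(band * rows_per_band + i, row);
                }
            });
        }
    });
}
```

With rayon, the equivalent loop is roughly `dst.par_chunks_exact_mut(width).enumerate().for_each(|(y, row)| ...)`, with work stealing in place of the fixed bands used here. Either way, the source image is only read and each worker owns its output rows, so the filter needs no synchronization inside the loop.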

#MK-2 State Done
David merged commit 24d83c4354 into main 2026-05-18 03:13:51 +02:00
David deleted branch perf/bilateral-lut 2026-05-18 03:13:51 +02:00
Reference
pandoras-box/monkey!21