fix(oci): blob cache total_size_bytes decode error broke LRU eviction (BUNYIP-41) #53

Merged
nrupard merged 2 commits from fix/bunyip-41-oci-blob-eviction-pool into main 2026-06-03 17:12:05 +02:00

Closes BUNYIP-41. The API logged "oci blob cache eviction failed" once per blob during every docker pull, and the byte cap was never enforced while pulls were active.

Actual root cause (the issue's RCA was wrong)

OciBlobCacheRepository::total_size_bytes ran SELECT SUM(size_bytes) and decoded the result as Option<i64>. Postgres returns SUM(bigint) as NUMERIC (to avoid overflow), not int8, so the row decode failed on every non-empty table:

ColumnDecode index 0: Rust type Option<i64> (INT8) is not compatible with SQL type NUMERIC

dunite-oci's blob cache calls total_size_bytes at the start of its LRU eviction pass, so eviction failed on every blob store. The failure is client-side (decoding the result), which is why Postgres logged nothing.
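To illustrate why a failing total lookup disables the byte cap entirely, here is a minimal in-memory sketch of an eviction pass that reads the total first. It is not dunite-oci's code: `Blob`, `Cache`, `evict_to_cap`, and the `total_query_broken` flag are hypothetical names, and the sizes are made up.

```rust
use std::collections::VecDeque;

/// Hypothetical cache entry: a blob digest and its size in bytes.
#[derive(Debug, Clone, PartialEq)]
struct Blob {
    digest: String,
    size_bytes: i64,
}

/// Hypothetical in-memory stand-in for the blob cache table.
/// Entries are ordered least recently used first.
struct Cache {
    entries: VecDeque<Blob>,
    /// When true, `total_size_bytes` fails, mimicking the decode error.
    total_query_broken: bool,
}

impl Cache {
    fn total_size_bytes(&self) -> Result<i64, String> {
        if self.total_query_broken {
            return Err("ColumnDecode: INT8 is not compatible with NUMERIC".to_string());
        }
        Ok(self.entries.iter().map(|b| b.size_bytes).sum())
    }

    /// One eviction pass: read the total first, then drop least recently
    /// used blobs until the total fits under `cap_bytes`.
    fn evict_to_cap(&mut self, cap_bytes: i64) -> Result<Vec<Blob>, String> {
        // If this lookup fails, the pass aborts before evicting anything.
        let mut total = self.total_size_bytes()?;
        let mut evicted = Vec::new();
        while total > cap_bytes {
            match self.entries.pop_front() {
                Some(blob) => {
                    total -= blob.size_bytes;
                    evicted.push(blob);
                }
                None => break,
            }
        }
        Ok(evicted)
    }
}

/// Three made-up blobs totalling 120 bytes.
fn sample_cache(total_query_broken: bool) -> Cache {
    let entries = vec![("sha256:a", 40), ("sha256:b", 30), ("sha256:c", 50)]
        .into_iter()
        .map(|(digest, size_bytes)| Blob { digest: digest.to_string(), size_bytes })
        .collect();
    Cache { entries, total_query_broken }
}

fn main() {
    let mut healthy = sample_cache(false);
    let evicted = healthy.evict_to_cap(60).expect("total lookup works");
    for blob in &evicted {
        println!("evicted {} ({} bytes)", blob.digest, blob.size_bytes);
    }

    let mut broken = sample_cache(true);
    let result = broken.evict_to_cap(60);
    println!("broken pass failed: {}, entries left: {}", result.is_err(), broken.entries.len());
}
```

In the broken case the pass returns early on the first `?`, so no entry is ever removed, which matches the symptom of one warning per blob store and an unenforced cap.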

The issue hypothesized PgPool exhaustion / PoolTimedOut. That was a misdiagnosis: the failures all surfaced in well under the acquire timeout (a pool timeout cannot fire sooner than that), the dev DB had ample connection headroom (16/100), and the query reached Postgres fine. I instrumented the raw sqlx::Error (the generic AppError mapping hid it) and it was a ColumnDecode, not a pool error. No pool-sizing change is needed.

Fix

One line: SELECT COALESCE(SUM(size_bytes), 0)::BIGINT, decoded as a non-null i64. Plus a DB-backed regression test (total_size_bytes_sums_without_decode_error).
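The type behaviour behind the fix can be modelled in plain Rust. This is only an analogy under stated assumptions: `i128` stands in for NUMERIC, `i64` for BIGINT, and `wide_sum` and this local `total_size_bytes` are hypothetical helpers, not the repository method. It shows why summing 64-bit values needs a wider type, that an empty input yields 0 (the COALESCE part), and that narrowing back is only valid while the total fits in an `i64` (the ::BIGINT part).

```rust
use std::convert::TryFrom;
use std::num::TryFromIntError;

/// Sum 64-bit values in a wider type, analogous to SUM(bigint) widening
/// to NUMERIC so the aggregate itself cannot overflow.
fn wide_sum(sizes: &[i64]) -> i128 {
    sizes.iter().map(|&s| i128::from(s)).sum()
}

/// Model of `COALESCE(SUM(size_bytes), 0)::BIGINT`: an empty input yields 0,
/// and narrowing back to i64 fails if the total is out of range.
fn total_size_bytes(sizes: &[i64]) -> Result<i64, TryFromIntError> {
    i64::try_from(wide_sum(sizes))
}

fn main() {
    // Made-up sizes; a realistic cache total is far below i64::MAX.
    let sizes = [10_i64, 20, 30];
    println!("total = {:?}", total_size_bytes(&sizes));

    // Two values that overflow i64 when added, but fit in the wider type.
    let huge = [i64::MAX, 1];
    println!("checked_add = {:?}", huge[0].checked_add(huge[1]));
    println!("wide sum    = {}", wide_sum(&huge));
    println!("narrowed    = {:?}", total_size_bytes(&huge));
}
```

The practical consequence is that the cast is safe for any plausible cache size, while a total beyond the 64-bit range would surface as an error rather than a silently wrong number.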

Verification

rust-builder 1.94.1 container: clippy --workspace --all-targets -D warnings clean, fmt clean, 209 lib tests pass.

Live dev stack: a cold-cache docker pull (just verify-oci) that previously logged 7-8 eviction failures now logs zero, and the cache bookkeeping populates correctly (total_size_bytes returns the summed bytes, e.g. 8 rows / 40922561 bytes).

Filed dunite follow-ups (structural / observability, separate repo)

  • PSA-39: single debounced background evictor instead of one spawned task per blob store (the per-blob spawn is wasteful regardless; not the cause of this bug).
  • PSA-40: map sqlx errors (PoolTimedOut, and more broadly the decode case here) to distinct messages so the generic catch-all stops hiding the real cause - this investigation needed source-level instrumentation precisely because of that catch-all.
  • PSA-41: debug-level blob-cache hit/miss logging (observability gap noted on this issue).
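As a rough illustration of what PSA-40 asks for, the sketch below maps error variants to distinct messages instead of one catch-all. Everything here is an assumption for illustration: `DbError` is a local stand-in whose variant names mirror `sqlx::Error::PoolTimedOut` and `sqlx::Error::ColumnDecode`, and this `AppError` and its messages are hypothetical, not the real type in either repo.

```rust
use std::fmt;

/// Hypothetical stand-in for the database error type, kept local so the
/// sketch runs without any dependencies.
#[derive(Debug)]
enum DbError {
    PoolTimedOut,
    ColumnDecode { index: String, detail: String },
    Other(String),
}

/// Hypothetical application error that keeps the cause visible.
#[derive(Debug, PartialEq)]
enum AppError {
    PoolExhausted,
    ResultDecode(String),
    Database(String),
}

impl From<DbError> for AppError {
    fn from(err: DbError) -> Self {
        match err {
            // Each cause gets its own variant instead of a shared catch-all.
            DbError::PoolTimedOut => AppError::PoolExhausted,
            DbError::ColumnDecode { index, detail } => {
                AppError::ResultDecode(format!("column {}: {}", index, detail))
            }
            DbError::Other(msg) => AppError::Database(msg),
        }
    }
}

impl fmt::Display for AppError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            AppError::PoolExhausted => write!(f, "database pool timed out"),
            AppError::ResultDecode(d) => write!(f, "failed to decode query result ({})", d),
            AppError::Database(m) => write!(f, "database error: {}", m),
        }
    }
}

fn main() {
    let errors = vec![
        DbError::PoolTimedOut,
        DbError::ColumnDecode {
            index: "0".to_string(),
            detail: "INT8 is not compatible with NUMERIC".to_string(),
        },
        DbError::Other("connection reset".to_string()),
    ];
    for err in errors {
        println!("{}", AppError::from(err));
    }
}
```

With a mapping along these lines, a decode failure and a pool timeout would read differently in the logs, so telling them apart would not require instrumenting the source.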

🤖 Generated with Claude Code

fix(oci): blob cache total_size_bytes decode error broke LRU eviction (BUNYIP-41)
All checks were successful
Check / fmt / clippy / build / test (pull_request) Successful in 1m5s
a27f51d78a
OciBlobCacheRepository::total_size_bytes ran `SELECT SUM(size_bytes)` and decoded the result as Option<i64>. Postgres returns SUM(bigint) as NUMERIC (not int8) to avoid overflow, so the decode failed on every non-empty table with "mismatched types ... INT8 is not compatible with NUMERIC". The dunite-oci blob cache calls total_size_bytes at the start of its LRU eviction pass, so eviction failed on every blob store during a pull (one "oci blob cache eviction failed" warning per blob) and the byte cap was never enforced while pulls were active.

The fix is a one-line SQL change: `SELECT COALESCE(SUM(size_bytes), 0)::BIGINT`, decoded as a non-null i64.

Root-cause note: the issue (BUNYIP-41) hypothesized PgPool exhaustion / PoolTimedOut. That was a misdiagnosis - the failures all surfaced in well under the acquire timeout, Postgres had ample connection headroom (16/100), and the query reached the database fine (it failed client-side, while decoding the result). Instrumenting the raw sqlx error revealed the real ColumnDecode cause. No pool-sizing change is needed; this is purely the SUM-type bug.

Added a DB-backed regression test (total_size_bytes_sums_without_decode_error) that would have caught it.

Verified live: a cold-cache `docker pull` (verify-oci) that previously logged 7-8 eviction failures now logs zero, and the cache bookkeeping populates correctly (total_size_bytes returns the summed bytes). clippy/fmt clean, 209 lib tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: nrupard <natrsmith11@gmail.com>
fix(download): same SUM-as-i64 decode bug in download cache; harden BUNYIP-41 test
All checks were successful
Create release / Create release from merged PR (pull_request) Has been skipped
Check / fmt / clippy / build / test (pull_request) Successful in 1m3s
02dd8a86f1
PR #53 review findings:

- download_cache.rs total_size_bytes had the identical bug the OCI fix addresses: SELECT SUM(size_bytes) decoded as Option<i64> fails because Postgres SUM(bigint) is NUMERIC, so download-cache LRU eviction would silently fail on any non-empty table (the download-vertical twin of BUNYIP-41, currently masked only because the table is empty in test/dev). Apply the same COALESCE(SUM(size_bytes),0)::BIGINT fix.
- The new oci regression test asserted an exact delta on the GLOBAL total_size_bytes sum, which is flaky because the sibling DB-gated tests run in parallel against the same table. Changed to assert the total grew by AT LEAST the inserted bytes, which still proves the query decodes (the actual bug) without depending on the absence of concurrent inserts.

Verified: clippy -D warnings clean, fmt clean, 209 lib tests pass. The OCI fix is already live-verified (cold pull: eviction failures 7-8 -> 0); the download_cache fix is identical and verified by inspection (no DB-backed test harness exists in that module; a binary-download e2e is blocked by the BUNYIP-35 no-live-release-product gap).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: nrupard <natrsmith11@gmail.com>
nrupard deleted branch fix/bunyip-41-oci-blob-eviction-pool 2026-06-03 17:12:05 +02:00
Reference
psa-systems/bunyip!53