fix(ci): lock cargo cache mounts, cap build parallelism (FJMCP-2) #2

Merged
David merged 1 commit from fix/cargo-cache-mount-race-fjmcp-2 into main 2026-05-27 10:02:59 +02:00
Owner

Fixes the cargo cache-mount unpack race that failed the first CI run of the binary build workflows with the error "failed to open .../http-1.4.1/.cargo-ok" ("File exists (os error 17)").

Root cause

The Linux (build-binary.yml) and Windows (build-binary-windows.yml) workflows share triggers and the same runner label, so one push to main can run both concurrently on the same runner. Both Dockerfiles mount the cargo caches with --mount=type=cache,target=/usr/local/cargo/registry (and .../git) without a sharing option, so BuildKit falls back to its default, sharing=shared, and both cargo processes extract into the same registry/src. Cargo's .package-cache extraction lock lives in each container's ephemeral $CARGO_HOME (not a cached mount), so neither build can see the other's lock. The two builds race to unpack the same crate and collide on its .cargo-ok marker.
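The "File exists (os error 17)" failure is what an exclusive file create reports when the target is already present. The sketch below is only an analogy in shell, not cargo's code: write_marker is a hypothetical stand-in for an extractor writing its completion marker, and the two calls run one after the other rather than concurrently.

```shell
# Hypothetical stand-in for an extractor writing its completion marker.
# `set -C` (noclobber) makes `>` refuse to overwrite an existing file,
# which mimics an exclusive create. Runs in a subshell so the option
# does not leak into the caller.
write_marker() (
  set -C
  : > "$1/.cargo-ok"
) 2>/dev/null

# Temporary directory standing in for one unpacked crate directory.
crate_dir="$(mktemp -d)"

# First writer succeeds; second writer finds the marker already there.
write_marker "$crate_dir" && echo "build A: marker created"
write_marker "$crate_dir" || echo "build B: marker already exists"

rm -rf "$crate_dir"
```

With no lock visible to both writers, whichever build arrives second hits the failing branch.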

Changes

  • Add sharing=locked to every cargo registry/git cache mount in oci-build/Dockerfile and oci-build/Dockerfile.windows (both the dependency-prime and real-source RUNs). BuildKit then serializes concurrent builds through one warm shared cache. The two cache-using RUNs within a single Dockerfile are already sequential, so the lock adds no penalty there.
  • Cap CARGO_BUILD_JOBS at nproc/2 in both workflows, following the "Concurrency cap" rule in governance CI.md for two builds sharing a runner. Previously each build used the full host nproc, which oversubscribed the runner. Workflow comments are updated to explain why.
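The nproc/2 cap can be sketched in shell roughly as follows. This is an illustration, not the workflow's actual step: the helper name cargo_jobs_cap, the floor of one job, and the fallback CPU count used when nproc is unavailable are all assumptions.

```shell
# Hypothetical helper: half the given CPU count, never below 1.
cargo_jobs_cap() {
  cpus="$1"
  jobs=$(( cpus / 2 ))
  if [ "$jobs" -lt 1 ]; then
    jobs=1
  fi
  echo "$jobs"
}

# Fall back to 2 CPUs if nproc is missing (assumption, for portability only).
host_cpus="$(nproc 2>/dev/null || echo 2)"

CARGO_BUILD_JOBS="$(cargo_jobs_cap "$host_cpus")"
export CARGO_BUILD_JOBS
echo "CARGO_BUILD_JOBS=${CARGO_BUILD_JOBS}"
```

With two builds on one runner, each taking half the cores keeps the combined job count at or below the host's CPU count.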

Verification

  • just check-docker builds the builder stage clean with the locked mounts (verified locally).
  • The concurrent-green and no-recurrence acceptance criteria (ACs) will be validated by the next CI run on merge.
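A static check could complement the verification above by listing any cache mount that still lacks sharing=locked. This is a sketch, not part of the PR: the helper name unlocked_cache_mounts is hypothetical, and the pattern assumes each mount flag starts with --mount=type=cache and contains no whitespace, as in the mount quoted under Root cause.

```shell
# Hypothetical helper: print every cache-mount flag in a Dockerfile that
# does not carry sharing=locked, prefixed with its line number.
# Exit status is 0 if at least one unlocked mount was found, 1 otherwise.
unlocked_cache_mounts() {
  grep -no -- '--mount=type=cache[^[:space:]]*' "$1" | grep -v 'sharing=locked'
}

# Check both Dockerfiles named in this PR, skipping any that are absent.
for f in oci-build/Dockerfile oci-build/Dockerfile.windows; do
  if [ -f "$f" ] && unlocked_cache_mounts "$f"; then
    echo "unlocked cache mount(s) in $f" >&2
  fi
done
```

Extracting each flag with grep -o before filtering matters because a single RUN line can carry both the registry and the git mount, and only one of them might be locked.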

One-time runner action

The runner's buildx cache likely still holds the partially extracted http-1.4.1, with a leftover .cargo-ok, from the failed run. Clear it once (docker buildx prune --filter type=exec.cachemount on the runner) so the locked mount does not trip over the corrupted entry. From then on, sharing=locked prevents recurrence.

Cross-reference

The sibling pandoras-box/forgejo-cli uses the same cache-mount pattern and full-nproc setting and likely carries the same latent defect (out of scope here).

Resolves FJMCP-2.

fix(ci): lock cargo cache mounts, cap build parallelism (FJMCP-2)
All checks were successful
Check / fmt + clippy + build + tests (pull_request) Successful in 1m0s
Create release / Create release from merged PR (pull_request) Has been skipped
aff7a48ec2
Concurrent Linux and Windows binary builds share one runner and the same BuildKit cargo cache mounts. With BuildKit's default sharing=shared, both cargo processes extract into registry/src at once; cargo's .package-cache serialization lock lives in the ephemeral $CARGO_HOME (not a cached mount), so neither build sees the other's lock and they race unpacking the same crate, colliding on .cargo-ok ("File exists (os error 17)").

Add sharing=locked to every cargo registry/git cache mount in oci-build/Dockerfile and oci-build/Dockerfile.windows (both the dependency-prime and real-source RUNs). BuildKit then serializes concurrent builds through one warm shared cache. The two cache-using RUNs within a single Dockerfile are already sequential, so the lock adds no penalty there.

Cap CARGO_BUILD_JOBS at nproc/2 in build-binary.yml and build-binary-windows.yml per governance CI.md "Concurrency cap" for two builds sharing a runner; full host nproc on each oversubscribed the shared runner.

Note: a one-time runner buildx cache clear may be needed to evict the stale partially-extracted http-1.4.1 / .cargo-ok left by the failed run; sharing=locked prevents recurrence.

#FJMCP-2
David merged commit acd86962d4 into main 2026-05-27 10:02:59 +02:00
David deleted branch fix/cargo-cache-mount-race-fjmcp-2 2026-05-27 10:02:59 +02:00
Reference
pandoras-box/forgejo-mcp!2