fix(ci): lock cargo cache mounts, cap build parallelism (FJMCP-2) #2
Fixes the cargo cache-mount unpack race that failed the first CI run of the binary build workflows with the error: failed to open .../http-1.4.1/.cargo-ok ("File exists (os error 17)").

Root cause
The Linux (build-binary.yml) and Windows (build-binary-windows.yml) workflows share triggers and the same runner label, so one push to main can run both concurrently on the same runner. Both Dockerfiles mount the cargo caches with --mount=type=cache,target=/usr/local/cargo/registry and .../git and no sharing option, so BuildKit defaults to sharing=shared: both cargo processes extract into the same registry/src. Cargo's .package-cache extraction lock lives in each container's ephemeral $CARGO_HOME (not a cached mount), so the two builds cannot see each other's lock and race unpacking the same crate, colliding on its .cargo-ok marker.

Changes
- Add sharing=locked to every cargo registry/git cache mount in oci-build/Dockerfile and oci-build/Dockerfile.windows (both the dependency-prime and real-source RUNs). BuildKit then serializes concurrent builds through one warm shared cache. The two cache-using RUNs within a single Dockerfile are already sequential, so the lock adds no penalty there.
- Cap CARGO_BUILD_JOBS at nproc/2 in both workflows per governance CI.md "Concurrency cap" for two builds sharing a runner; full host nproc on each oversubscribed the runner. Comments updated to explain why.

Verification
just check-docker builds the builder stage clean with the locked mounts (verified locally).

One-time runner action
The runner's buildx cache likely holds the stale partially-extracted http-1.4.1 with a leftover .cargo-ok from the failed run. Clear it once (docker buildx prune --filter type=exec.cachemount on the runner) so the locked mount does not trip over the corrupted entry. sharing=locked prevents recurrence.

Cross-reference
The sibling pandoras-box/forgejo-cli uses the same cache-mount pattern and full-nproc setting and likely carries the same latent defect (out of scope here).

Resolves FJMCP-2.
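For readers who have not used the option, the mount change under "Changes" looks roughly like the fragment below. This is a minimal sketch, not the actual contents of oci-build/Dockerfile: the base image, WORKDIR, COPY, and cargo command line are assumptions for illustration. Only the registry mount is shown; the git cache mount takes the same sharing=locked option.

```dockerfile
# syntax=docker/dockerfile:1
# Assumed base image and layout, for illustration only.
FROM rust:1 AS builder
WORKDIR /src
COPY . .

# Before: no sharing option, so BuildKit uses sharing=shared and
# concurrent builds may write into the same registry/src at once.
#   RUN --mount=type=cache,target=/usr/local/cargo/registry cargo build --release

# After: sharing=locked makes a second build wait until the first
# releases the cache mount, so crate extraction is serialized.
RUN --mount=type=cache,target=/usr/local/cargo/registry,sharing=locked \
    cargo build --release
```

BuildKit accepts shared, private, and locked for this option. locked keeps one warm cache for both builds, whereas private would give a concurrent build its own separate cache.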
Concurrent Linux and Windows binary builds share one runner and the same BuildKit cargo cache mounts. With BuildKit's default sharing=shared, both cargo processes extract into registry/src at once; cargo's .package-cache serialization lock lives in the ephemeral $CARGO_HOME (not a cached mount), so neither build sees the other's lock and they race unpacking the same crate, colliding on .cargo-ok ("File exists (os error 17)").

Add sharing=locked to every cargo registry/git cache mount in oci-build/Dockerfile and oci-build/Dockerfile.windows (both the dependency-prime and real-source RUNs). BuildKit then serializes concurrent builds through one warm shared cache. The two cache-using RUNs within a single Dockerfile are already sequential, so the lock adds no penalty there.

Cap CARGO_BUILD_JOBS at nproc/2 in build-binary.yml and build-binary-windows.yml per governance CI.md "Concurrency cap" for two builds sharing a runner; full host nproc on each oversubscribed the shared runner.

Note: a one-time runner buildx cache clear may be needed to evict the stale partially-extracted http-1.4.1 / .cargo-ok left by the failed run; sharing=locked prevents recurrence.

#FJMCP-2
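The nproc/2 cap above can be sketched in shell as follows. How the workflows actually compute and pass the value is not shown here, so the helper name half_nproc and the floor of 1 job are assumptions added for illustration (the floor guards against a single-core host, where integer division would give 0).

```shell
# Hypothetical helper: halve a CPU count, never going below 1.
# With no argument it uses the host count reported by nproc.
half_nproc() {
  total="${1:-$(nproc)}"
  jobs=$(( total / 2 ))
  if [ "$jobs" -lt 1 ]; then
    jobs=1
  fi
  echo "$jobs"
}

# Cargo reads CARGO_BUILD_JOBS as the default for build.jobs.
CARGO_BUILD_JOBS="$(half_nproc)"
export CARGO_BUILD_JOBS
echo "CARGO_BUILD_JOBS=$CARGO_BUILD_JOBS"
```

With two builds sharing a runner, each taking half the cores keeps the combined job count at or below the host's core count.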