go: x/pkgsite: @master failed with "could not be found" for an hour
What is the URL of the page with the issue?
https://pkg.go.dev/cuelang.org/go@master
What is your user agent?
Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0
Screenshot

(this screenshot was for https://pkg.go.dev/cuelang.org/go/cue/load@master, but the root of the module failed too)
What did you do?
Visit the page to view the docs at the latest master commit.
What did you expect to see?
It should work. I’ve used this many times in the past, and go get cuelang.org/go@master worked too.
What did you see instead?
The error above. This was today at around 10:19 London time, or 09:19 UTC time. We tried multiple times and it kept failing until about 11:00 London time.
I’ve looked at the pkgsite source code, and it’s interesting to see that it fetches its own meta tag via HTTP from cuelang.org, unlike go get which fetches https://proxy.golang.org/cuelang.org/go/@v/master.info. Why is that?
Something else to note here is that our meta tag page redirects; curl -v https://cuelang.org/go?go-get=1 does not show the meta tags, but curl -L -v https://cuelang.org/go?go-get=1 does. This should be fine, as both cmd/go and proxy.golang.org seem to follow redirects. I would imagine and hope that pkgsite does as well, but I would also hope that it would simply talk to proxy.golang.org instead of reimplementing the “fetch meta tags” logic.
About this issue
- State: open
- Created a year ago
- Comments: 20 (18 by maintainers)
Commits related to this issue
- internal/frontend: add missing early return to retry fetch errors A missing early return was causing retryable fetch errors not to be retried by the frontend. I spent a little while trying to test t... — committed to golang/pkgsite by findleyr a year ago
- internal/worker: proxy timeouts should be retryable When fetching e.g. the master branch, it is common for the proxy client to timeout as it waits for the branch to be resolved. This should be a ret... — committed to golang/pkgsite by findleyr a year ago
Oof, there is a lot going on here. Jotting down notes for my own recollection; Dan and Paul: don’t feel obligated to follow along.
The previous CL fixed only one of approximately 4 “bugs”. Here’s the complete picture:
So what happened is that master was successfully fetched yesterday at 1pm ET, and then at 6:30pm ET someone was browsing documentation at @master, which caused master to be re-fetched asynchronously, and this fetch failed. At this point, master was in a broken state, and (1) subsequent manual re-fetches timed out because the proxy needed to re-evaluate master (again), and then (2) the fetch couldn't be re-attempted for some amount of time due to the GCP task queue deduplication.
:face_exhaling:
To fix this properly, I think we should make proxy timeouts retryable on the queue (as done in CL 484736), and avoid updating the version map for master when a fetch fails. It is better to serve a potentially stale master version (and continue trying to fetch it in the background) than to serve a 404.
@jamalc @hyangah WDYT?
Presumably, yes. I was just explaining why the meta tags were queried.
I looked into this a bit. It looks like the problem is that the proxy timed out around 5am London time, and this timeout is not correctly treated as a retryable error: https://cs.opensource.google/go/x/pkgsite/+/master:internal/frontend/fetch.go;l=455;drc=d37447d5241999cfc88746e21d57f2fdbcef6628
There’s a missing return: the status is set to a magic value that causes re-fetch, and then subsequently overwritten to NotFound.
The reason @peterhellberg’s request went through is that the pseudoversion did not exist in the version map.
I’ll fix.
@mvdan it could just be that the cue repo is larger than the typical modules you are browsing, and so times out more frequently.
I expect that most users just browse latest, not master, so it’s not that surprising to me that you are the first to have this combination of (1) frequently browsing master, (2) large modules that time out, and (3) willingness to file an issue.
@peterhellberg mentioned that he requested v0.6.0-0.dev.0.20230406090537-f85172a7916b and that seemed to fix it - unclear if it was a coincidence in terms of timing or not.