go: x/pkgsite: @master failed with "could not be found" for an hour

What is the URL of the page with the issue?

https://pkg.go.dev/cuelang.org/go@master

What is your user agent?

Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0

Screenshot

[screenshot not preserved]

(this screenshot was for https://pkg.go.dev/cuelang.org/go/cue/load@master, but the root of the module failed too)

What did you do?

Visit the page to view the docs at the latest master commit.

What did you expect to see?

It should work. I’ve used this many times in the past, and go get cuelang.org/go@master worked too.

What did you see instead?

The error above. This was today at around 10:19 London time, or 09:19 UTC. We tried multiple times and it kept failing until about 11:00 London time.

I’ve looked at the pkgsite source code, and it’s interesting to see that it fetches its own meta tag via HTTP from cuelang.org, unlike go get which fetches https://proxy.golang.org/cuelang.org/go/@v/master.info. Why is that?
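For context on the proxy endpoint mentioned above: per the GOPROXY protocol, a `.info` request returns a small JSON object with `Version` and `Time` fields. The following is a minimal sketch of building that URL and decoding such a response. The helper names `infoURL` and `parseInfo` are made up for illustration, and the sample payload is hand-written (reusing a pseudo-version quoted elsewhere in this thread), not a recorded proxy response.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Info mirrors the JSON object a module proxy serves at
// <proxy>/<module>/@v/<query>.info.
type Info struct {
	Version string    // resolved version; a pseudo-version for a branch query
	Time    time.Time // commit time
}

// infoURL builds the .info URL for a module path and version query.
// Hypothetical helper: it skips the case-escaping the protocol requires
// for upper-case letters, so it is only valid for all-lowercase paths.
func infoURL(proxy, module, query string) string {
	return proxy + "/" + module + "/@v/" + query + ".info"
}

// parseInfo decodes the body of a .info response.
func parseInfo(body []byte) (Info, error) {
	var info Info
	err := json.Unmarshal(body, &info)
	return info, err
}

func main() {
	fmt.Println(infoURL("https://proxy.golang.org", "cuelang.org/go", "master"))

	// Illustrative payload only; not captured from the real proxy.
	sample := []byte(`{"Version":"v0.6.0-0.dev.0.20230406090537-f85172a7916b","Time":"2023-04-06T09:05:37Z"}`)
	info, err := parseInfo(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(info.Version, info.Time.UTC().Format(time.RFC3339))
}
```

The point of the sketch is that a branch name such as master is resolved by the proxy into a concrete version, so a client that asks the proxy never needs to read meta tags itself.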

Something else to note here is that our meta tag page redirects; curl -v https://cuelang.org/go?go-get=1 does not show the meta tags, but curl -L -v https://cuelang.org/go?go-get=1 does. This should be fine, as both cmd/go and proxy.golang.org seem to follow redirects. I would imagine and hope that pkgsite does as well, but I would also hope that it would simply talk to proxy.golang.org instead of reimplementing the “fetch meta tags” logic.

About this issue

  • State: open
  • Created a year ago
  • Comments: 20 (18 by maintainers)

Most upvoted comments

Oof, there is a lot going on here. Jotting down notes for my own recollection; Dan and Paul: don’t feel obligated to follow along.

The previous CL fixed only one of approximately 4 “bugs”. Here’s the complete picture:

  1. pkgsite refreshes of “master” often time out when “master” is not fresh on the proxy.
  2. proxy timeouts were not retryable from the frontend <-- this is what I fixed in https://go.dev/cl/482162
  3. proxy timeouts are not retried on the fetch queue <-- this is what I am fixing in https://go.dev/cl/484736
  4. the master branch is automatically re-queued when users browse documentation@master <-- this is what I missed before
  5. GCP “de-dupes” against deleted/finished tasks for a few hours until they are garbage collected

So what happened is that master was successfully fetched yesterday at 1pm ET, and then at 6:30pm ET someone was browsing documentation@master, which caused master to be re-fetched asynchronously, and this fetch failed. At this point, master was in a broken state: (1) subsequent manual re-fetches timed out because the proxy needed to re-evaluate master (again), and (2) the fetch couldn’t be re-attempted for some amount of time due to the GCP task queue deduplication.

:face_exhaling:

To fix this properly, I think we should make proxy timeouts retryable on the queue (as done in CL 484736), and avoid updating the version map for master when a fetch fails. It is better to serve a potentially stale master version (and continue trying to fetch it in the background) than to serve a 404.

@jamalc @hyangah WDYT?

But the existence of the module in the first place should follow the same logic as cmd/go, right?

Presumably, yes. I was just explaining why the meta tags were queried.

I looked into this a bit. It looks like the problem is that the proxy timed out around 5am London time, and this timeout is not correctly treated as a retryable error: https://cs.opensource.google/go/x/pkgsite/+/master:internal/frontend/fetch.go;l=455;drc=d37447d5241999cfc88746e21d57f2fdbcef6628

There’s a missing return: the status is set to a magic value that causes re-fetch, and then subsequently overwritten to NotFound.

The reason @peterhellberg’s request went through is that the pseudoversion did not exist in the version map.

I’ll fix.

@mvdan it could just be that the cue repo is larger than the typical modules you are browsing, and so times out more frequently.

I expect that most users just browse latest, not master, so it’s not that surprising to me that you are the first to have this combination of (1) frequently browsing master, (2) large modules that time out, and (3) willingness to file an issue.

@peterhellberg mentioned that he requested v0.6.0-0.dev.0.20230406090537-f85172a7916b and that seemed to fix it - it is unclear whether the timing was a coincidence.