go: x/net/http2: server sends RST_STREAM w/ PROTOCOL_ERROR to clients it incorrectly believes have violated max advertised num streams

What version of Go are you using (go version)?

$ go version
go version go1.15.0 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What’s the issue

I’m seeing an issue with a grpc-go client talking to a http2.Server{}. I’ve documented the issue https://github.com/grpc/grpc-go/issues/4057. To summarize the issue:

  • I have a grpc-go client that is making requests to an http2.Servr{} on the same machine. The client is generating a lot of load on the server. The server is configured to allow a maximum of 320 requests on a single connection.
  • I’m seeing sporadic, and relatively regular, RST_STREAM w/ PROTOCOL_ERROR frames from the server.
  • The client usually has upwards of 500 simultaneous requests open at a time, which means there are 320 concurrent streams and ~200 streams waiting for these streams to complete. The machine that hosts the server is under heavy load.

I put some println statements in the client and server code and isolated the error that the server is returning to be from this line. This code checks the sc.curClientStreams and rejects the new stream if it believes the client is violating the advertised max num of concurrent streams. I added further logging to the server and the client to record the number of active streams and found that this number never creeps above 320 on the client or the server - the grpc-go client code believes it is being well behaved, which indicated to me something is fishy in the http2.Server code.

The only place in the http2.Server code that decremtnts sc.curClientStreams is here in sc.closeStream which is called from sc.wroteFrame, and sc.wroteFrame is only ever called in the sc.serve loop. Writes to the client can be triggered asynchronously by sc.writeFrameAsync (i.e. see here), and then once the write is complete the result is sent to the sc.wroteFrameCh which is read by the sc.serve loop (i.e. see here). The sc.serve loop contains a big select statement that reads from the sc.wroteFrameCh, but also the sc.readFrameCh which receives new frames from clients. Since these channels are read from the same select statement, if both contain a message then there is no guarantee which one will be read from first.

I believe what is happening to cause the spurious PROTOCOL_ERRORs is:

  • The server is maxed out with 320 concurrent streams.
  • One of the streams on the server finishes. This triggers a call to sc.writeFrameAsync to be spawned in a separate goroutine that writes a frame f1 that ends the stream.
  • sc.writeFrameAsync writes the frame f1 to the TCP connection and adds a message to the sc.wroteFrameCh.
  • The client reads frame f1 immediately, sees that the stream has completed, and therefore starts a new stream to the server by sending a frame f2.
  • The sc.readFrames loop reads frame f2 and pushes it onto the sc.readFrameCh channel.
  • Now, the sc.serve loop goes through another iteration, and since both sc.readFrameCh and sc.writeFrameAsync contain messages, it’s undefined which one it processes first. In this case, it processes the message on sc.readFrameCh first, which contains frame f2.
  • Since the message in the sc.wroteFrameCh has not been processed yet, the sc.curClientStreams variable has not yet been decremented and thus the server blieves that sc.curClientStreams is still set to 320. When sc.processFrameFromReader is called it will return a PROTOCOL_ERROR.

To attempt to fix this, I patched the http2.Server code:

diff --git a/vendor/golang.org/x/net/http2/server.go b/vendor/golang.org/x/net/http2/server.go
index 686501b4..a79fed98 100644
--- a/vendor/golang.org/x/net/http2/server.go
+++ b/vendor/golang.org/x/net/http2/server.go
@@ -868,6 +868,13 @@ func (sc *serverConn) serve() {
                case res := <-sc.wroteFrameCh:
                        sc.wroteFrame(res)
                case res := <-sc.readFrameCh:
+                       // Process any wrote frames before reading new frames from the client since a
+                       // wrote frame could have triggered a new stream to be started.
+                       select {
+                       case wroteRes := <-sc.wroteFrameCh:
+                               sc.wroteFrame(wroteRes)
+                       default:
+                       }
                        if !sc.processFrameFromReader(res) {
                                return
                        }

When I ran this on the server it significantly reduced the occurrences of the PRTOCOL_ERRORs I saw. I was seeing 20 every 3 minutes before the patch, and after I observed about 10 over the course of 2 hours.

I believe that this patch solves some of the problem, but not all of it, as the error still persists. Also, I may be totally wrong in my diagnosis above. Does the above reasoning make sense as a possible failure?

I’m going to try and get a minimal repro working, but wanted to create this issue first to get folks thoughts.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 17 (5 by maintainers)

Commits related to this issue

Most upvoted comments

@gopherbot please open backport issues for 1.16 and 1.17.

Change https://golang.org/cl/330909 mentions this issue: http2: complete outstanding async write before reading the next frame

I understand the failure. The trick is coming up with a test case which may not be so simple.