azure-sdk-for-go: Outbound calls timing out with `context deadline exceeded` when TCP connection is been drop

Bug Report

Opening this for visibility in case other folks are having this issue !

TLDR : https://github.com/golang/go/issues/59690

  • What happened?

When using HTTP/2, which is defaulted everywhere now, if the keep-alive connection is been dropped somehow, the sdk clients cannot recover causing some timeout context deadline exceeded until the connection been closed by the host.

  • What did you expect or want to happen ?

When using a HTTP/2 server, CosmosDB for example, the SDK transport should PING the server to identify is the connection is not stale.

  • How can we reproduce it?

I’m able to reproduce this using CosmosDB sample code such as :

package main

import (
	"context"
	"crypto/tls"
	"log"
	"net"
	"net/http"
	"time"

	logger "github.com/Azure/azure-sdk-for-go/sdk/azcore/log"
	"github.com/Azure/azure-sdk-for-go/sdk/azcore/policy"
	"github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos"
)

const (
	cosmosDbEndpoint = "https://julienstestcosmos.documents.azure.com"
	cosmosDbKey      = "REDACTED"
	dbName           = "julienstest"
	containerName    = "Subscriptions"
	pk               = "c88b61d7-85aa-415f-8eda-4e73e5e811d5"
	id               = "c88b61d7-85aa-415f-8eda-4e73e5e811d5"
)

func main() {
	ctx := context.Background()
	opt := &azcosmos.ClientOptions{ClientOptions: policy.ClientOptions{
		// we call cosmosDB more frequent and CosmosDB's latency is much less than ARM
		// so we use difference retry policy than ARM
		Retry: policy.RetryOptions{
			// Create/Update/Delete option will get ItemAlreadyExists error in second retry if first one already arrive server.
			StatusCodes: []int{
				// azcore defaults, see RetryOptions doc
				http.StatusRequestTimeout,      // 408
				http.StatusTooManyRequests,     // 429
				http.StatusInternalServerError, // 500
				http.StatusBadGateway,          // 502
				http.StatusServiceUnavailable,  // 503
				http.StatusGatewayTimeout,      // 504
				// adding 403 to work around random 403 from cosmosdb
				// see https://github.com/Azure/azure-sdk-for-go/issues/19785
				http.StatusForbidden, // 403
			},
			MaxRetries:    3,
			TryTimeout:    5 * time.Second, // The maximum latency in server side in slow EUAP regions are 2-3 seconds today
			RetryDelay:    1 * time.Second,
			MaxRetryDelay: 2 * time.Second, // Delay increases exponentially, delay from 1 second, 2 time will be 2 seconds
		},
	}}
	logger.SetListener(func(cls logger.Event, msg string) {
		log.Printf("%s: %s\n", cls, msg)
	})

	creds, err := azcosmos.NewKeyCredential(cosmosDbKey)
	if err != nil {
		log.Fatalf("failed to create credential: %v", err)
	}
	client, err := azcosmos.NewClientWithKey(cosmosDbEndpoint, creds, opt)
	if err != nil {
		log.Fatalf("failed to create NewClientWithKey: %v", err)
	}
	container, err := client.NewContainer(dbName, containerName)
	if err != nil {
		log.Fatalf("failed to create NewContainer: %v", err)
	}
	itemResponse, err := container.ReadItem(ctx, azcosmos.NewPartitionKeyString(id), id, nil)
	if err != nil {
		log.Fatalf("failed to ReadItem: %v", err)
	}
	// log.Print(itemResponse.ActivityID)

	// Set up an infinite loop to place requests from
	for {
		_, err := container.ReadItem(ctx, azcosmos.NewPartitionKeyString(id), id, nil)
		if err != nil {
			log.Printf("Timed out with error: %v", err)
		} else {
			log.Printf("Obtained response with activity ID: %s", itemResponse.ActivityID)
		}
		time.Sleep(5 * time.Second)
	}
}

Then run go run main.go

Check for the port been used to talk to cosmos netstat -anp | grep main

Drop the connection sudo iptables -I INPUT -p tcp --dport 54420 -j DROP where 54420 is the port returned by netstat`

This is what you should see

2023/08/10 04:48:48 Obtained response with activity ID: ffbe9497-7be3-4595-a17b-4d46667e0ab5
2023/08/10 04:48:50 Obtained response with activity ID: ffbe9497-7be3-4595-a17b-4d46667e0ab5
2023/08/10 04:49:17 Timed out with error: Get "https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5": context deadline exceeded
2023/08/10 04:49:45 Timed out with error: Get "https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5": context deadline exceeded
2023/08/10 04:50:12 Timed out with error: Get "https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5": context deadline exceeded
2023/08/10 04:50:39 Timed out with error: Get "https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5": context deadline exceeded
2023/08/10 04:51:06 Timed out with error: Get "https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5": context deadline exceeded

The connection should recover at some point (based on the host that will kill the keep-alive connection at some point, might take up to 15minutes)

After setting the HTTP/2 transport properly as following :

	http2Transport, err := http2.ConfigureTransports(tr2)
	if err == nil {
		log.Println("HTTP/2 transport configured")
		// ReadIdleTimeout is the timeout after which a health check using ping
		// frame will be carried out if no frame is received on the connection.
		// Note that a ping response will is considered a received frame, so if
		// there is no other traffic on the connection, the health check will
		// be performed every ReadIdleTimeout interval.
		// If zero, no health check is performed.
		http2Transport.ReadIdleTimeout = time.Second * 10
		// PingTimeout is the timeout after which the connection will be closed
		// if a response to Ping is not received.
		// Defaults to 15s.
		http2Transport.PingTimeout = time.Second * 2
		// No error so we can use the new transport
		tr = tr2
	} else {
		log.Printf("HTTP/2 transport configuration failed: %v", err)
	}

Then running it with export GODEBUG=http2debug=2 & go run main.go

```bash
2023/08/10 05:10:39 Retry: =====> Try=1
2023/08/10 05:10:39 Request: ==> OUTGOING REQUEST (Try=1)
   GET https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5
   Authorization: REDACTED
   User-Agent: azsdk-go-azcosmos/v0.3.5 (go1.20.7; linux)
   X-Ms-Cosmos-Sdk-Supportedcapabilities: REDACTED
   X-Ms-Date: REDACTED
   X-Ms-Documentdb-Partitionkey: REDACTED
   X-Ms-Version: REDACTED

2023/08/10 05:10:39 http2: Transport encoding header ":authority" = "julienstestcosmos.documents.azure.com"
2023/08/10 05:10:39 http2: Transport encoding header ":method" = "GET"
2023/08/10 05:10:39 http2: Transport encoding header ":path" = "/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5"
2023/08/10 05:10:39 http2: Transport encoding header ":scheme" = "https"
2023/08/10 05:10:39 http2: Transport encoding header "x-ms-date" = "Thu, 10 Aug 2023 05:10:39 GMT"
2023/08/10 05:10:39 http2: Transport encoding header "x-ms-version" = "2020-11-05"
2023/08/10 05:10:39 http2: Transport encoding header "x-ms-cosmos-sdk-supportedcapabilities" = "1"
2023/08/10 05:10:39 http2: Transport encoding header "user-agent" = "azsdk-go-azcosmos/v0.3.5 (go1.20.7; linux)"
2023/08/10 05:10:39 http2: Transport encoding header "x-ms-documentdb-partitionkey" = "[\"c88b61d7-85aa-415f-8eda-4e73e5e811d5\"]"
2023/08/10 05:10:39 http2: Transport encoding header "authorization" = "type%3Dmaster%26ver%3D1.0%26sig%3D6kKN4RGlxyMfV3Q%2FA6avRQSxTzW9j3ggfGLf5Yzpmz0%3D"
2023/08/10 05:10:39 http2: Transport encoding header "accept-encoding" = "gzip"
2023/08/10 05:10:39 http2: Framer 0xc00019d420: wrote HEADERS flags=END_STREAM|END_HEADERS stream=13 len=100
2023/08/10 05:10:44 RoundTrip failure: context deadline exceeded
2023/08/10 05:10:44 Response: ==> REQUEST/RESPONSE (Try=1/5.002407971s, OpTime=5.002449671s) -- REQUEST ERROR
   GET https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5
   Authorization: REDACTED
   User-Agent: azsdk-go-azcosmos/v0.3.5 (go1.20.7; linux)
   X-Ms-Cosmos-Sdk-Supportedcapabilities: REDACTED
   X-Ms-Date: REDACTED
   X-Ms-Documentdb-Partitionkey: REDACTED
   X-Ms-Version: REDACTED
   --------------------------------------------------------------------------------
   ERROR:
Get "https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5": context deadline exceeded
github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.(*logPolicy).Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/runtime/policy_logging.go:140
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:84
github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*sharedKeyCredPolicy).Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v0.3.5/shared_key_credential.go:120
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:84
github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.(*retryPolicy).Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/runtime/policy_retry.go:134
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:84
github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*headerPolicies).Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v0.3.5/cosmos_headers_policy.go:51
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:84
github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.telemetryPolicy.Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/runtime/policy_telemetry.go:66
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:84
github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.includeResponsePolicy()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/runtime/policy_include_response.go:19
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.PolicyFunc.Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:181
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:84
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.Pipeline.Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/pipeline.go:96
github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*Client).executeAndEnsureSuccessResponse()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v0.3.5/cosmos_client.go:428
github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*Client).sendGetRequest()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v0.3.5/cosmos_client.go:310
github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*ContainerClient).ReadItem()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v0.3.5/cosmos_container.go:348
main.main()
	/home/juliens/go/src/batman/h2/main.go:116
runtime.main()
	/snap/go/current/src/runtime/proc.go:250
runtime.goexit()
	/snap/go/current/src/runtime/asm_amd64.s:1598

2023/08/10 05:10:44 Retry: error Get "https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5": context deadline exceeded
2023/08/10 05:10:44 Retry: End Try #1, Delay=973.05341ms
2023/08/10 05:10:44 http2: Transport sending health check
2023/08/10 05:10:44 http2: Framer 0xc00019d420: wrote RST_STREAM stream=13 len=4 ErrCode=CANCEL
2023/08/10 05:10:44 http2: Framer 0xc00019d420: wrote PING len=8 ping="+\x1a\x9c\xf3\xfc\xae{W"
2023/08/10 05:10:45 Retry: =====> Try=2
2023/08/10 05:10:45 Request: ==> OUTGOING REQUEST (Try=2)
   GET https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5
   Authorization: REDACTED
   User-Agent: azsdk-go-azcosmos/v0.3.5 (go1.20.7; linux)
   X-Ms-Cosmos-Sdk-Supportedcapabilities: REDACTED
   X-Ms-Date: REDACTED
   X-Ms-Documentdb-Partitionkey: REDACTED
   X-Ms-Version: REDACTED

2023/08/10 05:10:45 http2: Transport encoding header ":authority" = "julienstestcosmos.documents.azure.com"
2023/08/10 05:10:45 http2: Transport encoding header ":method" = "GET"
2023/08/10 05:10:45 http2: Transport encoding header ":path" = "/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5"
2023/08/10 05:10:45 http2: Transport encoding header ":scheme" = "https"
2023/08/10 05:10:45 http2: Transport encoding header "x-ms-date" = "Thu, 10 Aug 2023 05:10:39 GMT"
2023/08/10 05:10:45 http2: Transport encoding header "authorization" = "type%3Dmaster%26ver%3D1.0%26sig%3D6kKN4RGlxyMfV3Q%2FA6avRQSxTzW9j3ggfGLf5Yzpmz0%3D"
2023/08/10 05:10:45 http2: Transport encoding header "x-ms-version" = "2020-11-05"
2023/08/10 05:10:45 http2: Transport encoding header "x-ms-cosmos-sdk-supportedcapabilities" = "1"
2023/08/10 05:10:45 http2: Transport encoding header "user-agent" = "azsdk-go-azcosmos/v0.3.5 (go1.20.7; linux)"
2023/08/10 05:10:45 http2: Transport encoding header "x-ms-documentdb-partitionkey" = "[\"c88b61d7-85aa-415f-8eda-4e73e5e811d5\"]"
2023/08/10 05:10:45 http2: Transport encoding header "accept-encoding" = "gzip"
2023/08/10 05:10:45 http2: Framer 0xc00019d420: wrote HEADERS flags=END_STREAM|END_HEADERS stream=15 len=11
2023/08/10 05:10:46 http2: Transport health check failure: context deadline exceeded
2023/08/10 05:10:46 http2: Framer 0xc00019d420: wrote RST_STREAM stream=15 len=4 ErrCode=CANCEL
2023/08/10 05:10:46 RoundTrip failure: http2: client connection lost
2023/08/10 05:10:46 http2: Transport readFrame error on conn 0xc000004000: (*net.OpError) read tcp 10.0.0.4:35104->40.78.226.1:443: use of closed network connection
2023/08/10 05:10:46 Response: ==> REQUEST/RESPONSE (Try=2/1.02727298s, OpTime=7.003358017s) -- REQUEST ERROR
   GET https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5
   Authorization: REDACTED
   User-Agent: azsdk-go-azcosmos/v0.3.5 (go1.20.7; linux)
   X-Ms-Cosmos-Sdk-Supportedcapabilities: REDACTED
   X-Ms-Date: REDACTED
   X-Ms-Documentdb-Partitionkey: REDACTED
   X-Ms-Version: REDACTED
   --------------------------------------------------------------------------------
   ERROR:
Get "https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5": http2: client connection lost
github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.(*logPolicy).Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/runtime/policy_logging.go:140
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:84
github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*sharedKeyCredPolicy).Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v0.3.5/shared_key_credential.go:120
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:84
github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.(*retryPolicy).Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/runtime/policy_retry.go:134
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:84
github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*headerPolicies).Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v0.3.5/cosmos_headers_policy.go:51
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:84
github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.telemetryPolicy.Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/runtime/policy_telemetry.go:66
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:84
github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.includeResponsePolicy()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/runtime/policy_include_response.go:19
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.PolicyFunc.Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:181
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:84
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.Pipeline.Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/pipeline.go:96
github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*Client).executeAndEnsureSuccessResponse()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v0.3.5/cosmos_client.go:428
github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*Client).sendGetRequest()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v0.3.5/cosmos_client.go:310
github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*ContainerClient).ReadItem()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v0.3.5/cosmos_container.go:348
main.main()
	/home/juliens/go/src/batman/h2/main.go:116
runtime.main()
	/snap/go/current/src/runtime/proc.go:250
runtime.goexit()
	/snap/go/current/src/runtime/asm_amd64.s:1598

2023/08/10 05:10:46 Retry: error Get "https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5": http2: client connection lost
2023/08/10 05:10:46 Retry: End Try #2, Delay=2s
2023/08/10 05:10:48 Retry: =====> Try=3
2023/08/10 05:10:48 Request: ==> OUTGOING REQUEST (Try=3)
   GET https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5
   Authorization: REDACTED
   User-Agent: azsdk-go-azcosmos/v0.3.5 (go1.20.7; linux)
   X-Ms-Cosmos-Sdk-Supportedcapabilities: REDACTED
   X-Ms-Date: REDACTED
   X-Ms-Documentdb-Partitionkey: REDACTED
   X-Ms-Version: REDACTED

2023/08/10 05:10:48 http2: Transport failed to get client conn for julienstestcosmos.documents.azure.com:443: http2: no cached connection was available
2023/08/10 05:10:48 http2: Transport failed to get client conn for julienstestcosmos.documents.azure.com:443: http2: no cached connection was available
2023/08/10 05:10:48 http2: Transport creating client conn 0xc000004a80 to 40.78.226.1:443
2023/08/10 05:10:48 http2: Framer 0xc00019d6c0: wrote SETTINGS len=18, settings: ENABLE_PUSH=0, INITIAL_WINDOW_SIZE=4194304, MAX_HEADER_LIST_SIZE=10485760
2023/08/10 05:10:48 http2: Framer 0xc00019d6c0: wrote WINDOW_UPDATE len=4 (conn) incr=1073741824
2023/08/10 05:10:48 http2: Transport encoding header ":authority" = "julienstestcosmos.documents.azure.com"
2023/08/10 05:10:48 http2: Transport encoding header ":method" = "GET"
2023/08/10 05:10:48 http2: Transport encoding header ":path" = "/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5"
2023/08/10 05:10:48 http2: Transport encoding header ":scheme" = "https"
2023/08/10 05:10:48 http2: Transport encoding header "x-ms-cosmos-sdk-supportedcapabilities" = "1"
2023/08/10 05:10:48 http2: Transport encoding header "user-agent" = "azsdk-go-azcosmos/v0.3.5 (go1.20.7; linux)"
2023/08/10 05:10:48 http2: Transport encoding header "x-ms-documentdb-partitionkey" = "[\"c88b61d7-85aa-415f-8eda-4e73e5e811d5\"]"
2023/08/10 05:10:48 http2: Transport encoding header "authorization" = "type%3Dmaster%26ver%3D1.0%26sig%3D6kKN4RGlxyMfV3Q%2FA6avRQSxTzW9j3ggfGLf5Yzpmz0%3D"
2023/08/10 05:10:48 http2: Transport encoding header "x-ms-date" = "Thu, 10 Aug 2023 05:10:39 GMT"
2023/08/10 05:10:48 http2: Transport encoding header "x-ms-version" = "2020-11-05"
2023/08/10 05:10:48 http2: Transport encoding header "accept-encoding" = "gzip"
2023/08/10 05:10:48 http2: Framer 0xc00019d6c0: wrote HEADERS flags=END_STREAM|END_HEADERS stream=1 len=328
2023/08/10 05:10:48 http2: Framer 0xc00019d6c0: read SETTINGS len=18, settings: MAX_CONCURRENT_STREAMS=100, INITIAL_WINDOW_SIZE=98304, MAX_HEADER_LIST_SIZE=102400
2023/08/10 05:10:48 http2: Transport received SETTINGS len=18, settings: MAX_CONCURRENT_STREAMS=100, INITIAL_WINDOW_SIZE=98304, MAX_HEADER_LIST_SIZE=102400
2023/08/10 05:10:48 http2: Framer 0xc00019d6c0: wrote SETTINGS flags=ACK len=0
2023/08/10 05:10:48 http2: Framer 0xc00019d6c0: read WINDOW_UPDATE len=4 (conn) incr=65537
2023/08/10 05:10:48 http2: Transport received WINDOW_UPDATE len=4 (conn) incr=65537
2023/08/10 05:10:48 http2: Framer 0xc00019d6c0: read SETTINGS flags=ACK len=0
2023/08/10 05:10:48 http2: Transport received SETTINGS flags=ACK len=0
2023/08/10 05:10:48 http2: Framer 0xc00019d6c0: read HEADERS flags=END_HEADERS stream=1 len=1179
2023/08/10 05:10:48 http2: decoded hpack field header field ":status" = "200"
2023/08/10 05:10:48 http2: decoded hpack field header field "date" = "Thu, 10 Aug 2023 05:10:48 GMT"
2023/08/10 05:10:48 http2: decoded hpack field header field "content-length" = "354"
2023/08/10 05:10:48 http2: decoded hpack field header field "content-type" = "application/json"
2023/08/10 05:10:48 http2: decoded hpack field header field "server" = "Compute"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-gatewayversion" = "2.0.0"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-activity-id" = "b41d7a50-310e-429d-9239-0391396271ee"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-last-state-change-utc" = "Sat, 05 Aug 2023 10:33:26.920 GMT"
2023/08/10 05:10:48 http2: decoded hpack field header field "etag" = "\"0d02a39c-0000-0100-0000-60cc12330000\""
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-resource-quota" = "documentSize=51200;documentsSize=52428800;documentsCount=-1;collectionSize=52428800;"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-resource-usage" = "documentSize=1472;documentsSize=1128749;documentsCount=1439661;collectionSize=1508347;"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-schemaversion" = "1.16"
2023/08/10 05:10:48 http2: decoded hpack field header field "lsn" = "31645147"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-request-charge" = "1"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-alt-content-path" = "dbs/julienstest/colls/Subscriptions"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-content-path" = "39cKANngRhg="
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-quorum-acked-lsn" = "31645147"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-current-write-quorum" = "3"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-current-replica-set-size" = "4"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-documentdb-partitionkeyrangeid" = "0"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-xp-role" = "1"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-global-committed-lsn" = "31645147"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-number-of-read-regions" = "0"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-item-lsn" = "13"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-transport-request-id" = "108109"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-cosmos-llsn" = "31645147"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-cosmos-quorum-acked-llsn" = "31645147"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-cosmos-item-llsn" = "13"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-session-token" = "0:-1#31645147"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-request-duration-ms" = "0.332"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-serviceversion" = " version=2.14.0.0"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-cosmos-physical-partition-id" = "86290bdc-505e-40b0-895a-ad50cec73589"
2023/08/10 05:10:48 http2: Transport received HEADERS flags=END_HEADERS stream=1 len=1179
2023/08/10 05:10:48 http2: Framer 0xc00019d6c0: read DATA stream=1 len=354 data="{\"_attachments\":\"attachments\\/\",\"_etag\":\"\\\"0d02a39c-0000-0100-0000-60cc12330000\\\"\",\"_rid\":\"39cKANngRhgBAAAAAAAAAA==\",\"_self\":\"dbs\\/39cKAA==\\/colls\\/39cKANngRhg=\\/docs\\/39cKANngRhgBAAAAAAAAAA==\\/\",\"id\":\"c88b61d7-85aa-415f-8eda-4e73e5e811d5\",\"subscription\":{" (98 bytes omitted)
2023/08/10 05:10:48 http2: Transport received DATA stream=1 len=354 data="{\"_attachments\":\"attachments\\/\",\"_etag\":\"\\\"0d02a39c-0000-0100-0000-60cc12330000\\\"\",\"_rid\":\"39cKANngRhgBAAAAAAAAAA==\",\"_self\":\"dbs\\/39cKAA==\\/colls\\/39cKANngRhg=\\/docs\\/39cKANngRhgBAAAAAAAAAA==\\/\",\"id\":\"c88b61d7-85aa-415f-8eda-4e73e5e811d5\",\"subscription\":{" (98 bytes omitted)
2023/08/10 05:10:48 http2: Framer 0xc00019d6c0: read DATA flags=END_STREAM stream=1 len=0 data=""
2023/08/10 05:10:48 http2: Transport received DATA flags=END_STREAM stream=1 len=0 data=""
2023/08/10 05:10:48 Response: ==> REQUEST/RESPONSE (Try=3/19.033188ms, OpTime=9.023936561s) -- RESPONSE RECEIVED
   GET https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5
   Authorization: REDACTED
   User-Agent: azsdk-go-azcosmos/v0.3.5 (go1.20.7; linux)
   X-Ms-Cosmos-Sdk-Supportedcapabilities: REDACTED
   X-Ms-Date: REDACTED
   X-Ms-Documentdb-Partitionkey: REDACTED
   X-Ms-Version: REDACTED
   --------------------------------------------------------------------------------
   RESPONSE Status: 200 OK
   Content-Length: 354
   Content-Type: application/json
   Date: Thu, 10 Aug 2023 05:10:48 GMT
   Etag: "0d02a39c-0000-0100-0000-60cc12330000"
   Lsn: REDACTED
   Server: Compute
   X-Ms-Activity-Id: REDACTED
   X-Ms-Alt-Content-Path: REDACTED
   X-Ms-Content-Path: REDACTED
   X-Ms-Cosmos-Item-Llsn: REDACTED
   X-Ms-Cosmos-Llsn: REDACTED
   X-Ms-Cosmos-Physical-Partition-Id: REDACTED
   X-Ms-Cosmos-Quorum-Acked-Llsn: REDACTED
   X-Ms-Current-Replica-Set-Size: REDACTED
   X-Ms-Current-Write-Quorum: REDACTED
   X-Ms-Documentdb-Partitionkeyrangeid: REDACTED
   X-Ms-Gatewayversion: REDACTED
   X-Ms-Global-Committed-Lsn: REDACTED
   X-Ms-Item-Lsn: REDACTED
   X-Ms-Last-State-Change-Utc: REDACTED
   X-Ms-Number-Of-Read-Regions: REDACTED
   X-Ms-Quorum-Acked-Lsn: REDACTED
   X-Ms-Request-Charge: REDACTED
   X-Ms-Request-Duration-Ms: REDACTED
   X-Ms-Resource-Quota: REDACTED
   X-Ms-Resource-Usage: REDACTED
   X-Ms-Schemaversion: REDACTED
   X-Ms-Serviceversion: REDACTED
   X-Ms-Session-Token: REDACTED
   X-Ms-Transport-Request-Id: REDACTED
   X-Ms-Xp-Role: REDACTED

2023/08/10 05:10:48 Retry: response 200
2023/08/10 05:10:48 Retry: exit due to non-retriable status code
2023/08/10 05:10:48 Obtained response with activity ID: 0a6c444d-f0a5-4987-b101-d7ae8c8f49ba

I’m able to see the connection been recover by the transport along with the runtime retrier policy 😃

At least for cosmos db client, should these setting be enabled by default ?

I did not test this for ARM RPs, is ARM supporting HTTP/2 ? Or maybe doing a better job to close connection ?

  • Anything we should know about your environment.

Tested this on a vanilla Azure VM, but same issue is happening code hosted on AKS.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 24 (16 by maintainers)

Most upvoted comments

Fix will be including in November’s release of azcore@v1.9.0

I don’t think any http server that support http2 would not have support ping frames. It’s part of the spec.

You’ll be surprised the number of servers I saw in the past not supporting ping frames lol.

But all Azure Endpoints should respect the RFC. This should be defaulted, if there is an issue with it, the teams owning the DP should fix it, not the client/caller.

And to be clear, the root cause is a bug in the Go standard library.

https://github.com/golang/go/issues/59690

What we’re evaluating here is a workaround until the above has been fixed.

@julienstroheker @jim-minter @sgmiller is there a particular time of day you see the timeouts occur more frequently than others? In our case, for Cosmos DB in particular, we always see connections being dropped or going bad around 23:00 UTC.

this is random for us and happening more often on busier region

If we want to make it configurable, we have a few options.

  • if it’s off by default, seems like it ends up being a knob that everybody will want to turn on
    • in addition, if/when we decide to turn it on by default, we have this vestigial configuration knob (changing the default to true is problematic as the zero-value is false)
  • we could make it on by default, with an option to disable it if there’s a problem
  • or we could make it opt-in via a build tag, but that’s not very discoverable
    • the idea being here that we could eventually remove the build tag and make it on by default once we gain confidence

Conceptually, it seems like this is something we’d always want to enable, and I have a hard time believing that we have Azure hosts that don’t support HTTP/2 ping.

I suppose a badly implemented http/2 server could not implement ping, but it should really not be in production hosting an azure service then? 😄

http2 spec defines 10 frames : https://httpwg.org/specs/rfc7540.html#rfc.section.11.2

I hope servers that claim supporting http/2 support the core 10 frames. There is nothing stating that they are optional

Note that the workaround might make things worse if running in an environment where ICMP is blocked (e.g. firewall).

Why would that be the case? Note: an HTTP/2 ping frame != an ICMP ping.

Thanks @ealsur, you’re quite right, I read this too superficially when triaging it