azure-sdk-for-go: azcosmos occasionally returning 403 Forbidden "Connection is insufficiently secured"

Bug Report

  • Package: /sdk/data/azcosmos
  • SDK version: latest
  • go version: 1.18.6

Every now and then, a call to Cosmos DB returns 403 Forbidden with a message saying the client did not use the minimum required TLS version:

GET https://XXXXX.documents.azure.com:443/dbs/xxx/colls/xxx/docs/1822b18eb972595bda1b797b332dff1b11567aaaba936ce75824bc0fefdd282e
--------------------------------------------------------------------------------
RESPONSE 403: 403 Forbidden
ERROR CODE: Forbidden
--------------------------------------------------------------------------------
{
"code": "Forbidden",
"message": "Connection is insufficiently secured. Please use Tls SSL protocol or higher\r\nActivityId: adb35793-a4ca-481d-9cc6-dd5d0adf8eb5, documentdb-dotnet-sdk/2.14.0 Host/64-bit MicrosoftWindowsNT/10.0.17763.0"
}
--------------------------------------------------------------------------------
, Dependency: Microsoft.DocumentDB, OriginError: GET https://xxxx.documents.azure.com:443/dbs/xxx/colls/xxxx/docs/1822b18eb972595bda1b797b332dff1b11567aaaba936ce75824bc0fefdd282e
--------------------------------------------------------------------------------
RESPONSE 403: 403 Forbidden
ERROR CODE: Forbidden
--------------------------------------------------------------------------------
{
"code": "Forbidden",
"message": "Connection is insufficiently secured. Please use Tls SSL protocol or higher\r\nActivityId: adb35793-a4ca-481d-9cc6-dd5d0adf8eb5, documentdb-dotnet-sdk/2.14.0 Host/64-bit MicrosoftWindowsNT/10.0.17763.0"
}
--------------------------------------------------------------------------------

Note that the error message includes a reference to the .NET DocumentDB SDK. We do not use .NET; we receive this error when talking to Cosmos DB from the Go SDK.

Go 1.18+ defaults to a minimum of TLS 1.2 for client connections in the standard HTTP stack, and we explicitly set it anyway.
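
For illustration, here is a minimal sketch of how the minimum TLS version can be pinned explicitly when constructing the azcosmos client by passing a custom *http.Client as the SDK transport. The endpoint and key are placeholders, and this assumes the NewKeyCredential/NewClientWithKey constructors; it is a sketch, not our exact production code.

```go
package example

import (
	"crypto/tls"
	"net/http"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos"
)

// newCosmosClient builds an azcosmos client whose transport pins the minimum
// TLS version to 1.2. endpoint and key are placeholders.
func newCosmosClient(endpoint, key string) (*azcosmos.Client, error) {
	cred, err := azcosmos.NewKeyCredential(key)
	if err != nil {
		return nil, err
	}

	// Note: a bare *http.Transport with a custom TLSClientConfig does not
	// negotiate HTTP/2 unless ForceAttemptHTTP2 is set; setting it mirrors
	// the behaviour of http.DefaultTransport.
	httpClient := &http.Client{
		Transport: &http.Transport{
			ForceAttemptHTTP2: true,
			TLSClientConfig:   &tls.Config{MinVersion: tls.VersionTLS12},
		},
	}

	return azcosmos.NewClientWithKey(endpoint, cred, &azcosmos.ClientOptions{
		ClientOptions: azcore.ClientOptions{
			// *http.Client satisfies the SDK's Transporter interface.
			Transport: httpClient,
		},
	})
}
```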

Almost all calls from that same client succeed, and a few calls a day fail with 403 without any changes in configuration. This has happened in all regions since early December.

  • What did you expect or want to happen?

No min TLS errors; the calls should succeed consistently.

  • How can we reproduce it?

Run a service that issues calls to a Cosmos DB container continuously; it will eventually hit this error. A rough loop of that kind is sketched below.

  • Anything we should know about your environment.

I don’t think there is anything special about our environment.
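
As mentioned above, a rough sketch of the kind of loop that reproduces this: repeatedly hit the same item and count how many calls come back 403. It assumes a *azcosmos.ContainerClient built as in the earlier sketch; the partition key and item ID are placeholders.

```go
package example

import (
	"context"
	"errors"
	"log"
	"net/http"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos"
)

// countForbidden reads the same item 100 times and logs how many of the
// calls fail with 403. The partition key value and item ID are placeholders.
func countForbidden(ctx context.Context, container *azcosmos.ContainerClient) {
	pk := azcosmos.NewPartitionKeyString("pk-value")
	forbidden := 0
	for i := 0; i < 100; i++ {
		_, err := container.ReadItem(ctx, pk, "item-id", nil)
		var respErr *azcore.ResponseError
		if errors.As(err, &respErr) && respErr.StatusCode == http.StatusForbidden {
			forbidden++
			log.Printf("call %d returned 403: %v", i, err)
		}
	}
	log.Printf("%d of 100 calls returned 403", forbidden)
}
```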

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 2
  • Comments: 42 (26 by maintainers)

Most upvoted comments

Update from the Azure Cosmos issue raised via support: they tell me that some configuration changes have been made on the Cosmos/Gateway side and that they are continuing to monitor for 403 errors. I suspect you may have further insight on this @ealsur, but mentioning it just in case you're not on the internal chain of comms.

From our production service logs, I'm happy to report that we've had no further "retries for 403 errors" attempted for the last 12 hours now, which means we're no longer experiencing the issue the retries were put in place for. We'll be monitoring through the day, but given that we've consistently had at least 3 or 4 per hour over the last two weeks, to have had zero in 12 hours sounds like tremendous progress.

Fingers crossed it stays that way - though I would be interested to know what the config change may have entailed.

@ealsur The activity id for one that happened at 00:09 UTC today was ecea9ef6-b5a3-4119-b0d6-b3ec3f1040eb.

The full error was:

[*exported.ResponseError]: GET https://xxxx.documents.azure.com:443/dbs/xxx/colls/xxx/docs/xxx
--------------------------------------------------------------------------------
RESPONSE 403: 403 Forbidden
ERROR CODE: Forbidden
--------------------------------------------------------------------------------
{
"code": "Forbidden",
"message": "Connection is insufficiently secured. Please use Tls SSL protocol or higher\r\nActivityId: ecea9ef6-b5a3-4119-b0d6-b3ec3f1040eb, documentdb-dotnet-sdk/2.14.0 Host/64-bit MicrosoftWindowsNT/10.0.17763.0"
}

Friday, Dec 9th is the first time we see it in the logs. No config changes or Go version changes that I can find. In fact, no deploys that week at all with the early holidays.

I did update to the newest version of the SDK today and am testing it now, and think ~I'm not seeing it anymore… haven't sent it to prod yet though.~ As soon as I typed that, failures showed up. I jinxed it.

BTW, here is an ActivityId from a recent failure; I am hoping you have the ability to look up server-side logs based on that: 9c2bceab-0458-4ac4-9620-cd31993bbc55

Could this be related: #19469 (comment)

@ealsur the related issue seems unlikely to me. The call works, just not all the time. I did a loop upserting the same document, and out of 100 calls, 20 of them threw this error. If it were a config issue, I would expect none of them to work.

PS: looking at our logs, this also only started recently.

I went and reached out to the team to get more clarity and details about the situation to share here. The team has identified a fix related to the HTTP/2 protocol, and it's in the final stages of testing. Once that is completed, it will roll out to all Gateway endpoints, but there is no fixed date for this yet as testing is underway (and we know, based on the experience on this thread, how hard it might be to test/replicate). Will share more updates as it progresses.

This issue has been fixed on the service and the fix has been widely deployed. Please re-open if the issue is still happening or arises again.

For the sake of completeness, we have this morning deployed a new version of our microservice that has a retry option for 403 responses, specifically for state failures (GETs, PUTs or POSTs) linked to this Go HttpClient.

We have since had two failed GETs (ActivityIds 918c25fa-cb80-413a-8d57-52358bbfe31b and f776b2d2-3b96-473b-b69e-b4db4f6a3641) within a few minutes of each other. The subsequent retries of both, which happened milliseconds after each request, succeeded without any issue, and as such we have avoided the data corruption those calls would otherwise have caused.

Of course as has been mentioned on this thread before, this is not ideal, but at least it seems we are not suffering the consequences now - I will update if that changes.

@ealsur I hope those failed ActivityIds can be of some assistance to the internal team; the fact that an identical request directly afterward succeeded should hopefully give the team some guidance on reproducing the issue.
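
For illustration, here is a rough sketch of the kind of narrow 403 retry described above. The actual implementation in the microservice isn't shown in this thread; the helper name, attempt count, and pause are hypothetical.

```go
package example

import (
	"context"
	"errors"
	"net/http"
	"strings"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
)

// retryOn403 is a hypothetical helper: it retries an operation only when it
// fails with 403 and the "insufficiently secured" message, since in this
// thread that specific failure is transient. Any other error is returned
// immediately.
func retryOn403(ctx context.Context, attempts int, op func(context.Context) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = op(ctx)
		if err == nil {
			return nil
		}
		var respErr *azcore.ResponseError
		if !errors.As(err, &respErr) ||
			respErr.StatusCode != http.StatusForbidden ||
			!strings.Contains(err.Error(), "insufficiently secured") {
			return err // not the transient 403; surface it
		}
		time.Sleep(50 * time.Millisecond) // brief pause before retrying
	}
	return err
}
```

A call site would simply wrap the individual read/upsert call in the op closure, so only these specific 403s get a second attempt.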

Update on this: I have raised an issue via our account manager with the Azure Cosmos product team - still awaiting an update from them as to the cause.

We've also since upgraded Dapr to the latest (v1.9.5) on the Kubernetes control plane, along with the Dapr client within the Docker images running on there. I'm not sure if there is any later version of the Go Azure Cosmos lib being used, but at least everything of ours is on the latest versions.

The problem still persists: we get a few errors every hour, with thousands of the same calls succeeding in between - thereby slowly corrupting our data over time. We're applying manual data fixes in the meantime, but this isn't tenable over the medium term.

I'll update when I have more from the Azure product team, but I'm not hopeful it can be rectified with a code fix from the Go side, as the default behaviour is to use at least TLS 1.2, which is exactly what the error says is not happening.

Any insights I have missed, or anything further I could try, would be much appreciated.