azure-sdk-for-go: [azservicebus] Receiver indefinitely "stuck" after a long idle period

Bug Report

After a long period of inactivity, a receiver will stop receiving new messages. A “long period” is somewhere more than 8 hours and less than 13 days, but exactly how long is unknown.

(The problem was originally brought up in this comment)

This is very straightforward to demonstrate if you are willing to wait and set up a dedicated Service Bus instance. It occurs frequently with infrastructure used for QA, since that often receives no activity over weekends and holidays.

SDK Versions Used

I have seen this behavior across many versions of the azure-sdk-for-go, but the most recent test was conducted using these versions:

github.com/Azure/azure-sdk-for-go/sdk/azcore v1.1.0
github.com/Azure/azure-sdk-for-go/sdk/azidentity v1.1.0
github.com/Azure/azure-sdk-for-go/sdk/messaging/azservicebus v1.0.1

About the most recent time this was reproduced

We most recently reproduced this by running a small golang app in an AKS cluster using a managed identity assigned by aad-pod-identity.

In this test, we set up a dedicated Azure Service Bus + managed identity (terraform below) and let the app run. After 13 days, we came back to it. No errors had been emitted, just the regular startup message for the app. I then put a message on the bus. The receiver in the app did not pick up the message after waiting 30 minutes. We deleted the pod running the app and allowed it to be recreated by the deployment. The replacement pod immediately picked up the message.

Workaround

We can work around this issue by polling for messages using a 10 minute timeout and restarting in a loop. Our workaround looks like this and is known to work for weeks without an issue.

rcvrCtxForSdkWorkaround, canceller := context.WithTimeout(ctx, 10*time.Minute)
messages, err := azsbReceiver.ReceiveMessages(rcvrCtxForSdkWorkaround, 1, nil)
canceller()
if err != nil && !errors.Is(err, context.DeadlineExceeded) {
	r.logger.Info(EvtNameErrRetrievingMsgs, map[string]string{
		"error": err.Error(),
	})
	continue
}
// Reaching here with no messages just means a context was closed before
// any messages were picked up. That could have been either the master ctx
// or rcvrCtxForSdkWorkaround, so a loop is required.
if len(messages) == 0 {
	continue
}

The terraform for the test bus

The terraform below was used to set up the test bus and grant the app identity access to it.

resource "azurerm_servicebus_namespace" "ns" {
  name                = "fixturens${local.deploy_token}"
  location            = local.location
  resource_group_name = data.azurerm_resource_group.rg.name
  sku                 = "Standard"
  tags                = local.standard_tags
}

resource "azurerm_servicebus_queue" "test" {
  name                                 = "test"
  namespace_id                         = azurerm_servicebus_namespace.ns.id
  dead_lettering_on_message_expiration = false
  enable_partitioning                  = false
  default_message_ttl                  = "PT48H"
}

resource "azurerm_role_assignment" "full_access_ra" {
  for_each     = local.authorized_authorized_principal_ids_as_map
  scope        = azurerm_servicebus_queue.test.id
  principal_id = each.value
  role_definition_name = "Azure Service Bus Data Owner"
}

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 27 (11 by maintainers)

Most upvoted comments

Reopening as there’s still work for this.

Hi all, just want to update with some info.

There have been a lot of fixes in the latest package release to improve the reliability/resilience of azservicebus: azservicebus@v1.3.0.

In this release, we incorporated the GA version of go-amqp (the underlying protocol stack for this package). A lot of the work that went into this was around deadlocks and race conditions within the library that could cause it to become unresponsive. In addition there have been service-side fixes related to messages not being delivered, despite being available.

Every release brings improvements but this one hits at the core of the stack and should yield improvements in overall reliability, especially in some potential corner cases with connection/link recovery.

I’d encourage all the people involved in this thread to upgrade.

@richardpark-msft, I will roll out the new library version in our lower envs and let it run for a bit. However, I’m not going to roll it out anywhere in our “live paths”. While validation of this type of thing isn’t nearly as useful without real traffic, the workaround I have in place is stable and I don’t want to introduce risk there.

Appreciate all the testing you’ve done and your workaround is harmless. It should work just fine either way. It would be interesting, in production code, if you still see your workaround trigger - at that point we’d want to involve the service team to see if there’s something interesting happening there, instead of just focusing on the client SDK.

In the original description, I included some terraform and a description of how to reproduce the issue. Has this library been tested against that use case or a comparable one?

We run three different kinds of tests: unit, live/integration and long-term.

We’ve added a lot of tests in all three areas based on feedback from you (and others) to see if we could get to the bottom of it. We definitely found and fixed bugs, but I never replicated the ease of your scenario despite trying a lot of variations.

However, I do have your scenario covered here: https://github.com/Azure/azure-sdk-for-go/blob/main/sdk/messaging/azservicebus/internal/stress/tests/infinite_send_and_receive.go. We generally run these tests for a week to give them space to fail and we have a few other tricks using chaos-mesh to try to induce failures earlier.

(another one inspired by some bugs: https://github.com/Azure/azure-sdk-for-go/blob/main/sdk/messaging/azservicebus/internal/stress/tests/mostly_idle_receiver.go)

@rokf, the approach you have is basically eating the cost of starting up a TCP connection each time. You can improve this a few ways:

  • Use a single azservicebus.Client instance. The Client owns the expensive part (the TCP connection) so keeping that around and using it when creating your Senders will make things faster.
  • If you want you can also go further and use a single Sender instance as well - this’ll have the same effect but also cache the link, which does have some startup cost as well.

Now, in either case, it’s possible for the connection to have to restart if you idle out, so there might still be some initialization cost based on network conditions, etc… But the tactics above are the easiest ways to avoid having to eat the startup cost on every send.
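
A minimal sketch of that reuse pattern (the namespace host and queue name below are placeholders, and the error handling is deliberately bare): create the Client and Sender once at startup, then reuse them for every send.

package main

import (
	"context"
	"log"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
	"github.com/Azure/azure-sdk-for-go/sdk/messaging/azservicebus"
)

func main() {
	cred, err := azidentity.NewDefaultAzureCredential(nil)
	if err != nil {
		log.Fatal(err)
	}

	// One Client per process: it owns the (expensive) TCP/AMQP connection.
	client, err := azservicebus.NewClient("<namespace>.servicebus.windows.net", cred, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close(context.Background())

	// One Sender per queue: reusing it also keeps the AMQP link cached.
	sender, err := client.NewSender("test", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer sender.Close(context.Background())

	for i := 0; i < 10; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		err = sender.SendMessage(ctx, &azservicebus.Message{Body: []byte("hello")}, nil)
		cancel()
		if err != nil {
			log.Printf("send failed: %v", err)
		}
	}
}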

@alesbrelih, sorry I missed that you’ve provided details above. I’m following up on a fix that was made on the service-side that matches what you were seeing, in that you have valid credits on the link but no messages are being returned. I’ll post back here when I have details.

@richardpark-msft Hey Richard, we’ve noticed that a new version has been released - https://github.com/Azure/azure-sdk-for-go/tree/sdk/messaging/azservicebus/v1.2.0. Do you suggest that we switch to that version?

I always do, but I’m a bit biased. 😃 I’m still working on adding some resiliency to see if we can help more with a part of this situation. From what I can tell we’re getting into a situation where the link hasn’t re-issued credits (which tell Service Bus to send us more messages). I have a fix in the works for that, but I’m still going over ramifications for it.

The bug fix I added for this release was to make it so our internal Close() function would time out - prior to this it could actually hang for a long time and if it was cancelled it could leave things in an inconsistent state. This can affect things even if you aren’t specifically calling Close() since we will close things internally during network recovery.

So yes, definitely recommend. There are some additional logging messages now that show if the code path is being triggered and is remediating the issue (note, these are always subject to change in the future):

Connection reset for recovery instead of link. Link closing has timed out.
Connection closed instead. Link closing has timed out.

If your client was stuck or seems to disappear during the recovery process this could be the root cause.

@alesbrelih, can you generate a log for your failures? I’m working on a few fixes - the first one that should come out in the next release will introduce a timeout when we close links, which can hang indefinitely in some situations.

https://github.com/Azure/azure-sdk-for-go/blob/main/sdk/messaging/azservicebus/README.md#troubleshooting
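
For anyone following along, a minimal sketch of turning on the SDK’s diagnostic logging (per the troubleshooting doc above) looks roughly like this; the event constants used are the ones exported by azservicebus:

import (
	"fmt"

	azlog "github.com/Azure/azure-sdk-for-go/sdk/azcore/log"
	"github.com/Azure/azure-sdk-for-go/sdk/messaging/azservicebus"
)

func enableServiceBusLogging() {
	// Send SDK log lines somewhere visible (stdout here).
	azlog.SetListener(func(event azlog.Event, msg string) {
		fmt.Printf("[%s] %s\n", event, msg)
	})

	// Limit output to the Service Bus related event classes.
	azlog.SetEvents(
		azservicebus.EventConn,
		azservicebus.EventAuth,
		azservicebus.EventReceiver,
		azservicebus.EventSender,
	)
}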

@rokf - We are not; we keep the same receiver for the lifetime of the application execution.

It may be helpful to know that the service using this workaround isn’t processing many messages yet. In our highest volume environment, a given instance receives fewer than 500 messages a day and our peak rate is about 2-3 messages per second. We wouldn’t yet have seen any intermittent issues that only come up under even moderate load.

👋 @rokf - I am still using this workaround. It has been stable for the past several months, so we haven’t touched it. I picked 10 minutes arbitrarily and we’ve not tried a shorter timeout (though I’m sure that would work).

However… @richardpark-msft it’s important to note that just last Wednesday (01/18), we might have experienced the issue in our prod environment. We had a situation where we again stopped receiving messages and no errors were emitted from the SDK. This time, unlike the others, our Azure Service Bus UserError metric started climbing, indicating some interaction with the bus. Restarting the app caused messages to start being picked up again.

We’re adding more debugging to try to figure out what on earth happened, but that may have been this same problem showing up after 6+ months with the workaround in place. The lack of visibility into the problem makes it very hard to tell.

Hey @tkent, I added a client-side idle timer that does something similar to what you outlined above: under the covers, it recycles the link if nothing is received for 5 minutes. It was released in azservicebus 1.1.2.

Closing this now as we’ve formally implemented something similar to your workaround 😃.

This should help combat a situation I’ve been worried about for a bit. If the server idles out our link or detaches it and we miss it, then our link will still look alive during these quiet times, even though it’s never going to work. We now close out the link and attempt to recreate it, which forces a reconciliation between the service and client.
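
For reference, a minimal sketch of what a receive loop can look like once the built-in idle recycling is in place (assuming azservicebus >= 1.1.2, a queue named "test", and a client constructed as in the earlier sender sketch):

receiver, err := client.NewReceiverForQueue("test", nil)
if err != nil {
	log.Fatal(err)
}
defer receiver.Close(context.Background())

for {
	// No client-side timeout workaround: the SDK now recycles an idle link
	// internally, so this call can simply block until messages arrive.
	messages, err := receiver.ReceiveMessages(context.Background(), 1, nil)
	if err != nil {
		log.Printf("receive failed: %v", err)
		time.Sleep(time.Second) // avoid a hot loop on persistent errors
		continue
	}

	for _, msg := range messages {
		// ... process msg ...
		if err := receiver.CompleteMessage(context.Background(), msg, nil); err != nil {
			log.Printf("complete failed: %v", err)
		}
	}
}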

Hi @tkent, thank you for filing this issue. I know it’s frustrating to deal with a bug, so I appreciate you working with me on this.

We have tests for these kinds of scenarios but clearly, since you’re seeing a bug, I’m missing something. I’ll see what I’m missing there.