components-contrib: PubSub Azure Service Bus periodically stops working due to authentication issues

Expected Behavior

Continue to operate normally

Actual Behavior

Periodically, our pubsub with Azure Service Bus stops working due to authentication issues. We see this on multiple clusters, and it looks like our Acceptance cluster hits it on the same day as our Production cluster (probably because we sign off on Acceptance and roll through to Production on the same day).

AFAICT, it happened on

  • August 3rd
  • August 25th
  • September 15th

It seems to be a pattern that repeats every 20-some days.
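The spacing between the reported dates can be checked with a few lines of Python. This is only a quick sketch: the year 2021 is an assumption taken from the log timestamps in this report, and the variable names are illustrative.

```python
from datetime import date

# Outage dates as listed in the report; the year (2021) is assumed
# from the timestamps in the sidecar logs.
outages = [date(2021, 8, 3), date(2021, 8, 25), date(2021, 9, 15)]

# Number of days between each pair of consecutive outages.
gaps = [(later - earlier).days for earlier, later in zip(outages, outages[1:])]

print(gaps)  # [22, 21]
```

Both gaps fall in the 21 to 22 day range, which matches the "20-some days" observation.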

The symptom is that we no longer receive any messages. The cause shows up in the Dapr sidecar logs and appears to be an authentication issue.

2021-09-15T05:52:09.809Z time="2021-09-15T05:52:09.809304727Z" level=error msg="azure service bus error: error receiving message on topic REDACTED, link detached, reason: *Error{Condition: amqp:link:detach-forced, Description: The link 'SfpRLmiAbJgr9e1tqh8pHkZ7xDDkRJaTc4bplkN9nlk2DPnoV_iHbQ' is force detached. Code: RenewToken. Details: Unauthorized access. 'Listen' claim(s) are required to perform this operation. Resource: 'REDACTED'.. TrackingId:62d462fb5fd548e29418a58cdebbb11d_G24, SystemTracker:gateway7, Timestamp:2021-09-15T05:52:09, Info: map[]}" app_id= REDACTED instance= REDACTED scope=dapr.contrib type=log ver=1.2.2

2021-09-15T05:52:09.809Z time="2021-09-15T05:52:09.809333928Z" level=warning msg="Subscription to topic REDACTED lost connection, attempting to reconnect... [0/30]" app_id= REDACTED instance= REDACTED scope=dapr.contrib type=log ver=1.2.2

2021-09-15T06:57:03.912Z time="2021-09-15T06:57:03.911865689Z" level=warning msg="azure service bus error: closing subscription entity for topic REDACTED: link detached, reason: *Error{Condition: amqp:link:detach-forced, Description: The link 'QLcCY6xfo57kXvlhZB0XJv4nz7jPLoNyJKy2PP6KO7N4gGlLdM9aEg' is force detached. Code: RenewToken. Details: Unauthorized access. 'Listen' claim(s) are required to perform this operation. Resource: 'REDACTED'.. TrackingId:120749e74bd6464a8f7efb733b0758c6_G2, SystemTracker:gateway7, Timestamp:2021-09-15T06:57:03, Info: map[]}" app_id= REDACTED instance=REDACTED scope=dapr.contrib type=log ver=1.2.2
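To spot this failure in collected sidecar logs (for example, to alert on it), the forced-detach entries above can be matched with a regular expression. The following is a rough sketch, not part of Dapr: `parse_detach` and `is_token_renewal_failure` are hypothetical helper names, and the pattern is based only on the message layout shown in the log lines above, so it may need adjusting for other versions.

```python
import re

# Pattern derived from the log lines in this report (an assumption about
# the message layout, not a documented format).
DETACH_RE = re.compile(
    r"Condition: (?P<condition>amqp:[\w:-]+)"
    r".*?Code: (?P<code>\w+)\."
    r" Details: (?P<details>[^.]+)\."
)


def parse_detach(line):
    """Return condition, code and details from a detach log line, or None."""
    match = DETACH_RE.search(line)
    return match.groupdict() if match else None


def is_token_renewal_failure(line):
    """True if the line reports a forced link detach caused by RenewToken."""
    fields = parse_detach(line)
    return (
        fields is not None
        and fields["condition"] == "amqp:link:detach-forced"
        and fields["code"] == "RenewToken"
    )


# Excerpt of the first error message from the report.
SAMPLE = (
    "azure service bus error: error receiving message on topic REDACTED, "
    "link detached, reason: *Error{Condition: amqp:link:detach-forced, "
    "Description: The link 'SfpRLmiAbJgr9e1tqh8pHkZ7xDDkRJaTc4bplkN9nlk2DPnoV_iHbQ' "
    "is force detached. Code: RenewToken. Details: Unauthorized access. "
    "'Listen' claim(s) are required to perform this operation."
)

print(parse_detach(SAMPLE))
print(is_token_renewal_failure(SAMPLE))
```

Running such a check over historical logs would also give exact timestamps for each occurrence, which helps confirm the interval between outages.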

Steps to Reproduce the Problem

Wait for it to happen every 20-ish days

Additional Info

Dapr 1.2.2 deployed in AKS (1.20.9) in HA mode using the Helm chart, reporting healthy.

Service Bus Standard tier with a global shared access policy for Dapr with Manage, Send, and Listen permissions.

Pub/sub component:

apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  annotations:
    meta.helm.sh/release-name: REDACTED
    meta.helm.sh/release-namespace: REDACTED
  creationTimestamp: "2021-06-18T13:18:11Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
  name: dapr-pubsub-component
  namespace: REDACTED
  resourceVersion: "81954849"
  uid: 4f232902-bb98-470e-9167-2a64ec19c2df
spec:
  metadata:
  - name: connectionString
    value: Endpoint=sb://REDACTED.servicebus.windows.net/;SharedAccessKeyName= REDACTED;SharedAccessKey= REDACTED
  - name: timeoutInSec
    value: "60"
  - name: handlerTimeoutInSec
    value: "60"
  - name: disableEntityManagement
    value: "false"
  - name: maxDeliveryCount
    value: ""
  - name: lockDurationInSec
    value: ""
  - name: lockRenewalInSec
    value: ""
  - name: maxActiveMessages
    value: ""
  - name: maxActiveMessagesRecoveryInSec
    value: ""
  - name: maxConcurrentHandlers
    value: ""
  - name: prefetchCount
    value: ""
  - name: defaultMessageTimeToLiveInSec
    value: ""
  - name: autoDeleteOnIdleInSec
    value: ""
  type: pubsub.azure.servicebus
  version: v1

Originally mentioned in: #874

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 23 (7 by maintainers)

Most upvoted comments

@bgelens we plan to start work on resiliency for all building blocks in 1.6. At this point, I would recommend adding retry logic around publishing in the app.

This is the issue that tracks this work: https://github.com/dapr/dapr/issues/3586

@jjcollinge I’ve upgraded 2 of our clusters to 1.5. Our production remains on 1.4.2 for now which allows us to compare.

We did start to notice issues on the publishing side as well. This is where we lose data with 1.4, as failed publishes are not retried and we did not expect to need to handle this ourselves.

Publish operation failed: the Dapr endpoint indicated a failure. See InnerException for details. Status(StatusCode="Internal", Detail="error when publish to topic mytopic in mycomponent: read tcp podprivateip:52148->servicebuspublicip:5671: read: connection reset by peer") 

Do you know if work in the 1.5 release addresses this? We are thinking about adding our own retry logic by catching the DaprException, but AFAIA this is something Dapr should do for us (right?)
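Until sidecar-level resiliency is available, application-side retry around the publish call could look roughly like the sketch below. It is deliberately SDK-agnostic: `publish_with_retry` is a hypothetical helper, and the `publish` argument is assumed to be a zero-argument callable that wraps whatever publish call your Dapr SDK exposes and raises on failure. The attempt count and delays are illustrative defaults, not recommended values.

```python
import random
import time


def publish_with_retry(publish, max_attempts=5, base_delay=0.5,
                       max_delay=10.0, retry_on=(Exception,),
                       sleep=time.sleep):
    """Call publish() until it succeeds or max_attempts is exhausted.

    publish  -- zero-argument callable wrapping the SDK publish call
                (hypothetical; supplied by the application)
    retry_on -- exception types treated as transient
    sleep    -- injectable for testing
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return publish()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff with full jitter, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


# Usage sketch with a stand-in publisher that fails twice, then succeeds.
attempts = []


def flaky_publish():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionResetError("read: connection reset by peer")
    return "published"


result = publish_with_retry(flaky_publish, sleep=lambda seconds: None)
print(result, len(attempts))
```

Note that a retried publish can deliver the same message more than once if the first attempt actually reached the broker, so subscribers should tolerate duplicates. In practice `retry_on` would be narrowed to the SDK's transient error type rather than all exceptions.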

I got the same issue, but it only happens when 1.7 RC.1 runs without HA enabled:

time="2022-04-07T23:57:03.828812055Z" level=warning msg="azure service bus error: closing subscription entity for topic dapr_xyz: link detached, reason: *Error{Condition: amqp:link:detach-forced, Description: The link 'oCp6GmgdlT2fQ987trzcFHJMXNDaK8TlqKUru4bpYTOoj7g--xgS_A' is force detached. Code: RenewToken. Details: Unauthorized access. 'Listen' claim(s) are required to perform this operation. Resource: 'sb://bla-bla-bla.servicebus.windows.net/dapr_xyz/subscriptions/yuhu'.. TrackingId:0e7ce83b2b9548609fb141cd50048c04_G5, SystemTracker:gateway7, Timestamp:2022-04-07T23:57:03, Info: map[]}" app_id=abc instance=abc-deployment-9bc894c66-chksp scope=dapr.contrib type=log ver=1.7.0-rc.1

That’s great to hear, thanks for the feedback @bgelens and @kalpchipsoft. Closing this issue for now.