components-contrib: PubSub Azure Service Bus periodically stops working due to authentication issues
Expected Behavior
Continue to operate normally
Actual Behavior
Periodically, our pubsub with Azure Service Bus stops working due to authentication issues. We see this on multiple clusters, and our Acceptance cluster appears to hit it on the same day as our Production cluster (probably because we deploy to Acceptance and roll through to Production on the same day).
As far as I can tell, it happened on:
- August 3rd
- August 25th
- September 15th
It seems to be a pattern that repeats every 20-some days.
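To make the interval concrete, here is a small sketch that computes the gaps between the incidents listed above. The year 2021 is an assumption for the two August dates, taken from the timestamps in the sidecar logs.

```python
from datetime import date

# Incident dates from the list above; the year is assumed from the log timestamps.
incidents = [date(2021, 8, 3), date(2021, 8, 25), date(2021, 9, 15)]

# Days between each pair of consecutive incidents.
gaps = [(later - earlier).days for earlier, later in zip(incidents, incidents[1:])]
print(gaps)  # [22, 21]
```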
The symptom is that we stop receiving messages. The Dapr sidecar logs point to the cause, which appears to be an authentication issue:
2021-09-15T05:52:09.809Z time="2021-09-15T05:52:09.809304727Z" level=error msg="azure service bus error: error receiving message on topic REDACTED, link detached, reason: *Error{Condition: amqp:link:detach-forced, Description: The link 'SfpRLmiAbJgr9e1tqh8pHkZ7xDDkRJaTc4bplkN9nlk2DPnoV_iHbQ' is force detached. Code: RenewToken. Details: Unauthorized access. 'Listen' claim(s) are required to perform this operation. Resource: 'REDACTED'.. TrackingId:62d462fb5fd548e29418a58cdebbb11d_G24, SystemTracker:gateway7, Timestamp:2021-09-15T05:52:09, Info: map[]}" app_id= REDACTED instance= REDACTED scope=dapr.contrib type=log ver=1.2.2
2021-09-15T05:52:09.809Z time="2021-09-15T05:52:09.809333928Z" level=warning msg="Subscription to topic REDACTED lost connection, attempting to reconnect... [0/30]" app_id= REDACTED instance= REDACTED scope=dapr.contrib type=log ver=1.2.2
2021-09-15T06:57:03.912Z time="2021-09-15T06:57:03.911865689Z" level=warning msg="azure service bus error: closing subscription entity for topic REDACTED: link detached, reason: *Error{Condition: amqp:link:detach-forced, Description: The link 'QLcCY6xfo57kXvlhZB0XJv4nz7jPLoNyJKy2PP6KO7N4gGlLdM9aEg' is force detached. Code: RenewToken. Details: Unauthorized access. 'Listen' claim(s) are required to perform this operation. Resource: 'REDACTED'.. TrackingId:120749e74bd6464a8f7efb733b0758c6_G2, SystemTracker:gateway7, Timestamp:2021-09-15T06:57:03, Info: map[]}" app_id= REDACTED instance=REDACTED scope=dapr.contrib type=log ver=1.2.2
Steps to Reproduce the Problem
Wait for it to happen every 20-ish days
Additional Info
Dapr 1.2.2 deployed in AKS (1.20.9) in HA mode using the Helm chart, reporting healthy
Service Bus Standard tier with a global shared access policy for Dapr with Manage, Send and Listen permissions
Pubsub component:
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  annotations:
    meta.helm.sh/release-name: REDACTED
    meta.helm.sh/release-namespace: REDACTED
  creationTimestamp: "2021-06-18T13:18:11Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
  name: dapr-pubsub-component
  namespace: REDACTED
  resourceVersion: "81954849"
  uid: 4f232902-bb98-470e-9167-2a64ec19c2df
spec:
  metadata:
  - name: connectionString
    value: Endpoint=sb://REDACTED.servicebus.windows.net/;SharedAccessKeyName= REDACTED;SharedAccessKey= REDACTED
  - name: timeoutInSec
    value: "60"
  - name: handlerTimeoutInSec
    value: "60"
  - name: disableEntityManagement
    value: "false"
  - name: maxDeliveryCount
    value: ""
  - name: lockDurationInSec
    value: ""
  - name: lockRenewalInSec
    value: ""
  - name: maxActiveMessages
    value: ""
  - name: maxActiveMessagesRecoveryInSec
    value: ""
  - name: maxConcurrentHandlers
    value: ""
  - name: prefetchCount
    value: ""
  - name: defaultMessageTimeToLiveInSec
    value: ""
  - name: autoDeleteOnIdleInSec
    value: ""
  type: pubsub.azure.servicebus
  version: v1
Originally mentioned in: #874
About this issue
- State: closed
- Created 3 years ago
- Comments: 23 (7 by maintainers)
@bgelens we plan to start work on resiliency for all building blocks in 1.6. At this point, I would recommend adding retry logic for publishing in the app.
This is the issue to tackle this: https://github.com/dapr/dapr/issues/3586
@jjcollinge I’ve upgraded 2 of our clusters to 1.5. Our production remains on 1.4.2 for now which allows us to compare.
We also started to see issues on the publishing side. This is where we lose data with 1.4, because failed publishes are not retried and we had not expected to need to handle this ourselves.
Do you know whether work in the 1.5 release addresses this? We are thinking about adding our own retry logic by catching the DaprException, but as far as I am aware this is something Dapr should do for us (right?)
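For illustration, app-side retry on publish could look roughly like the sketch below. It is a generic wrapper, not Dapr SDK API: `publish_with_retry`, its parameters, and the `publish` callable are hypothetical names, and the exponential backoff with jitter is an assumed policy rather than anything prescribed in this thread.

```python
import random
import time
from typing import Callable, Tuple, Type


def publish_with_retry(
    publish: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 10.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call publish() until it succeeds and return the number of attempts used.

    `publish` is a hypothetical zero-argument callable that wraps the real
    publish call and raises on failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            publish()
            return attempt
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff capped at max_delay, plus up to 50% jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("max_attempts must be at least 1")
```

In practice `retry_on` would be narrowed to the exception type the SDK raises for a failed publish, so that programming errors are not retried.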
I hit the same issue, but it only happens when 1.7 RC.1 runs without HA enabled
time="2022-04-07T23:57:03.828812055Z" level=warning msg="azure service bus error: closing subscription entity for topic dapr_xyz: link detached, reason: *Error{Condition: amqp:link:detach-forced, Description: The link 'oCp6GmgdlT2fQ987trzcFHJMXNDaK8TlqKUru4bpYTOoj7g--xgS_A' is force detached. Code: RenewToken. Details: Unauthorized access. 'Listen' claim(s) are required to perform this operation. Resource: 'sb://bla-bla-bla.servicebus.windows.net/dapr_xyz/subscriptions/yuhu'.. TrackingId:0e7ce83b2b9548609fb141cd50048c04_G5, SystemTracker:gateway7, Timestamp:2022-04-07T23:57:03, Info: map[]}" app_id=abc instance=abc-deployment-9bc894c66-chksp scope=dapr.contrib type=log ver=1.7.0-rc.1
That’s great to hear, thanks for the feedback @bgelens and @kalpchipsoft. Closing this issue for now.