dapr: Publish to service bus fails with i/o timeout after a couple of ms and requires pod restart

In what area(s)?

/area runtime

What version of Dapr?

1.6.0

Expected Behavior

Publish operations should work flawlessly

Actual Behavior

Some pods suddenly enter a “defunct” state when publishing messages, while other pods publish to the same topics at the same time without any problems. The error message has the form:

time="2022-03-16T11:37:43.267942204Z" level=debug msg="rpc error: code = Internal desc = error when publish to topic <topic> in pubsub <pubsub>: read tcp <pod ip>:<port>-><sb pe ip>:<port>: i/o timeout" app_id=<deployment> instance=<pod name> scope=dapr.runtime.grpc.api type=log ver=1.6.0

Here “sb pe ip” is the Service Bus private endpoint IP. Note also that this error is logged at “debug” level. I suppose that should be changed.

One recent example is a pod that ran from 14 March 07:17 to 16 March 17:09. Dapr does not log successful publishes, but we have no reason to believe that there were any publishing issues on 14 and 15 March. Then, on 16 March at 12:37, the error starts to appear, and every publish from the pod fails until the pod is deleted at 17:09. Again, we have no logs of successful publishes, but our experience from live events where we have had the same error shows that the Dapr sidecar consistently fails to publish messages until the pod is deleted. The error message mentions an “i/o timeout”, but in every case the error is returned after only 2-5 ms of processing, which leaves little time for an actual timeout to occur.
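For context on how an “i/o timeout” can surface after only a few milliseconds: Dapr is written in Go, and in Go a read on a connection whose deadline has already passed fails immediately instead of waiting. The sketch below illustrates only that general behavior, using an in-memory `net.Pipe` connection. It is not a confirmed diagnosis of this issue and does not reproduce the Service Bus client code; the helper names `readWithExpiredDeadline` and `isTimeout` are made up for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"time"
)

// readWithExpiredDeadline (hypothetical helper) opens an in-memory connection,
// sets a read deadline that is already in the past, and attempts a read.
// It returns how long the read took and the error it produced.
func readWithExpiredDeadline() (time.Duration, error) {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()

	// A deadline in the past: subsequent reads fail without waiting.
	if err := client.SetReadDeadline(time.Now().Add(-time.Second)); err != nil {
		return 0, err
	}

	start := time.Now()
	_, err := client.Read(make([]byte, 1))
	return time.Since(start), err
}

// isTimeout (hypothetical helper) reports whether err wraps Go's
// deadline-exceeded error, whose message is "i/o timeout".
func isTimeout(err error) bool {
	return errors.Is(err, os.ErrDeadlineExceeded)
}

func main() {
	elapsed, err := readWithExpiredDeadline()
	fmt.Printf("elapsed=%v timeout=%v err=%v\n", elapsed, isTimeout(err), err)
}
```

If something similar were happening here, for example a stale deadline left on a reused connection, every publish on that connection would fail right away with this message. That is an assumption and would need to be verified against the sidecar's Service Bus component.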

This bug seems similar to https://github.com/dapr/dapr/issues/4031, but it has a different error message. We have reviewed the code changes for that issue, but they do not seem relevant in our case, since the error we observe occurs in the Dapr sidecar and not in our container.

Steps to Reproduce the Problem

Unclear. Some pods suddenly start to experience this error. We have two pods per deployment and the other pod works fine, so half of the requests fail.

Release Note

RELEASE NOTE: FIX Intermittent i/o timeout error in Dapr sidecar.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 38 (17 by maintainers)

Most upvoted comments

Thanks for letting us know. I’m on vacation but I’ll look at this once I’m back