azure-sdk-for-python: [EventHub] ErrorCodes.UnknownError: Connection in an unexpected error state.

We are running live tests against other clouds such as US Gov and Azure China Cloud. The goal is to check whether new Azure SDK packages work with those clouds.

Error Description: When running the test test_send.py::test_send_with_partition_key and its async counterpart in windows2019_36 on the China cloud, the test fails. The error message is shown in the attached image.

Expected Behavior: The test test_send.py::test_send_with_partition_key passes in windows2019_36 on the China cloud. In local testing, it passes about 50% of the time.

@benbp , @jameszliao-msft , @lmazuel , @lilyjma , @ramya-rao-a and @annatisch for notification.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 24 (24 by maintainers)

Most upvoted comments

@v-xuto Thanks for testing this out! I was actually able to reproduce this error locally when testing against an EH namespace where the location is set far away and has a long round trip (specifically, Australia). I’m currently debugging right now to see if we’re retrying incorrectly or whether there’s a problem elsewhere. Will have more updates to you in the next week.

Hey @v-xuto! Just wanted to give you an update: I believe that the issue had to do with our retry logic in our test. I have a PR out to fix this and am working through a few comments on the PR. Hoping to have this merged by next week. Thanks @mikeharder for your suggestions on retry and reproducing with location set to Australia!

More details: We were using the underlying uamqp ReceiveClient to receive in the test (which doesn’t have the retry logic), rather than the EH Consumer Client (which has the retry logic). So, this should only be an issue in our test and not an issue with our SDK. Once I added the retry logic to the test, I was no longer getting this error.
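To illustrate the kind of change described above, here is a minimal sketch of wrapping a receive call in retry logic. It is not the code from the actual fix: `receive_with_retry` and `FlakyReceiver` are hypothetical names, and the stand-in receiver only simulates a transient connection error so that the example runs without a live Event Hubs namespace.

```python
import time


def receive_with_retry(receive_once, max_attempts=3, backoff_seconds=0.5,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call receive_once() until it succeeds or max_attempts is reached.

    Hypothetical helper: the real test may retry differently.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return receive_once()
        except retry_on as exc:
            last_error = exc
            if attempt < max_attempts:
                # Exponential backoff between attempts: 1x, 2x, 4x, ...
                sleep(backoff_seconds * 2 ** (attempt - 1))
    # All attempts failed: surface the last error to the caller.
    raise last_error


class FlakyReceiver:
    """Hypothetical stand-in for a receiver with no built-in retry."""

    def __init__(self, failures_before_success):
        self.failures_left = failures_before_success
        self.calls = 0

    def receive_message_batch(self, timeout=5000):
        self.calls += 1
        if self.failures_left > 0:
            self.failures_left -= 1
            # Simulated transient failure on a high-latency link.
            raise ConnectionError("Connection in an unexpected error state.")
        return ["message"]


if __name__ == "__main__":
    partition = FlakyReceiver(failures_before_success=2)
    received = receive_with_retry(
        lambda: partition.receive_message_batch(timeout=5000),
        backoff_seconds=0.01,
    )
    print(received, partition.calls)
```

Passing the receive call as a callable keeps the retry policy separate from the client, so the same helper could wrap whichever low-level receive the test uses.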

It doesn’t reproduce anymore, we’ll close it, thanks!

Hey @v-xuto - Really sorry about the delay! I had to prioritize a few other things, unfortunately. However, I was able to reproduce this error that you were running into on a Linux VM. Strangely enough, I know that before I made an update to this test, it was passing on Linux but not Windows. I need to dig a little more to find the source of the error, before I can fix this.

Thanks for checking in! This is definitely on my TODO list, and I will keep you updated on any progress.

@swathipil Is there any progress on the fix of this issue? If there is any progress, please keep us informed. Thanks a lot.

@swathipil I have re-run the eventhub weekly pipeline with your PR https://github.com/Azure/azure-sdk-for-python/pull/24014, but it only runs on the Public Cloud. Pipeline results: https://dev.azure.com/azure-sdk/internal/_build/results?buildId=1506942&view=results

If you want to test the weekly pipeline with your PR across all three clouds (public, China, and US Gov), refer to PR https://github.com/Azure/azure-sdk-for-python/pull/21715 and make the corresponding changes in your PR https://github.com/Azure/azure-sdk-for-python/pull/24014.

@swathipil I have re-run the eventhub weekly pipeline, but it still fails in the China cloud. Error: TimeoutError: Authorization timeout. Failed tests: test_send.py::test_send_with_partition_key, test_send_async.py::test_send_with_partition_key_async. Test EventHub PR: https://github.com/Azure/azure-sdk-for-python/pull/21715, which I have rebased onto the latest code. More details: https://dev.azure.com/azure-sdk/internal/_build/results?buildId=1498072&view=results

Hey @v-xuto! I just merged the potential fix for this issue. (Sorry about the long wait time on this!) Would you be able to re-run this test on the China cloud and let me know if the test passes? Thanks!

@swathipil Increasing the timeout does not improve the pass rate, which makes me doubt that the failure has much to do with the timeout. I also tried debugging the test, waiting 1~2 minutes at the line received = partition.receive_message_batch(timeout=5000), but the test still failed.

@swathipil: How much slower is Windows, and what is the scenario? Perf should be similar across Windows and Linux for most scenarios.

Our SDK and tests should be reliable even if the client and service are in distant regions. If we are seeing test failures in distant regions, to me this seems like a bug in either the SDK or the tests that should be investigated and fixed. Moving the client to a closer region might hide a real issue that needs to be fixed.

If you can’t have it in China (different permissions level from the US), there are many locations in South Korea, India, and East Asia that might simplify your scenario (just mentioning it in case it helps).

We would need to spin up a separate agent pool and then configure the test matrix entries to reference that pool. I’ll start a conversation with you and @mikeharder.

@swathipil Do you think we should consider running these tests on agent VMs located in one of the Asia regions?

@swathipil I have updated the timeout value to 10000, but the test still fails. More details are in the pipeline results.

Hi @v-xuto - I think this may take a little more digging into. However, @yunhaoling mentioned that this error may be happening due to poor connection because of the long distance from US to China. Can you try increasing the timeout value to 10000 (or something greater) and seeing if that improves test passing rate?