azure-sdk-for-js: Download from Blob storage gets stuck sometimes and never completes

  • Package Name: Azure/azure-storage-node
  • Package Version: v12.9.0
  • Operating system: Ubuntu 20.04.4 LTS
  • nodejs
    • version: v16.13.0
  • browser
    • name/version: NA
  • typescript
    • version:
  • Is the bug related to documentation in

Describe the bug Download from Blob storage intermittently fails to complete and stays stuck for a very long time. Here’s an example run (with info logs from the SDK): https://github.com/kotewar/cron-action-test-download-bug/runs/6935926824?check_suite_focus=true

Screenshot 2022-06-20 at 11 11 10 AM

This doesn’t always happen, but once in a while the download gets stuck, and many users are affected. Users of Actions/Cache have raised many issues about it.

To Reproduce Steps to reproduce the behavior: As this is an intermittent issue, it can be reproduced by scheduling a GitHub Action that creates a very large cache. This workflow file can be used for that.

Expected behavior The download should complete 100% every time and not get stuck intermittently.

Additional context The same file is downloaded by multiple runners most of the time, and this issue is mostly seen when the same file is being downloaded in parallel from Azure Blob Storage.

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Reactions: 14
  • Comments: 52 (16 by maintainers)

Most upvoted comments

@bishal-pdMSFT, we don’t plan to release a hotfix for it.

Hi @bishal-pdMSFT ,

A stuck download can have several causes. For now, I can repro one by unplugging the network cable from my local machine. For a download, the SDK splits it into small pieces and sends a download request for each piece in parallel. We’ll need to set a timeout on each request that downloads a small chunk. I’ll make a fix ASAP.
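To illustrate the per-request timeout idea described above, here is a minimal sketch. It is not the SDK's actual fix; `withTimeout` is a hypothetical helper, and it assumes each chunk download is exposed as a function that accepts an `AbortSignal`.

```typescript
// Hypothetical helper (not part of @azure/storage-blob): rejects if `task`
// does not settle within `timeoutMs`. The AbortSignal is passed to the task
// so it can cancel its underlying request when the timer fires.
async function withTimeout<T>(
  task: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  // This promise only ever rejects, and only when the timer fires.
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new Error(`chunk timed out after ${timeoutMs} ms`));
    }, timeoutMs);
  });

  try {
    // Whichever settles first wins: the chunk download or the timer.
    return await Promise.race([task(controller.signal), timeout]);
  } finally {
    // Always clear the timer so a finished chunk does not keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

With something like this, a chunk whose connection silently dies would fail after `timeoutMs` instead of hanging, and the caller could then retry only that chunk.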

Thanks Emma

Talked with @kotewar. For the download failure, the service had successfully handled the request and sent the response headers to the client, and was then writing the response body to the client. The service got a connection-break error when only part of the response body had been written to the client.

Currently, the download interface in the SDK sends only one request per download operation. Downloading a large file requires keeping the connection alive for a pretty long time, and there’s a big chance of hitting a connection break during that time.
A possible solution for downloading a large file more reliably: split the large blob into small pieces and send a download request for each piece.
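As a rough sketch of the "split into small pieces" idea, the function below computes the byte ranges to request. `splitIntoRanges` and `ByteRange` are hypothetical names for illustration, and the chunk size is left to the caller; this is not the SDK's implementation.

```typescript
// Hypothetical shape for one ranged request: start offset and number of bytes.
interface ByteRange {
  offset: number;
  count: number;
}

// Splits a blob of `totalSize` bytes into consecutive ranges of at most
// `chunkSize` bytes. The last range is shorter when the size is not a
// multiple of the chunk size.
function splitIntoRanges(totalSize: number, chunkSize: number): ByteRange[] {
  if (!Number.isInteger(totalSize) || totalSize < 0) {
    throw new RangeError("totalSize must be a non-negative integer");
  }
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const ranges: ByteRange[] = [];
  for (let offset = 0; offset < totalSize; offset += chunkSize) {
    ranges.push({ offset, count: Math.min(chunkSize, totalSize - offset) });
  }
  return ranges;
}
```

Each range can then be downloaded with its own short-lived request, so a broken connection costs one chunk rather than the whole file.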

I’ll put the downloading/uploading improvement in our backlog.

The PR for the fix is merged: https://github.com/Azure/azure-sdk-for-js/pull/22894 We are planning to have a preview release this week.

👋, we are seeing multiple customer issues created for this. Do we have any update?

Hi, the timeout fix has been released in the latest GA version: @azure/storage-blob@12.12.0: https://www.npmjs.com/package/@azure/storage-blob/v/12.12.0

@EmmaZhu bumping this up as it is affecting more and more customers. The worst part here is that the download gets stuck forever, and hence the GitHub Actions workflow run gets stuck. Since the run minutes are billed, it is causing a lot of concern.

There are two issues in play here:

  1. The download gets stuck - the repro steps and SDK logs have been provided already but please let us know if any more information is needed to debug further
  2. The timeout is not honoured - I think it is higher priority to fix this first, as it will at least fail the stuck download faster.
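For the second point, one way a caller could fail a stuck download faster is a progress watchdog. The sketch below is only an illustration under assumptions: `failOnStall` is a hypothetical helper, and it assumes the download reports progress through a callback and honours an `AbortSignal`.

```typescript
// Hypothetical helper: rejects when `start` reports no progress for `stallMs`.
// `start` is assumed to call `onProgress` whenever data arrives and to stop
// its request when `signal` is aborted.
async function failOnStall<T>(
  start: (onProgress: () => void, signal: AbortSignal) => Promise<T>,
  stallMs: number
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  let finished = false;
  let rejectStall: (err: Error) => void = () => {};
  const stalled = new Promise<never>((_, reject) => {
    rejectStall = reject;
  });

  // (Re)arms the watchdog; called once up front and on every progress event.
  const arm = (): void => {
    if (finished) return; // ignore late progress events after completion
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => {
      controller.abort();
      rejectStall(new Error(`no progress for ${stallMs} ms`));
    }, stallMs);
  };

  arm();
  try {
    return await Promise.race([start(arm, controller.signal), stalled]);
  } finally {
    finished = true;
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Unlike a fixed deadline, this keeps a slow but healthy download alive and only fails one that has stopped receiving data.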

@xirzec - I used v12.10.0 and I am able to reproduce the same issue.

Logs here - https://github.com/kotewar/cron-action-test-download-bug/runs/7047939232?check_suite_focus=true

Hi, I’m having the same issue with the Python Azure SDK. Downloads of rather large files (several GBs) sometimes get stuck (we run the download 48 times per day, and in 10%-15% of tries it just gets stuck) and never time out, even though I have set the timeout. It looks like it gets stuck close to the end of the file it tries to download. Here is the part of the log showing where it gets stuck:

[2023-01-09, 06:21:33 UTC] {connectionpool.py:442} DEBUG - https://xxxx.blob.core.windows.net:443 "GET /folder/file.json?timeout=1800 HTTP/1.1" 206 4194304
...
...
...
[2023-01-09, 06:41:32 UTC] {connectionpool.py:442} DEBUG - https://xxxx.blob.core.windows.net:443 "GET /folder/file.json?timeout=1800 HTTP/1.1" 206 4194304
[2023-01-09, 06:41:33 UTC] {connectionpool.py:442} DEBUG - https://xxxx.blob.core.windows.net:443 "GET /folder/file.json?timeout=1800 HTTP/1.1" 206 592006

Hi @EmmaZhu, thanks for the update.

I have been testing the changes extensively in a workflow with this latest version. Although I’ve seen some improvements, we haven’t seen consistency in the download. I’ve shared my observations here.

To summarise, out of the multiple runs we ran in the past couple of weeks, we saw -

  1. 1 timeout where the download got stuck
  2. Multiple runs running up to 17 minutes, getting stuck in the middle, then automatically recovering and downloading
  3. Multiple runs timing out within 2 minutes.

All this despite specifying a timeout of 30000 ms.

Can the team please take another look at what might be missing?

In case it is helpful, you can find many examples of this error occurring within the failures of this workflow. All or nearly all failures that show a runtime of 6h Xm Ys are due to this problem. There are roughly 40 examples in the last week.

For reference, https://github.com/actions/toolkit/blob/main/packages/cache/src/internal/downloadUtils.ts#L212 shows how actions/cache is using the Azure SDK. Note it is passing in

     tryTimeoutInMs: options.timeoutInMs

which is set to 30000 (30 seconds) by default. From the docs:

Optional. Indicates the maximum time in ms allowed for any single try of an HTTP request. A value of zero or undefined means no default timeout on SDK client, Azure Storage server’s default timeout policy will be used.

And from the screenshots provided in the earlier comments, it appears this value isn’t having the intended effect, as we would expect the stuck request to time out, possibly after a few retries, much faster than what’s seen.
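To put a rough number on "much faster", here is a back-of-the-envelope sketch of the longest a single request should take if the per-try timeout were honoured. `worstCaseMs` is a hypothetical helper; the `maxTries` and `retryDelayInMs` values used below are illustrative assumptions, not values taken from actions/cache, and a fixed delay between tries is a simplification.

```typescript
// Hypothetical helper: upper bound on wall time for one request when every
// try hits the per-try timeout. Assumes a fixed delay between tries, which
// is a simplification of whatever backoff the client really uses.
function worstCaseMs(
  tryTimeoutInMs: number,
  maxTries: number,
  retryDelayInMs: number
): number {
  if (!Number.isInteger(maxTries) || maxTries < 1) {
    throw new RangeError("maxTries must be an integer >= 1");
  }
  // maxTries timed-out attempts, with a delay between consecutive attempts.
  return maxTries * tryTimeoutInMs + (maxTries - 1) * retryDelayInMs;
}

// Illustrative values only: 30 s per try (the default mentioned above),
// and an assumed 4 tries with an assumed 4 s delay between them.
const boundMs = worstCaseMs(30000, 4, 4000); // 132000 ms, about 2.2 minutes
```

Under those assumptions a stuck request should give up within a few minutes, which is far from the multi-hour hangs reported in this thread.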

Moving this back over, I got confused by the linked repro having ‘java-example’ in the name and thought this had been misfiled.

@kotewar I assume the package you are using is actually @azure/storage-blob ? It looks like your version is slightly out of date, can you reproduce this issue with the latest (12.10.0)?

Btw sometimes when download is stuck I also see the following: [connectionpool] [INFO] Resetting dropped connection: <hostname> that probably means that Azure is dropping connections for some reason.

@ilyadinaburg, thanks for bringing this up. Until today I was assuming that this issue was only with the JavaScript SDK. @EmmaZhu can you please loop the storage team into this issue? It seems like it’s getting more and more widespread and could be related to backend behaviour or throttling that we might not be aware of. 🤔

@styfle , We are planning to have the GA release in the second week in October…

To clarify, this is for the workaround (timeout) fix, correct?

Any update on a root cause fix?

@EmmaZhu When can we expect this fix to be released?

@EmmaZhu this would qualify for a hot-patch. Can we take that route? We need a quick resolution.

@EmmaZhu, the failure, as I mentioned, doesn’t always happen, so if you check the successful jobs in the same run, you’ll notice that the same file or segment where it got stuck was downloaded successfully by other jobs. So yes, this might take time for you to reproduce, but one observation from our end has been the following: this issue mostly occurs when the same file is being downloaded in parallel by multiple jobs. Not sure if our finding helps you, but I thought I’d bring it up here in case it gives you a clue.

I am attaching the whole workflow logs and the failure execution logs. We had enabled debugging in the Azure SDK to show the download progress with headers and responses. Hope this helps - logs_83.zip Failure logs.txt