azure-sdk-for-js: Download from Blob storage gets stuck sometimes and never completes
- Package Name: Azure/azure-storage-node
- Package Version: v12.9.0
- Operating system: Ubuntu 20.04.4 LTS
- nodejs
- version: v16.13.0
- browser
- name/version: NA
- typescript
- version:
- Is the bug related to documentation in
- README.md
- source code documentation
- SDK API docs on https://docs.microsoft.com
Describe the bug
Download from blob storage fails to complete and gets stuck for a very long time. Here’s an example run (with info-level
logs from the SDK): https://github.com/kotewar/cron-action-test-download-bug/runs/6935926824?check_suite_focus=true

This doesn’t always happen, but once in a while the download gets stuck for many users. Several issues have been raised about this by users of actions/cache.
References-
- https://github.com/actions/cache/issues/810
- https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/10289
To Reproduce Steps to reproduce the behavior: As this is an intermittent issue, it can be reproduced by scheduling a GitHub Actions workflow that creates a very large cache. This workflow file can be used for that purpose.
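For illustration only, a minimal scheduled workflow along these lines (a hypothetical sketch, not the actual workflow file referenced above; names, the cron interval, and the file size are placeholders) might look like:

```yaml
name: cache-download-repro
on:
  schedule:
    - cron: "*/30 * * * *"   # run every 30 minutes to catch the intermittent hang
jobs:
  repro:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Create a large file to cache (~2 GB)
        run: |
          mkdir -p big-cache
          head -c 2G /dev/urandom > big-cache/blob.bin
      - uses: actions/cache@v3
        with:
          path: big-cache
          key: big-cache-fixed-key   # stable key so later runs hit the download path
```

With a stable cache key, every run after the first restores (downloads) the large cache, which is the code path that intermittently hangs.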
Expected behavior The download should complete 100% every time and not get stuck intermittently.
Screenshots
If applicable, add screenshots to help explain your problem.
Additional context Most of the time the same file is downloaded by multiple runners, and this issue is mostly seen when the same file is being downloaded in parallel from Azure Blob Storage.
About this issue
- Original URL
- State: open
- Created 2 years ago
- Reactions: 14
- Comments: 52 (16 by maintainers)
Commits related to this issue
- [CI] Use stopgap solution for hanging actions/cache bug There is a bug in actions/cache where downloads may hang indefinitely. This has plagued our workflows for months, causing many unnecessary fail... — committed to djaglowski/opentelemetry-collector-contrib by djaglowski 2 years ago
- [CI] Use stopgap solution for hanging actions/cache bug (#13453) There is a bug in actions/cache where downloads may hang indefinitely. This has plagued our workflows for months, causing many unnece... — committed to open-telemetry/opentelemetry-collector-contrib by djaglowski 2 years ago
- [tests] Fix timeout for `actions/setup-node` (#8639) Try fixing the timeout again. For example: https://github.com/vercel/vercel/actions/runs/3130757219/jobs/5081381465 - Follow up to #8613 - Re... — committed to vercel/vercel by styfle 2 years ago
- [tests] Update to use `actions/cache@v3` (#9056) Use `actions/cache` directly instead of relying on `actions/setup-node` to see if this solves [the hanging restore](https://github.com/vercel/vercel/a... — committed to vercel/vercel by styfle 2 years ago
@bishal-pdMSFT , we don’t have the plan to release a hotfix for it.
Hi @bishal-pdMSFT ,
The hang can have several causes. For now, I can reproduce a hang by unplugging the network cable from my local machine. For a download, the SDK splits it into small pieces and sends out a download request for each piece in parallel. We’ll need to set a timeout for each request that downloads the small chunks. I’ll make a fix ASAP.
Thanks Emma
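The per-request timeout described above could be sketched roughly like this (a hypothetical `withTimeout` helper, not the actual SDK fix; the idea is that each chunk request races against its own deadline instead of the whole download sharing one):

```typescript
// Hypothetical sketch: wrap each chunk-download promise with its own timeout,
// so a single stalled request fails fast instead of hanging the whole download.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Example: a chunk that resolves in 10 ms easily beats a 100 ms deadline.
async function demo(): Promise<string> {
  const fastChunk = new Promise<string>((resolve) => setTimeout(() => resolve("chunk-0"), 10));
  return withTimeout(fastChunk, 100, "chunk-0");
}
```

A chunk that stalls past its deadline would reject with a timeout error, which a retry layer can then catch and re-issue for just that chunk.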
Talked with @kotewar. For the download failure, the service had successfully handled the request and sent response headers to the client, and was then writing the response body. The service got a connection-break error after part of the response body had been written to the client.
Currently, the download interface in the SDK sends only one request per download operation. Downloading a large file requires keeping the connection alive for a long time, so there is a high chance of the connection breaking during that window.
A possible solution for downloading large files more reliably is to split a large blob into small pieces and send a download request for each piece.
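The range-splitting step could be sketched like this (the `getBlobRanges` helper is hypothetical; in the real SDK each range would then be fetched via a ranged request such as `BlobClient.download(offset, count)`, each with its own timeout and retry):

```typescript
// Hypothetical helper: split a blob of `totalSize` bytes into ranges of at
// most `chunkSize` bytes each. Each range can then be fetched with its own
// short-lived request instead of one long-lived connection.
interface BlobRange {
  offset: number;
  count: number;
}

function getBlobRanges(totalSize: number, chunkSize: number): BlobRange[] {
  const ranges: BlobRange[] = [];
  for (let offset = 0; offset < totalSize; offset += chunkSize) {
    // The final range may be shorter than chunkSize.
    ranges.push({ offset, count: Math.min(chunkSize, totalSize - offset) });
  }
  return ranges;
}
```

Because each range request is small and independent, a broken connection costs only one chunk retry rather than restarting a multi-gigabyte transfer.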
I’ll put the downloading/uploading improvement in our backlog.
The PR for the fix is merged: https://github.com/Azure/azure-sdk-for-js/pull/22894 We are planning to have a preview release this week.
👋 , we are seeing multiple customer issues created for this problem. Do we have any update on it?
Hi, the timeout fix has been released in the latest GA version: @azure/storage-blob@12.12.0: https://www.npmjs.com/package/@azure/storage-blob/v/12.12.0
@EmmaZhu bumping this up as it is affecting more and more customers. The worst part is that the download gets stuck forever, so the GitHub Actions workflow run gets stuck as well. Since run minutes are billed, this is causing a lot of concern.
There are two issues in play here:
@xirzec - I used v12.10.0 and I am able to reproduce the same issue.
Logs here - https://github.com/kotewar/cron-action-test-download-bug/runs/7047939232?check_suite_focus=true
Hi, I’m having the same issue using the Python Azure SDK. Downloads of rather large files (several GBs) sometimes get stuck (we run 48 downloads per day, and in 10–15% of tries a download just gets stuck) and never time out, even though I have set the timeout. It looks like it gets stuck close to the end of the file it is downloading. Here is the part of the log showing how it gets stuck:
Hi @EmmaZhu, thanks for the update.
I have been testing the changes extensively in a workflow with this latest version. Although I’ve seen some improvements, we haven’t seen consistency in the downloads. I’ve shared my observations here.
To summarise, out of the multiple runs we did in the past couple of weeks, we saw:
All this despite specifying a timeout of 30000 ms.
Can the team please take another look at what might be missing?
In case it is helpful, you can find many examples of this error occurring within the failures of this workflow. All or nearly all failures that show a runtime of
6h Xm Ys
are due to this problem. There are roughly 40 examples in the last week.
For reference, https://github.com/actions/toolkit/blob/main/packages/cache/src/internal/downloadUtils.ts#L212 shows how
actions/cache
is using the Azure SDK. Note it is passing in a timeout value which is set to
30000
(30 seconds) by default. From the docs:
And from the screenshots provided in the earlier comments, it appears this value isn’t having the intended effect: we would expect the stuck request to time out, possibly with a few retries, much faster than what is seen.
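One possible explanation (an assumption on my part, not a confirmed diagnosis of the actions/cache code): a `Promise.race`-style timeout only rejects the caller's promise; unless an `AbortSignal` is plumbed through to the underlying operation, the stalled request itself keeps running. A sketch of the difference, using a fake download in place of the SDK call:

```typescript
// Sketch (assumed behavior, not the actual actions/cache implementation).
// fakeDownload simulates a stalled request that only stops if it observes
// an abort signal.
function fakeDownload(signal?: AbortSignal): Promise<string> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve("done"), 1000); // simulates a long stall
    signal?.addEventListener("abort", () => {
      clearTimeout(timer);
      reject(new Error("aborted"));
    });
  });
}

async function raceOnly(): Promise<string> {
  // The caller sees a timeout, but fakeDownload itself keeps running.
  return Promise.race([
    fakeDownload(),
    new Promise<string>((_, reject) =>
      setTimeout(() => reject(new Error("timeout")), 50)
    ),
  ]);
}

async function withAbort(): Promise<string> {
  // The timeout also cancels the operation itself via an AbortSignal.
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 50);
  try {
    return await fakeDownload(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}
```

If the configured 30-second value only guards the outer promise and is never propagated as an abort signal to the SDK request, the observed behavior (a hang far longer than the timeout) would be consistent with the `raceOnly` shape.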
Moving this back over, I got confused by the linked repro having ‘java-example’ in the name and thought this had been misfiled.
@kotewar I assume the package you are using is actually
@azure/storage-blob
? It looks like your version is slightly out of date; can you reproduce this issue with the latest (12.10.0)?
Btw, sometimes when the download is stuck I also see the following:
[connectionpool] [INFO] Resetting dropped connection: <hostname>
which probably means that Azure is dropping connections for some reason.
@ilyadinaburg, thanks for bringing this up. Until today I was assuming that this issue only affects the JavaScript SDK. @EmmaZhu, can you please loop the storage team into this issue? It seems to be getting more and more widespread, and could be related to backend throttling that we might not be aware of. 🤔
To clarify, this is for the workaround (timeout) fix, correct?
Any update on a root cause fix?
@EmmaZhu When can we expect this fix to be released?
@EmmaZhu this would qualify for a hot-patch. Can we take that route? We need a quick resolution.
@EmmaZhu, as I mentioned, the failure doesn’t always happen, so if you check the successful jobs in the same run, you’ll notice that the same file or segment where it got stuck was downloaded successfully by other jobs. So yes, this might take time for you to reproduce, but one observation from our end has been the following:
This issue mostly occurs when the same file is getting downloaded in parallel by multiple jobs
Not sure if our finding helps you, but I thought I’d bring it up here in case it gives you a clue.
I am attaching the whole workflow logs and the failure execution logs. We had enabled debugging in the Azure SDK to show the download progress with headers and responses. Hope this helps: logs_83.zip Failure logs.txt