aws-sdk-js-v3: lib-storage Upload does not retry if one part fails.

Describe the bug

I’m using Upload from lib-storage to get advantage of multipart uploads regarding large files. I would expect it to retry in case any of failure uploading any part. Imagine we have a 5TB file that is splitted in 5MB parts. That would result in many, many parts, so the chance of individual failure is high.

There is the leavePartsOnError option. Setting it to true interrupts the whole upload if any part fails.

Your environment

SDK version number

@aws-sdk/lib-storage@3.13.1

Is the issue in the browser/Node.js/ReactNative?

Browser

Details of the browser/Node.js/ReactNative version

  Browsers:
    Chrome: 90.0.4430.85
    Firefox: 78.0.2
    Safari: 14.0.3

Steps to reproduce

Upload a file using Upload, exactly like in the example:

       try {
          const upload = new Upload({
            client: s3,
            params: uploadParams,
            leavePartsOnError: false
          })

          upload.on('httpUploadProgress', progress => {
            console.log(progress)
          })

          const uploadResult = await upload.done()
          console.log('done uploading', uploadResult)
        }
        catch (err) {
          showErrors('There was an error uploading your photo: ', err)
        }

Observed behavior

If there are no errors with any part, the upload goes right. If there is an error in one or more parts, that part is skipped and the final file is built without those parts. This results in a corrupt file. This can be reproduced stopping the connection (turning WiFi off) or using browser DevTools to stop the connection.

Expected behavior

If a single part fails to upload, what is likely to happen with big files in mobile environments, just retry to send that part. This could be done for example here: https://github.com/aws/aws-sdk-js-v3/blob/28008c2a3e30cd234f447484d76777d5db56dad1/lib/lib-storage/src/Upload.ts#L124-L130

Do not send the final build command if any part is missing: https://github.com/aws/aws-sdk-js-v3/blob/28008c2a3e30cd234f447484d76777d5db56dad1/lib/lib-storage/src/Upload.ts#L152-L162

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Reactions: 8
  • Comments: 19 (2 by maintainers)

Commits related to this issue

Most upvoted comments

If I understand it correctly, the Upload module is designed so that (with the default behavior leavePartsOnError=false) if one part fails to upload, the whole upload will finish but without that part, creating a corrupt file and the promise will resolve with success. To me this seems like a very scary default behavior, and I would think that leavePartsOnError=true would be a safer default, or am I missing something?

Update: If I use leavePartsOnError: true, it actually retries before failing with @aws-sdk/client-s3 3.36.0 when using the S3-client from @aws-sdk/client-s3, due to the built in retry strategy in the client.

Retry uploads of failed multiparts apparently was a feature of AWS SDK v2. Now with lib-storage it doesn’t have it, so it’s a regression.

Also, here’s what AWS recommends:

Enable a retry mechanism in the application making requests

Because of the distributed nature of Amazon S3, requests that return 500 or 503 errors can be retried. It's a best practice to build retry logic into applications that make requests to Amazon S3. We recommend that you set the retry count for your applications to 10.

All AWS SDKs have a built-in retry mechanism with an algorithm that uses exponential backoff. This algorithm implements increasingly longer wait times between retries for consecutive error responses. Most exponential backoff algorithms use jitter (randomized delay) to prevent successive collisions. For more information, see Error Retries and Exponential Backoff in AWS.

There are 500 errors from time to time when uploading multi-part files and currently, none of JS AWS maintained libraries handle them properly. Quick search shows how common and frustrating those are with no solution offered: https://www.google.com/search?q=s3+multipart+upload+error+500

@alvaromat I would argue not to use navigator.onLine, as then it would leave ability to use this code in different environments, not just browser (next.js, react-native(web), node, etc.)

All that’s needed to retry is just retry failed parts for maxRetryAttempts with a delay of a second or so… Since it’s retrying individual parts to recover from an occasional error 500, or to survive a short drop of connectivity, the exponential backoff is really not required.

Assigning @ajredniwja to support the discussion