moto: upload file with boto, download it with boto3: file gets corrupted (wrong md5 sum)

Hi,

The following code uploads a file to a mock S3 bucket using boto, and downloads the same file to the local disk using boto3. I apologize for bringing both of the libraries into this, but the code I am testing in real life still uses both (definitely trying to get rid of all the boto code and fully migrate to boto3 but that isn’t going to happen right away).

What happens is that the downloaded file does not have the same md5 sum as the original file, so it has been corrupted at some point (I am not sure whether during the boto upload or the boto3 download).

This seems to be an issue with moto because if I comment out the line @moto.mock_s3 (using ‘real’ S3) the script works fine (I also need to change the bucket name to a unique one to avoid collisions).

The script keeps looping (upload, download, md5sum comparison) until it fails, because in my real project the corruption does not happen every time. This test script, however, fails (for me anyway) on the first attempt every time.

The test file that it uploads/downloads can be fetched with:

curl -O https://s3-us-west-2.amazonaws.com/demonstrate-moto-problem/K158154-Mi001716_S1_L001_R1_001.fastq.gz

At this point if you run md5sum on it you should get 6083801a29ef4ebf78fbbed806e6ab2c:

$ md5sum K158154-Mi001716_S1_L001_R1_001.fastq.gz
6083801a29ef4ebf78fbbed806e6ab2c  K158154-Mi001716_S1_L001_R1_001.fastq.gz

Here is the test script (motoprob.py):

import sys
import os
import hashlib
import moto
import boto
import boto3

def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()



@moto.mock_s3
def doit():
    # upload file to s3
    conn = boto.connect_s3()
    bkt = conn.create_bucket("mybucket")
    key = boto.s3.key.Key(bkt)
    key.key = "foo/bar.fastq.gz"
    print("Uploading...")

    # You can get this file from:
    #  https://s3-us-west-2.amazonaws.com/demonstrate-moto-problem/K158154-Mi001716_S1_L001_R1_001.fastq.gz
    key.set_contents_from_filename("K158154-Mi001716_S1_L001_R1_001.fastq.gz")

    # download it again
    dlfile = "bar.fastq.gz"
    if os.path.exists(dlfile):
        os.remove(dlfile)

    print("Downloading...")

    client = boto3.client('s3')
    client.download_file(Bucket="mybucket",
      Key="foo/bar.fastq.gz", Filename="bar.fastq.gz")


    md5sum = md5(dlfile)
    if not md5sum == "6083801a29ef4ebf78fbbed806e6ab2c":
        print("Incorrect md5sum! {}".format(md5sum))
        sys.exit(1)


while True:
    doit()

Version info:

$ pip freeze |grep oto
boto==2.42.0
boto3==1.4.0
botocore==1.4.48
moto==0.4.29

$ python --version
Python 2.7.12

$ uname -a
Linux f51bec2ad3be 4.9.4-moby #1 SMP Wed Jan 18 17:04:43 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

$ more /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.1 LTS"

Other ways to see that the resulting file is not the same as the original:

$ diff bar.fastq.gz /data/2016-10-27-PT140/K158154-Mi001716_S1_L001_R1_001.fastq.gz
Binary files bar.fastq.gz and /data/2016-10-27-PT140/K158154-Mi001716_S1_L001_R1_001.fastq.gz differ


$ zcat bar.fastq.gz > bar.fastq # this works for the original file

gzip: bar.fastq.gz: invalid compressed data--crc error

gzip: bar.fastq.gz: invalid compressed data--length error
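To narrow down where the two files diverge, a small helper along these lines could report the first differing byte offset. This is only a diagnostic sketch: first_difference is a hypothetical name, not part of boto, boto3 or moto, and it uses nothing but the standard library.

```python
def first_difference(path_a, path_b, chunk_size=1024 * 1024):
    """Return the byte offset of the first difference, or None if identical."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                # Walk the mismatching chunk to find the exact byte.
                for i in range(min(len(a), len(b))):
                    if a[i:i + 1] != b[i:i + 1]:
                        return offset + i
                # Common prefix matches, so one file simply ended early.
                return offset + min(len(a), len(b))
            if not a:
                # Both files reached EOF without any mismatch.
                return None
            offset += len(a)
```

Calling first_difference("bar.fastq.gz", "K158154-Mi001716_S1_L001_R1_001.fastq.gz") and comparing the result with the two file sizes should show whether the download is truncated or whether data in the middle of the file differs.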

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 16 (13 by maintainers)

Most upvoted comments

There have been a few improvements in how we handle md5sums/etags since 2020 - is anyone still running into issues using the latest version of Moto?

Turns out that disabling multi-threading in managed transfer methods is quite easy. With this change in moto, your original code works fine:

diff --git a/moto/core/models.py b/moto/core/models.py
index 60e744f..03d1390 100644
--- a/moto/core/models.py
+++ b/moto/core/models.py
@@ -4,6 +4,7 @@ import functools
 import inspect
 import re
 
+import boto3
 from httpretty import HTTPretty
 from .responses import metadata_response
 from .utils import convert_regex_to_flask_path
@@ -11,6 +12,7 @@ from .utils import convert_regex_to_flask_path
 
 class MockAWS(object):
     nested_count = 0
+    original_create_transfer_manager = None
 
     def __init__(self, backends):
         self.backends = backends
@@ -38,6 +40,15 @@ class MockAWS(object):
         if not HTTPretty.is_enabled():
             HTTPretty.enable()
 
+        if self.__class__.original_create_transfer_manager is None:
+            boto3.client('s3') # Ensure that boto3.s3 exists
+            original_create_transfer_manager = boto3.s3.transfer.create_transfer_manager
+            self.__class__.original_create_transfer_manager = original_create_transfer_manager
+            def patched_create_transfer_manager(client, config, *args, **kwargs):
+                config.use_threads = False
+                return original_create_transfer_manager(client, config, *args, **kwargs)
+            boto3.s3.transfer.create_transfer_manager = patched_create_transfer_manager
+
         for method in HTTPretty.METHODS:
             backend = list(self.backends.values())[0]
             for key, value in backend.urls.items():
@@ -63,6 +74,8 @@ class MockAWS(object):
         if self.__class__.nested_count == 0:
             HTTPretty.disable()
             HTTPretty.reset()
+            boto3.s3.transfer.create_transfer_manager = self.__class__.original_create_transfer_manager
+            self.__class__.original_create_transfer_manager = None
 
     def decorate_callable(self, func, reset):
         def wrapper(*args, **kwargs):

@spulec: Should I create a pull request for that or do you see a more elegant way of doing this?
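For anyone who cannot patch moto, the same switch can probably be flipped from the calling code instead. The sketch below assumes a boto3 version whose TransferConfig accepts use_threads (the attribute the patch above sets); download_single_threaded is a hypothetical helper name, and I have not verified that this alone avoids the corruption.

```python
def download_single_threaded(client, bucket, key, filename):
    """Download an S3 object via boto3's managed transfer with threads disabled."""
    # Imported here to keep the boto3 dependency local to this helper.
    from boto3.s3.transfer import TransferConfig

    # use_threads=False is the same setting the moto patch forces on the config.
    config = TransferConfig(use_threads=False)
    client.download_file(Bucket=bucket, Key=key, Filename=filename, Config=config)
```

In the test script this would replace the client.download_file(...) call, e.g. download_single_threaded(client, "mybucket", "foo/bar.fastq.gz", "bar.fastq.gz").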