dvc: ETag mismatch on MinIO external dependency add.

Assuming a MinIO instance is set up on localhost:9000 with two buckets (dvc-cache and data), adding a file from the data bucket as an external dependency fails with an ETag mismatch error.

Example:

#!/bin/bash

rm -rf repo
mkdir repo

pushd repo
git init --quiet
dvc init -q

export AWS_ACCESS_KEY_ID="minioadmin"
export AWS_SECRET_ACCESS_KEY="minioadmin"  

dvc remote add s3cache s3://dvc-cache/cache
dvc config cache.s3 s3cache
dvc remote modify s3cache endpointurl http://localhost:9000
dvc remote modify s3cache use_ssl False

dvc remote add miniodata s3://data
dvc remote modify miniodata endpointurl http://localhost:9000
dvc remote modify miniodata use_ssl False

dvc add remote://miniodata/file

This results in:

ERROR: ETag mismatch detected when copying file to cache! (expected: '4e102ec8d6ab714aae04d9e3c7b4c190-1', actual: 'ca9e5ed43f3fbee6edecbb5ac6fba77e-1')

Related: #2629 , #3441

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 29 (11 by maintainers)

Most upvoted comments

Here’s what I tried:

docker run -v $PWD:/data -p 9000:9000 minio/minio:RELEASE.2020-03-06T22-23-56Z server --compat /data

I still got a similar error:

dvc add remote://miniodata/shapes_128_128_20180914T1515
Adding...                                                                                                                                                                      
ERROR: ETag mismatch detected when copying file to cache! (expected: '7461e0bd8ff44f448d2659cc04f0f86a-1', actual: '098a1335754cabb3ec1ab403e4597be1-1')  

See https://github.com/minio/minio#caveats to understand more about why --compat might be needed here.

it should be handled as far as I remember:

No, it is not. Parts can be uploaded in this manner:

  • Part.1 - 5MiB
  • Part.2 - 6MiB
  • Part.3 - 1byte

This will result in an ETag of md5hex(md5(5MiB) + md5(6MiB) + md5(1byte))-3

Now suppose you only know that the object has 3 parts and a content length of roughly 11MiB. You have no way of knowing which sizes were used for the 1st and 2nd parts. If you happen to choose 5MiB for both, you will compute an incorrect ETag, which will not match. I can reproduce this right now with dvc using AWS S3. I assume this is not handled because it is a rare corner case; I am just clarifying it so that you are aware.
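The part-size dependence described above can be checked locally without any S3 access. The following is a minimal sketch, assuming the md5-of-md5s scheme quoted in this thread; `multipart_etag` is a hypothetical helper written for illustration, not a function from dvc, boto3, or MinIO.

```python
import hashlib
from typing import Sequence

MiB = 1024 * 1024


def multipart_etag(data: bytes, part_sizes: Sequence[int]) -> str:
    """Compute an S3-style multipart ETag for `data` split into `part_sizes`.

    Assumes the scheme described in the thread:
    hex(md5(md5(part1) + md5(part2) + ...)) + "-" + number_of_parts
    """
    if sum(part_sizes) != len(data):
        raise ValueError("part sizes must add up to the object size")
    digests = []
    offset = 0
    for size in part_sizes:
        # Binary (not hex) MD5 digest of each part, in upload order.
        digests.append(hashlib.md5(data[offset:offset + size]).digest())
        offset += size
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"


# An 11 MiB + 1 byte object, matching the 5MiB / 6MiB / 1 byte example.
data = bytes(range(256)) * (11 * MiB // 256) + b"\x00"

original = multipart_etag(data, [5 * MiB, 6 * MiB, 1])
guessed = multipart_etag(data, [5 * MiB, 5 * MiB, 1 * MiB + 1])

print(original)
print(guessed)
print(original == guessed)
```

Both values end in -3 and describe the same bytes, yet they differ, so a client that re-chunks the object with its own fixed part size cannot expect to reproduce the source ETag.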

It is still a gray area, since it’s not officially documented how the ETag is calculated for multipart uploads, but I think it is reasonable for our users to rely on that optimization for now. As @efiop mentioned, we can introduce BLAKE2 or something similar if Amazon at some point decides to change the logic behind it.

The multipart ETag is nothing but hexmd5(md5(part1) + md5(part2) + ...)-N. This is not documented in the AWS S3 docs; it was found while talking to AWS support.

Btw, I wonder whether s3.upload_part_copy first moves bytes to the local machine before sending them back to S3, or whether S3 has a way to copy remotely? If it pulls the data locally, we can potentially calculate the hash while we copy the data w/o affecting performance. In this case we would need S3 to support rename at least.

The server-side copy of parts is called CopyObjectPart(), which I can see you are using because the ETag has a -N suffix at the end.
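For a server-side part copy, each part is selected by an inclusive, zero-based byte range of the form `bytes=first-last` (the `CopySourceRange` parameter of boto3's `upload_part_copy`). Here is a small sketch of how those ranges follow from a list of part sizes; `copy_source_ranges` is a hypothetical helper, and how dvc actually chooses its part sizes is not shown in this thread.

```python
from typing import List, Sequence, Tuple

MiB = 1024 * 1024


def copy_source_ranges(part_sizes: Sequence[int]) -> List[Tuple[int, str]]:
    """Return (part_number, range_header) pairs for a server-side part copy.

    Part numbers start at 1; ranges are inclusive and zero-based.
    """
    ranges = []
    offset = 0
    for number, size in enumerate(part_sizes, start=1):
        if size <= 0:
            raise ValueError("every part must contain at least one byte")
        ranges.append((number, f"bytes={offset}-{offset + size - 1}"))
        offset += size
    return ranges


for number, byte_range in copy_source_ranges([5 * MiB, 6 * MiB, 1]):
    # Each pair would be passed as PartNumber / CopySourceRange.
    print(number, byte_range)
# 1 bytes=0-5242879
# 2 bytes=5242880-11534335
# 3 bytes=11534336-11534336
```

Since the bytes never leave the storage service, the client cannot hash them during the copy. The resulting ETag matches the source only if these ranges line up with the part boundaries used for the original upload.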

NOTE: This assumption will also fail for SSE-C encrypted objects, because AWS S3 doesn’t return a proper ETag for them, meaning an SSE-C object will change its ETag automatically upon an overwrite.

https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html

The entity tag is a hash of the object. The ETag reflects changes only to the contents 
of an object, not its metadata. The ETag may or may not be an MD5 digest of the
object data. Whether or not it is depends on how the object was created and how it
is encrypted as described below:

- Objects created by the PUT Object, POST Object, or Copy operation, or through the 
AWS Management Console, and are encrypted by SSE-S3 or plaintext, have ETags 
that are an MD5 digest of their object data.

- Objects created by the PUT Object, POST Object, or Copy operation, or through the 
AWS Management Console, and are encrypted by SSE-C or SSE-KMS, have 
ETags that are not an MD5 digest of their object data.

- If an object is created by either the Multipart Upload or Part Copy operation, 
the ETag is not an MD5 digest, regardless of the method of encryption.
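The rules quoted above suggest a cautious way to interpret an ETag before comparing it with anything. This is a rough sketch only; `describe_etag` and its labels are hypothetical, and a 32-character hex ETag is still not guaranteed to be an MD5 digest (for example with SSE-C or SSE-KMS).

```python
import re
from typing import Optional, Tuple

_PLAIN_RE = re.compile(r"[0-9a-f]{32}")
_MULTIPART_RE = re.compile(r"([0-9a-f]{32})-(\d+)")


def describe_etag(etag: str) -> Tuple[str, Optional[int]]:
    """Classify an ETag by its shape alone.

    Returns ("multipart", part_count), ("maybe-md5", None) or ("opaque", None).
    """
    # S3 returns ETags wrapped in double quotes; strip them if present.
    value = etag.strip().strip('"').lower()
    match = _MULTIPART_RE.fullmatch(value)
    if match:
        return ("multipart", int(match.group(2)))
    if _PLAIN_RE.fullmatch(value):
        # Shape is consistent with MD5, but encryption may make it opaque.
        return ("maybe-md5", None)
    return ("opaque", None)


# ETags taken from the error message at the top of this issue.
print(describe_etag("4e102ec8d6ab714aae04d9e3c7b4c190-1"))
print(describe_etag("ca9e5ed43f3fbee6edecbb5ac6fba77e-1"))
```

Both ETags from the report are classified as single-part multipart ETags, which is consistent with the copy going through the multipart code path even for a small file.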

@shcheklein, there’s an issue that was closed because this is how it is supposed to work on MinIO. See: https://github.com/minio/minio/issues/8012#issuecomment-519757286

We could, however, suggest that users run MinIO with --compat, or find a better mechanism than ETag.