multicodec: Proposal: s3-sha2-256-x for Amazon's parallel checksums

We would like to use a multihash for the new Amazon S3 Checksums, specifically their SHA-256 variant. Is this the right place and way to request a new prefix for that?

These differ from traditional SHA-256 in that they are computed in parallel using a variable block size. Being Amazon, some of their SDKs use 8 MB blocks and others use 5 MB (though the size is user-configurable).

Specifically, I would like to propose (by analogy with dbl-sha2-256 and ssz-sha2-256-bmt):

s3-sha2-256-5,                   multihash,      0xb515,           draft,
s3-sha2-256-8,                   multihash,      0xb518,           draft,

This could be used both for the native S3 algorithm, as well as compatible implementations (such as ours).

I believe this satisfies the “two implementations” requirement for multiformats registration. Is that sufficient? Should I just create a PR, or is there a more formal process?

Thanks!

About this issue

  • State: closed
  • Created 5 months ago
  • Comments: 33 (22 by maintainers)

Most upvoted comments

This issue is closed by https://github.com/multiformats/multicodec/pull/343. Thanks to everyone involved, and for being patient enough to really get to the root of this.

The chunk size starts at 8 MiB, but the number of chunks is limited to 10k, so if the total size is > 8 MiB * 10k, we double the chunk size until it fits, so that the chunks are uniform (except the last one) and don’t exceed the selected chunk size.
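
For reference, a minimal sketch of that selection rule (Python; the function name and defaults are my own, not the SDK’s actual API):

def select_chunk_size(total_size, initial_chunk_size=8 * 1024 * 1024, max_chunks=10_000):
    # Double the chunk size until the file fits into at most `max_chunks`
    # uniform chunks (the last chunk may be smaller).
    chunk_size = initial_chunk_size
    while total_size > chunk_size * max_chunks:
        chunk_size *= 2
    return chunk_size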

Thanks for the explanation (sorry, I should have read https://github.com/quiltdata/quilt/blob/s3_sha256/docs/PARALLEL_CHECKSUMS.md more carefully).

What about something like “Hash of concatenated SHA2-256 digests of 8*2^n MiB source chunks (n = data_size / 10000)”?
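
Roughly, that description corresponds to the following sketch (illustrative Python only; the names and chunk handling are assumptions, and this is not claimed to match S3’s behavior for single-part objects):

import hashlib

def composite_sha256(data, chunk_size):
    # SHA-256 of the concatenated SHA-256 digests of fixed-size chunks.
    digests = b"".join(
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    )
    return hashlib.sha256(digests).digest()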

Wow, thank you for diving into this so deeply! I really appreciate it.

Let me explain our use case, and hopefully you can tell me if there’s a better option.

Our product creates “manifests” that associate an S3 URI with a content-hash. We need to support different types of content hashes, which must be clearly labeled so customers can independently verify that the URI matches its hash.

We want to use multihash encodings so customers have a consistent way to interpret and execute hashing from a simple string (versus a complex struct with a separate, non-standard type field). Note that these manifests are archival documents that are shared throughout our ecosystem, e.g. potentially in FDA filings, so it is important that they be globally interpretable.
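
(For context: a multihash is just <varint code><varint digest length><digest bytes>, so the labeled hash really is one self-describing string. A minimal hand-rolled sketch, using the already-registered sha2-256 code 0x12:)

import hashlib

def uvarint(n):
    # Unsigned LEB128 varint, as used by multiformats.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def multihash(code, digest):
    return uvarint(code) + uvarint(len(digest)) + digest

mh = multihash(0x12, hashlib.sha256(b"example data").digest())
print(mh.hex())  # starts with "1220": code 0x12, digest length 0x20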

The specific algorithm we are using is probably better characterized as:

s3-sha2-256-boto, multihash, 0xb511, draft, Boto3-compatible implementation of Amazon S3's parallel SHA2-256 checksums

That is, we want to signal that this is encoded using the precise block-chaining strategy used by Amazon’s boto3 Python library.
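
For concreteness, those block-size defaults are exposed through boto3’s TransferConfig; the 8 MiB values below are the library defaults, and pinning them down is exactly what the proposed entry is meant to do (shown only for reference):

from boto3.s3.transfer import TransferConfig

# boto3 splits uploads above an 8 MiB threshold into 8 MiB parts by default;
# this is the block size the proposed s3-sha2-256-boto entry would pin down.
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,
    multipart_chunksize=8 * 1024 * 1024,
)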

Can you suggest an alternate way to satisfy that use case?

Thanks @sir-sigurd for the information; I don’t have much AWS knowledge.

Let me try it again, this time making sure it’s a multipart upload.

Allocate a file > 32MiB:

$ fallocate --length 34567890 mydata.dat

Upload the file through the AWS console in the browser (as it’s easier than from the CLI). Then get the SHA-256 checksum of it:

$ aws s3api get-object-attributes --bucket <your-bucket> --key mydata.dat --object-attributes 'Checksum'
{
    "LastModified": "2024-02-16T10:27:10+00:00",
    "Checksum": {
        "ChecksumSHA256": "eS1aSUoSnbLv53dDOSSjmhilAUkzfJsEiZKg3+lCjBc="
    }
}

Check what the SHA-256 hash of the local file is:

$ sha256sum mydata.dat | cut --bytes -64| xxd --revert --plain | base64
kcqYuUA3ieUC27HjC5ikS8/J5Av3dPbqjhqveNtOWXs=
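
(That pipeline only re-encodes the hex digest as base64, since S3 reports checksums base64-encoded. A Python equivalent, for the skeptical:)

import base64, hashlib

with open("mydata.dat", "rb") as f:
    digest = hashlib.sha256(f.read()).digest()
print(base64.b64encode(digest).decode())  # same value as the shell pipeline above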

They don’t match. Can we reproduce that hash somehow? Let’s try the multipart upload with the CLI again.

Start a multipart upload, with SHA-256 checksums enabled:

$ aws s3api create-multipart-upload --bucket <your-bucket> --key mydata.dat --checksum-algorithm SHA256
{
    "ServerSideEncryption": "AES256",
    "ChecksumAlgorithm": "SHA256",
    "Bucket": <your-bucket>",
    "Key": "mydata.dat",
    "UploadId": "yUqjwbmY8qCHkR6vkIK3rnebBpcg0xpQOCS5oypnbXxIS.Z9UC2UbFWgUEZL6dbSFjlX4_1k30pg6yFrJU5P3jPUKuZbIYH352Ws9FfYBHr4oeDn4hIzPB9ORD24omGL"
}

Split the original file into two pieces named mydata_01.dat and mydata_02.dat:

$ split --bytes 23456789 --numeric-suffixes=1 --additional-suffix '.dat' mydata.dat mydata_

Let’s check the SHA-256 hashes of the individual parts:

$ sha256sum mydata_01.dat | cut --bytes -64| xxd --revert --plain | base64
34PAmeNpBEbY7kQBvFK+JY2mtu2BkoieWIKF0zz7vrY=
$ sha256sum mydata_02.dat | cut --bytes -64| xxd --revert --plain | base64
JYEm3isfHSWZVlgLxgQyp670tv7Zq6e1OGC41HVxOgY=

Upload the first part:

$ aws s3api upload-part --bucket <your-bucket> --key mydata.dat --part-number 1 --body mydata_01.dat --upload-id 'yUqjwbmY8qCHkR6vkIK3rnebBpcg0xpQOCS5oypnbXxIS.Z9UC2UbFWgUEZL6dbSFjlX4_1k30pg6yFrJU5P3jPUKuZbIYH352Ws9FfYBHr4oeDn4hIzPB9ORD24omGL' --checksum-algorithm SHA256
{
    "ServerSideEncryption": "AES256",
    "ETag": "\"47b18dae3b998a0e8ea25df8556e90e2\"",
    "ChecksumSHA256": "34PAmeNpBEbY7kQBvFK+JY2mtu2BkoieWIKF0zz7vrY="
}

And the second part:

$ aws s3api upload-part --bucket <your-bucket> --key mydata.dat --part-number 2 --body mydata_02.dat --upload-id 'yUqjwbmY8qCHkR6vkIK3rnebBpcg0xpQOCS5oypnbXxIS.Z9UC2UbFWgUEZL6dbSFjlX4_1k30pg6yFrJU5P3jPUKuZbIYH352Ws9FfYBHr4oeDn4hIzPB9ORD24omGL' --checksum-algorithm SHA256
{
    "ServerSideEncryption": "AES256",
    "ETag": "\"b677de10e5b49d46d81948168383c3dc\"",
    "ChecksumSHA256": "JYEm3isfHSWZVlgLxgQyp670tv7Zq6e1OGC41HVxOgY="
}

In order to complete the upload we need to create a JSON file which contains the parts:

echo '{"Parts":[{"ChecksumSHA256":"34PAmeNpBEbY7kQBvFK+JY2mtu2BkoieWIKF0zz7vrY=","ETag":"47b18dae3b998a0e8ea25df8556e90e2","PartNumber":1},{"ChecksumSHA256":"JYEm3isfHSWZVlgLxgQyp670tv7Zq6e1OGC41HVxOgY=","ETag":"b677de10e5b49d46d81948168383c3dc","PartNumber":2}]}' > mydata_parts.json

Now finish the multipart upload:

$ aws s3api complete-multipart-upload --multipart-upload file://mydata_parts.json --bucket <your-bucket> --key mydata.dat --upload-id 'yUqjwbmY8qCHkR6vkIK3rnebBpcg0xpQOCS5oypnbXxIS.Z9UC2UbFWgUEZL6dbSFjlX4_1k30pg6yFrJU5P3jPUKuZbIYH352Ws9FfYBHr4oeDn4hIzPB9ORD24omGL'
{
    "ServerSideEncryption": "AES256",
    "Location": "https://<your-bucket>.s3.eu-north-1.amazonaws.com/mydata.dat",
    "Bucket": <your-bucket>,
    "Key": "mydata.dat",
    "ETag": "\"ad31d1c7112c160438ff0ea063ec2c75-2\"",
    "ChecksumSHA256": "XtqASA/PuE6k06Ccd7uxnoykSwypoulhrxFkXi2Y1qY=-2"
}

Get the checksum again to double-check:

$ aws s3api get-object-attributes --bucket <your-bucket> --key mydata.dat --object-attributes 'Checksum'
{
    "LastModified": "2024-02-16T11:19:25+00:00",
    "Checksum": {
        "ChecksumSHA256": "XtqASA/PuE6k06Ccd7uxnoykSwypoulhrxFkXi2Y1qY="
    }
}

That is different from the SHA-256 hash of the file:

$ sha256sum mydata.dat | cut --bytes -64| xxd --revert --plain | base64
kcqYuUA3ieUC27HjC5ikS8/J5Av3dPbqjhqveNtOWXs=

So the individual parts match, but the hash of the whole file does not. Let’s see if the hash of the file changes if we upload the parts in reverse order.

$ aws s3api create-multipart-upload --bucket <your-bucket> --key mydata_reverse.dat --checksum-algorithm SHA256
{
    "ServerSideEncryption": "AES256",
    "ChecksumAlgorithm": "SHA256",
    "Bucket": "<your-bucket>",
    "Key": "mydata_reverse.dat",
    "UploadId": "LDBFP0WvRixFCm9mcwgWUa7xlcG87hRgg2GhmQKwEPqQJHfmmzjtzgOmi7kY5YFqqFHLbOAmr01LbUV4oE38DWyWWafJGS6e5bblaedhNLvDvmpyFwQjbGNdpg8A_jwI"
}
$ aws s3api upload-part --bucket <your-bucket> --key mydata_reverse.dat --part-number 1 --body mydata_02.dat --upload-id 'LDBFP0WvRixFCm9mcwgWUa7xlcG87hRgg2GhmQKwEPqQJHfmmzjtzgOmi7kY5YFqqFHLbOAmr01LbUV4oE38DWyWWafJGS6e5bblaedhNLvDvmpyFwQjbGNdpg8A_jwI' --checksum-algorithm SHA256
{
    "ServerSideEncryption": "AES256",
    "ETag": "\"b677de10e5b49d46d81948168383c3dc\"",
    "ChecksumSHA256": "JYEm3isfHSWZVlgLxgQyp670tv7Zq6e1OGC41HVxOgY="
}
$ aws s3api upload-part --bucket <your-bucket> --key mydata_reverse.dat --part-number 2 --body mydata_01.dat --upload-id 'LDBFP0WvRixFCm9mcwgWUa7xlcG87hRgg2GhmQKwEPqQJHfmmzjtzgOmi7kY5YFqqFHLbOAmr01LbUV4oE38DWyWWafJGS6e5bblaedhNLvDvmpyFwQjbGNdpg8A_jwI' --checksum-algorithm SHA256
{
    "ServerSideEncryption": "AES256",
    "ETag": "\"47b18dae3b998a0e8ea25df8556e90e2\"",
    "ChecksumSHA256": "34PAmeNpBEbY7kQBvFK+JY2mtu2BkoieWIKF0zz7vrY="
}
$ echo '{"Parts":[{"ChecksumSHA256":"JYEm3isfHSWZVlgLxgQyp670tv7Zq6e1OGC41HVxOgY=","ETag":"b677de10e5b49d46d81948168383c3dc","PartNumber":1},{"ChecksumSHA256":"34PAmeNpBEbY7kQBvFK+JY2mtu2BkoieWIKF0zz7vrY=","ETag":"47b18dae3b998a0e8ea25df8556e90e2","PartNumber":2}]}' > mydata_reverse_parts.json
$ aws s3api complete-multipart-upload --multipart-upload file://mydata_reverse_parts.json --bucket <your-bucket> --key mydata_reverse.dat --upload-id 'LDBFP0WvRixFCm9mcwgWUa7xlcG87hRgg2GhmQKwEPqQJHfmmzjtzgOmi7kY5YFqqFHLbOAmr01LbUV4oE38DWyWWafJGS6e5bblaedhNLvDvmpyFwQjbGNdpg8A_jwI'
{
    "ServerSideEncryption": "AES256",
    "Location": "https://<your-bucket>.s3.eu-north-1.amazonaws.com/mydata_reverse.dat",
    "Bucket": <your-bucket>",
    "Key": "mydata_reverse.dat",
    "ETag": "\"1f4348ad0ad0456aa78ae6e3d1717dc6-2\"",
    "ChecksumSHA256": "AonVPvSgK/alROsay8U8FU7XCNoccOG6CdgJObEgZ2E=-2"
}
$ aws s3api get-object-attributes --bucket <your-bucket> --key mydata_reverse.dat --object-attributes 'Checksum'
{
    "LastModified": "2024-02-16T12:47:27+00:00",
    "Checksum": {
        "ChecksumSHA256": "AonVPvSgK/alROsay8U8FU7XCNoccOG6CdgJObEgZ2E="
    }
}
$ cat mydata_02.dat mydata_01.dat | sha256sum | cut --bytes -64| xxd --revert --plain | base64
kcqYuUA3ieUC27HjC5ikS8/J5Av3dPbqjhqveNtOWXs=

This means that the checksum of a file uploaded in multiple parts depends on how the file is split: the exact same bytes (of the final file) can end up with different hashes. It also means the mismatch is not caused by some internal block size of AWS’s SHA-256 implementation; the SHA-256 hashes of the individual parts are exactly as expected.

The last open question for me is: how are those hashes of the full files calculated, then? I suspect it’s the SHA-256 of the concatenated hashes of the parts; let’s try that:

$ (sha256sum mydata_01.dat | cut --bytes -64| xxd --revert --plain; sha256sum mydata_02.dat | cut --bytes -64| xxd --revert --plain ) | sha256sum | cut --bytes -64| xxd --revert --plain | base64
XtqASA/PuE6k06Ccd7uxnoykSwypoulhrxFkXi2Y1qY=
$ (sha256sum mydata_02.dat | cut --bytes -64| xxd --revert --plain; sha256sum mydata_01.dat | cut --bytes -64| xxd --revert --plain ) | sha256sum | cut --bytes -64| xxd --revert --plain | base64            
AonVPvSgK/alROsay8U8FU7XCNoccOG6CdgJObEgZ2E=

Yes, that’s exactly it.
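
The same check in Python, assuming the two part files from above are still around:

import base64, hashlib

def digest_of(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).digest()

parts = [digest_of("mydata_01.dat"), digest_of("mydata_02.dat")]
composite = hashlib.sha256(b"".join(parts)).digest()
print(base64.b64encode(composite).decode())  # XtqASA/PuE6k06Ccd7uxnoykSwypoulhrxFkXi2Y1qY=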

To conclude: I don’t think we should introduce a new multihash code, as the AWS SHA-256 hashing is algorithmically identical to the existing SHA-256 hash we already have.

~The 1-2 MB limit is somewhat artificial and doesn’t apply to most places where IPLD blocks are used; I think bitswap is the only place where block sizes are really constrained. Plus, it’s not really a big concern for multicodecs because we don’t assume that you’re going to be passing them through IPFS systems or even using them in CIDs.~ (see below, not my beef)

@drernie I think a pull request should be fine for these; what you say here sounds reasonable. One of us will probably have a look in a bit more detail just to sanity-check that it’s logical for them to have separate entries.