multicodec: Proposal: s3-sha2-256-x for Amazon's parallel checksums
We would like to use a multihash for the new Amazon S3 Checksums, specifically their SHA-256 variant. Is this the right place and way to request a new prefix for that?
These differ from traditional SHA-256 in that they are computed in parallel using a variable block size. Being Amazon, some of their SDKs use 8 MB and others use 5 MB (though the size is user-configurable).
Specifically, I would like to propose (by analogy with dbl-sha2-256 and ssz-sha2-256-bmt):
s3-sha2-256-5, multihash, 0xb515, draft,
s3-sha2-256-8, multihash, 0xb518, draft,
This could be used both for the native S3 algorithm, as well as compatible implementations (such as ours).
I believe this satisfies the “two implementations” requirement for multiformats registration. Is that sufficient? Should I just create a PR, or is there a more formal process?
Thanks!
This issue is closed by https://github.com/multiformats/multicodec/pull/343. Thanks to everyone involved for being patient enough to really get to the root of this.
Thanks for the explanation (sorry I should’ve read https://github.com/quiltdata/quilt/blob/s3_sha256/docs/PARALLEL_CHECKSUMS.md carefully).
What about something like “Hash of concatenated SHA2-256 digests of 8*2^n MiB source chunks (n = data_size / 10000)”?
Wow, thank you for diving into this so deeply! I really appreciate it.
Let me explain our use case, and hopefully you can tell me if there’s a better option.
Our product creates “manifests” that associate an S3 URI with a content-hash. We need to support different types of content hashes, which must be clearly labeled so customers can independently verify that the URI matches its hash.
We want to use multihash encodings so customers have a consistent way to interpret and execute hashing as a simple string (versus a complex struct with a separate, non-standard type field). Note that these manifests are archival documents that are shared throughout our ecosystem, e.g. potentially in FDA filings, so it is important that they be globally interpretable.
The specific algorithm we are using is probably better characterized as:
That is, we want to signal that this is encoded using the precise block-chaining strategy used by Amazon’s boto3 Python library.
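For concreteness, here is a rough shell sketch of that strategy as I understand it, using a hypothetical local file mydata.dat and the 8 MiB part size that boto3 uses by default: split the source into fixed-size chunks, SHA-256 each chunk, then SHA-256 the concatenation of the raw digests.

```bash
# split into 8 MiB chunks (boto3's default part size), SHA-256 each chunk,
# then SHA-256 the concatenated raw digests; base64 to match S3's encoding
mkdir -p chunks && split -b 8M mydata.dat chunks/chunk_
for c in chunks/chunk_*; do openssl dgst -sha256 -binary "$c"; done \
  | openssl dgst -sha256 -binary | base64
```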
Can you suggest an alternate way to satisfy that use case?
Thanks @sir-sigurd for the information; I don’t have much AWS knowledge.
Let me try it again, this time making sure it’s a multipart upload.
Allocate a file > 32MiB:
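Something like this (the file name mydata.dat and the 33 MiB size are arbitrary; anything over 32 MiB works):

```bash
# 33 MiB of random data
dd if=/dev/urandom of=mydata.dat bs=1M count=33
```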
Upload the file through the AWS console in the browser (as it’s easier than from the CLI). Then get the SHA-256 checksum of it:
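For example (my-bucket and mydata.dat are placeholders for wherever the console upload ended up):

```bash
aws s3api head-object \
  --bucket my-bucket --key mydata.dat \
  --checksum-mode ENABLED
# the response includes the base64-encoded ChecksumSHA256 that S3 stores for the object
```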
Check what the SHA-256 hash of the local file is:
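Locally, for comparison (sha256sum prints hex, while S3 reports checksums base64-encoded, hence the second form):

```bash
sha256sum mydata.dat                               # hex digest
openssl dgst -sha256 -binary mydata.dat | base64   # same digest, base64-encoded
```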
They don’t match. Can we reproduce that hash somehow? Let’s try the multipart upload with the CLI again.
Start a multipart upload, with SHA-256 checksums enabled:
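Something along these lines (same placeholder bucket and key as above):

```bash
aws s3api create-multipart-upload \
  --bucket my-bucket --key mydata.dat \
  --checksum-algorithm SHA256
# note the UploadId in the response; the following commands need it
```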
Split the original file into two pieces named mydata_01.dat and mydata_02.dat. Let’s check the SHA-256 hashes of the individual parts:
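For example, splitting at an arbitrary 16 MiB boundary (each non-final part must be at least 5 MiB):

```bash
dd if=mydata.dat of=mydata_01.dat bs=1M count=16
dd if=mydata.dat of=mydata_02.dat bs=1M skip=16
# per-part digests, base64-encoded for comparison with what S3 reports
openssl dgst -sha256 -binary mydata_01.dat | base64
openssl dgst -sha256 -binary mydata_02.dat | base64
```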
Upload the first part:
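Roughly ($UPLOAD_ID stands for the UploadId returned by create-multipart-upload):

```bash
aws s3api upload-part \
  --bucket my-bucket --key mydata.dat \
  --part-number 1 --body mydata_01.dat \
  --upload-id "$UPLOAD_ID" --checksum-algorithm SHA256
# the response includes the part's ETag and ChecksumSHA256
```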
And the second part:
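Same command, second part:

```bash
aws s3api upload-part \
  --bucket my-bucket --key mydata.dat \
  --part-number 2 --body mydata_02.dat \
  --upload-id "$UPLOAD_ID" --checksum-algorithm SHA256
```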
In order to complete the upload we need to create a JSON file which contains the parts:
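Something like this, where the ETag and ChecksumSHA256 values are placeholders for the values returned by the two upload-part calls:

```bash
cat > parts.json <<'EOF'
{
  "Parts": [
    {"PartNumber": 1, "ETag": "<etag-of-part-1>", "ChecksumSHA256": "<sha256-of-part-1>"},
    {"PartNumber": 2, "ETag": "<etag-of-part-2>", "ChecksumSHA256": "<sha256-of-part-2>"}
  ]
}
EOF
```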
Now finish the multipart upload:
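Using the placeholders from above:

```bash
aws s3api complete-multipart-upload \
  --bucket my-bucket --key mydata.dat \
  --upload-id "$UPLOAD_ID" \
  --multipart-upload file://parts.json
```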
Get the checksum again to double-check:
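For example via get-object-attributes, which also lists the per-part checksums:

```bash
aws s3api get-object-attributes \
  --bucket my-bucket --key mydata.dat \
  --object-attributes Checksum ObjectParts
# Checksum.ChecksumSHA256 is the full-object value; ObjectParts has the per-part values
```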
That is different from the SHA-256 hash of the local file we computed earlier.
So the individual parts match, but the hash of the whole file does not. Let’s see if the hash of the file changes if we upload the parts in reverse order.
This means that the checksum of a file uploaded in multiple parts depends on how the file was split: the exact same bytes (of the final file) can have different checksums. It also means that no special block size is involved in the AWS SHA-256 implementation itself; the SHA-256 hashes of the individual parts come out exactly as expected.
The last open question left for me is: how are those hashes of the full files calculated then? I suspect it’s a concatenation of the hashes of the parts, let’s try that:
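A quick way to check that locally (base64-encoded, to match what S3 reports):

```bash
# SHA-256 over the concatenation of the two parts' raw SHA-256 digests
cat <(openssl dgst -sha256 -binary mydata_01.dat) \
    <(openssl dgst -sha256 -binary mydata_02.dat) \
  | openssl dgst -sha256 -binary | base64
```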
Yes, that’s exactly it.
To conclude: I don’t think we should introduce a new multihash code, as the AWS SHA-256 hashing is algorithmically identical to the existing SHA-256 hash we already have.
~The 1–2 MB limit is somewhat artificial and doesn’t apply to most places that IPLD blocks are used; I think bitswap is the only place where block sizes are most constrained. Plus it’s not really a big concern for multicodecs, because we don’t assume that you’re going to be passing them through IPFS systems or even using them in CIDs.~ (see below, not my beef)
@drernie I think a pull request should be fine for these; what you say here sounds reasonable. One of us will probably have a look in a bit more detail, just to sanity-check that it’s logical for them to have separate entries.