azure-storage-azcopy: Cannot resume azcopy job, possibly due to running out of memory

Which version of the AzCopy was used?

10.3.1

Which platform are you using? (ex: Windows, Mac, Linux)

Linux (inside docker ubuntu:latest)

What command did you run?

First I ran

azcopy copy "/data/*" "https://somewhere.blob.core.windows.net/azcopytest/XXXXX" --output-type text --put-md5 --no-guess-mime-type --content-type application/octet-stream --overwrite true --follow-symlinks --from-to LocalBlob --log-level ERROR --exclude-path ".snapshot/;azcopy/" --recursive=true
INFO: Scanning...

Job 86e3f350-2675-0b44-6775-8fc793e1ad7c has started
Log file is located at: /data/azcopy/86e3f350-2675-0b44-6775-8fc793e1ad7c.log

2.1 %, 831 Done, 0 Failed, 39169 Pending, 0 Skipped, 40000 Total (scanning...), 2-sec Throughput (Mb/s): 181.2666
fatal error: runtime: out of memory

runtime stack:
runtime.throw(0xb734ec, 0x16)
	/opt/hostedtoolcache/go/1.12.0/x64/src/runtime/panic.go:617 +0x72
runtime.sysMap(0xc088000000, 0x4000000, 0x11e4af8)
	/opt/hostedtoolcache/go/1.12.0/x64/src/runtime/mem_linux.go:170 +0xc7
...
....

I then tried resuming the job, but ran into the following issue

azcopy jobs resume 86e3f350-2675-0b44-6775-8fc793e1ad7c --destination-sas "xxxx"
cannot resume job with JobId 86e3f350-2675-0b44-6775-8fc793e1ad7c . It hasn't been ordered completely
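
For reference, the state azcopy recorded for the job can be inspected before attempting a resume. This is a minimal sketch using the standard azcopy jobs subcommands (same job ID as above):

# List all jobs known on this machine, with their final status
azcopy jobs list

# Show the detailed status azcopy recorded for this particular job
azcopy jobs show 86e3f350-2675-0b44-6775-8fc793e1ad7c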

How can we reproduce the problem in the simplest way?

Probably try running out of memory, since that seems to corrupt the execution plan (a sketch of one way to force this is shown below)
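
One way to approximate that, assuming a Linux shell and that a virtual-memory cap triggers the same kind of allocation failure (not verified against the exact error above):

# Cap the process's virtual address space (~1 GiB here) so the Go runtime's allocations start failing,
# then run the same copy command as above
ulimit -v 1048576
azcopy copy "/data/*" "https://somewhere.blob.core.windows.net/azcopytest/XXXXX" --recursive=true --put-md5

Whether this corrupts the job plan in exactly the same way as a real out-of-memory condition is not something I have verified.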

Have you found a mitigation/solution?

No

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 29 (18 by maintainers)

Most upvoted comments

@landro Thanks for the tip about az-blob-hashdeep. I had not seen that before. One important point about it: it just reads the MD5 hash that is stored against the blob. It’s important to understand that the hash is supplied by the tool that uploads the blob (AzCopy in this case) and nothing in the blob storage service actually checks that the blob content matches the hash. The only time the check takes place is if you download the blob with a hash-aware tool (such as AzCopy).

So using az-blob-hashdeep allows you to check that every blob has a hash and that those hashes match the hashes computed by hashdeep from your local (original) copy of the same files. But it does not prove that the blob content actually matches those hashes.
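
To make the "it just reads the stored hash" point concrete: the stored Content-MD5 of an individual blob can be read directly, for example with the Azure CLI. This is only a sketch; the account, container, and blob names below are placeholders based on this issue, and note that the stored value is the base64-encoded digest rather than the hex string hashdeep prints.

# Read the stored Content-MD5 property of one blob (placeholder names; needs a valid SAS or login)
az storage blob show \
  --account-name somewhere \
  --container-name azcopytest \
  --name "XXXXX/somefile.bin" \
  --query properties.contentSettings.contentMd5 \
  --output tsv

# Convert the base64 value to hex for comparison with hashdeep output
echo "<base64-md5-from-above>" | base64 -d | xxd -p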

Below is a long extract of a draft description I wrote recently. We haven’t published it anywhere, but I’ll share it here on the understanding that it’s a first draft. It contains a more detailed description, and a tip on how to check the blob content.

File Content Integrity

“Were any bytes in a file changed, added or omitted?”

Checking this works by having the client-side tool (in this case, AzCopy) do two things:

(a) At upload time, compute the MD5 hash of the original source file as it reads it from disk. This hash gets stored as the ContentMD5 property of the blob. You can think of this as the "hash of the original file, as read".

(b) At download time, compute the MD5 hash of the file as it saves it to disk. You can think of this as the "hash of the downloaded file, as saved".

If the two hashes match, that proves everything is fine: the hash of the original file, as read, matches the hash of the downloaded file, as saved, which proves the integrity of the whole end-to-end process. If they don't match, then (by default) AzCopy will report an error.

See the AzCopy parameters --put-md5 and --check-md5 for details. Note that for the strictest checking, your download can use --check-md5 FailIfDifferentOrMissing.
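
As a hedged sketch of the two halves of that check, reusing the source and destination from this issue as placeholders (the SAS token and the local restore path are placeholders too):

# Upload: hash each file while reading it and store the hash as the blob's Content-MD5
azcopy copy "/data/*" "https://somewhere.blob.core.windows.net/azcopytest/XXXXX?<SAS>" --recursive --put-md5

# Download: re-hash each file while writing it and fail if the hash differs from, or is missing on, the blob
azcopy copy "https://somewhere.blob.core.windows.net/azcopytest/XXXXX?<SAS>" "/restore" --recursive --check-md5 FailIfDifferentOrMissing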

The key point here is that the check is done at download time. So how do you use that check? There are two options:

  • Get all your download tooling to do the check. E.g. if you upload with AzCopy but download with your own SDK-based tooling, then you need to make sure that your download code is doing the check. In this approach, you don’t detect any errors immediately; you detect them later, when the data is used. OR

  • You can download everything straight away just to check it. That sounds awful… but there’s a trick to make it much easier. The trick is this: make a large VM in the same region as your Storage Account (e.g. 16 cores or more). Run AzCopy on that VM and download all your data, but specify NUL as the destination (NUL on Windows, /dev/null on Linux). This tells AzCopy “download all the data, but don’t save it anywhere”. That way, you can check the hashes really quickly. If there’s any case where the hash doesn’t match, AzCopy will tell you. In many cases, you can check at least 4TB of data per hour this way (but it’s slower if the files are very small). A sketch of such a command is shown just after this list.
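
On Linux, that verification-only download looks roughly like this (a sketch; the URL and SAS are placeholders):

# Download everything to /dev/null purely to verify the stored hashes; nothing is written to disk
azcopy copy "https://somewhere.blob.core.windows.net/azcopytest/XXXXX?<SAS>" /dev/null --recursive --check-md5 FailIfDifferentOrMissing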

If you want to use these download-time checks, it’s important that --put-md5 is on the AzCopy command line at the time of upload. That is the default in Storage Explorer 1.10.1.

Data Transmission Integrity “Did the network corrupt our data?”

This is covered simply by using HTTPS. Because HTTPS protects you against malicious tampering, it must therefore also protect you against accidental tampering (i.e. the network corrupting your data).

Note that this means you don’t need to check MD5 hashes to protect against network errors. Checking MD5s (as described above under File Content Integrity) proves that AzCopy and the Storage service didn’t mess anything up. But it doesn’t prove anything about the network … because if you’re using HTTPS you already have proof that the network didn’t mess anything up.

File Selection Integrity

“Did we move the right files?”

This is not about the content of the files, but about which files were processed. E.g. did the tool leave any out? Obviously you can check the file counts that are reported by AzCopy. We don’t currently have anything else built in to the tool to check this. With other tools, and maybe a little scripting, it is possible to get listings of the source and destination and compare them (one rough sketch is shown below). [This is where I see az-blob-hashdeep being particularly useful]
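
One rough sketch of such a comparison, assuming the same source and destination as above and standard GNU tools on the client (azcopy list prints names plus sizes, so its output needs a little munging before a name-by-name diff):

# Count and list the local source files
find /data -type f | sort > local-files.txt

# List the blobs at the destination and compare the counts
azcopy list "https://somewhere.blob.core.windows.net/azcopytest/XXXXX?<SAS>" > remote-files.txt
wc -l local-files.txt remote-files.txt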

(See also small followup comment below)

Here is the performance profile from using azcopy 10.3.2 with AZCOPY_CONCURRENCY_VALUE=16 and AZCOPY_BUFFER_GB=0.5 to copy 1.2M files (that is 300K more than in the last test https://github.com/Azure/azure-storage-azcopy/issues/715#issuecomment-549273031):

[Screenshot 2019-11-20 at 09:11:52: performance profile showing memory and network TX over the course of the job]

When the azcopy job finishes (when TX drops to 0), the OS releases 800 MB of RAM. 500 of these 800 MB are (probably) used by the buffer, so azcopy itself used around 300 MB. During the last test https://github.com/Azure/azure-storage-azcopy/issues/715#issuecomment-549273031, azcopy 10.3.1 itself used around 500 MB while transferring 1.0M files, so it seems like things have improved, even though more files are being transferred! There is also reason to believe that the memory footprint of azcopy is not impacted by the number of files to copy, which was my main concern. I will therefore close this issue now.
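
For reference, the AZCOPY_CONCURRENCY_VALUE and AZCOPY_BUFFER_GB knobs used above are plain environment variables set before invoking azcopy; a minimal sketch:

# Cap concurrent network operations and the shared buffer size, then run the copy
export AZCOPY_CONCURRENCY_VALUE=16
export AZCOPY_BUFFER_GB=0.5
azcopy copy "/data/*" "https://somewhere.blob.core.windows.net/azcopytest/XXXXX" --recursive=true --put-md5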

Key takeaways for other folks that migrate from NFS to Azure blob storage:

  • Make sure you have enough free memory on your NFS client to handle the dentry and nfs_inode_cache caches, since the Linux kernel releases this cache relatively slowly and there are few options for tuning this cache behaviour (see the sketch after this list)
  • Make sure to verify the transfer works as expected using tools like hashdeep and az-blob-hashdeep
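
A rough sketch of how to watch and, if needed, release those kernel caches on the NFS client (requires root; echo 2 into drop_caches discards only clean, reclaimable dentries and inodes):

# Watch the dentry and NFS inode slab caches during the transfer
slabtop -o | grep -E 'dentry|nfs_inode_cache'

# Ask the kernel to reclaim dentries and inodes
sync
echo 2 | sudo tee /proc/sys/vm/drop_caches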

I’ll give the new version a test drive on a bigger dataset tomorrow and let you guys know how it behaves.

  On 1 Nov 2019 at 03:21, John Rusk [MSFT] notifications@github.com wrote:

@landro We released version 10.3.2 yesterday. It returns unused memory to the OS more promptly. I don’t know for sure if that will help in your situation, but it might.

Do you think we need to do anything more on this issue? If so, can you please answer the questions that Ze and I left, above. Otherwise, we’ll close this issue soon.
