dvc: adding external data on S3 fails
To keep track of the problem:
https://discordapp.com/channels/485586884165107732/485596304961962003/678284389230182430
The data set I'm trying to add is an LDC data set.
In s3://BUCKET/LDC/LDC96T17/ there's only one directory:
callhome_spanish_trans_970711
Adding s3://BUCKET/LDC/LDC96T17/ fails.
Adding s3://BUCKET/LDC/LDC96T17/callhome_spanish_trans_970711/ works.
More context here:
https://discordapp.com/channels/485586884165107732/485596304961962003/678261668723032075
ERROR: s3://<BUCKET>/dvc_cache/da/8ae919cf1e2400af35eae297eacd67 does not exist: An error occurred (404) when calling the HeadObject operation: Not Found
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 15 (15 by maintainers)
We do move the file from storage to the cache (i.e. copy it and then remove it from storage), and then relink it from the cache back to storage.
https://github.com/iterative/dvc/blob/682275dc0a6be12859da03747a319dcebfc1688a/dvc/remote/base.py#L426-L428
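The move-then-relink sequence described above can be sketched against a toy in-memory store, to show why a check on the cache object right after the copy may report a 404. This is only an illustration under stated assumptions: `EventuallyConsistentStore`, `move_then_relink`, and `head_with_retry` are hypothetical names, not DVC or boto3 APIs, and the "lag" counter is a crude stand-in for eventual consistency.

```python
class EventuallyConsistentStore:
    """Toy object store (hypothetical): a new key is invisible to head() for `lag` reads."""

    def __init__(self, lag=2):
        self.lag = lag
        self._objects = {}   # key -> bytes
        self._pending = {}   # key -> number of head() calls that will still miss

    def put(self, key, data):
        self._objects[key] = data
        self._pending[key] = self.lag

    def copy(self, src, dst):
        self.put(dst, self._objects[src])

    def delete(self, key):
        self._objects.pop(key, None)
        self._pending.pop(key, None)

    def head(self, key):
        # A key that was just written is reported as missing until the lag runs out.
        if key not in self._objects or self._pending.get(key, 0) > 0:
            if key in self._pending:
                self._pending[key] -= 1
            raise KeyError(f"{key} does not exist (404)")
        return len(self._objects[key])


def head_with_retry(store, key, attempts=5):
    """Retry head() a few times; one possible mitigation, not what DVC does."""
    for attempt in range(attempts):
        try:
            return store.head(key)
        except KeyError:
            if attempt == attempts - 1:
                raise


def move_then_relink(store, src, cache_key, head=None):
    """Copy src into the cache, remove src, check the cache object, link it back."""
    head = head or store.head
    store.copy(src, cache_key)   # storage -> cache
    store.delete(src)            # remove the original
    head(cache_key)              # existence check; may fail right after the copy
    store.copy(cache_key, src)   # relink cache -> storage
```

With the default `head`, the existence check fails immediately after the copy, which mirrors the reported error; passing `head=lambda k: head_with_retry(store, k)` lets the same sequence complete once the object becomes visible.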
As this happens in RemoteBase and ._save_file(), this likely affects all the remote storages we support (not confirmed except for S3), and both file and directory uploads.
You can see it in the following frames (top to bottom by depth; see the last two lines):
It can also be verified with a quick aws s3 ls s3://<path> (check the timestamp).
Log for single file upload
Log for folder uploads
@shcheklein Yes, I think eventual consistency is to blame for that too. Our S3 tests have been flaky for quite a while now (probably not these days, as we migrated to moto).
Another thing to note is that the following is not guaranteed in S3 (emphasized), yet I think we do it quite often for external outputs:
Thanks @skshetry , great research!!