dvc: adding external data on S3 fails

To keep track of the problem:

https://discordapp.com/channels/485586884165107732/485596304961962003/678284389230182430

the data set i'm trying to add is an LDC data set
in s3://BUCKET/LDC/LDC96T17/ there's only one directory
callhome_spanish_trans_970711
adding s3://BUCKET/LDC/LDC96T17/ fails
adding s3://BUCKET/LDC/LDC96T17/callhome_spanish_trans_970711/ work

more context here:

https://discordapp.com/channels/485586884165107732/485596304961962003/678261668723032075

ERROR: s3://<BUCKET>/dvc_cache/da/8ae919cf1e2400af35eae297eacd67 does not exist: An error occurred (404) when calling the HeadObject operation: Not Found

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

We do move the file from storage to the cache (i.e. copy and then remove the file from storage), and then relinked again from the cache to storage.

https://github.com/iterative/dvc/blob/682275dc0a6be12859da03747a319dcebfc1688a/dvc/remote/base.py#L426-L428

As this happens on RemoteBase and ._save_file(), this likely affects all the remote storages we support (not confimed except for s3) and file & dir uploads as well respectively.

You can see at the following frames (top - bottom depth, see last two lines):

<module>, __main__.py:7
main, main.py:50
run, add.py:20
wrapper, __init__.py:27
run, scm_context.py:4
add, add.py:86
wrapper, decorators.py:39
rwlocked, stage.py:161
__call__, decorators.py:60
commit, stage.py:823
commit, base.py:241
save, base.py:520
_save, base.py:526
_save_dir, base.py:479
_save_file, base.py:427
move, base.py:660
remove, s3.py:186

It can also be verified with a quick aws s3 ls s3://<path> (check timestamp).

Log for single file upload

➜ dvc add s3://dvc-temp/saugat/storage/foo -v
2020-03-13 15:55:39,505 DEBUG: PRAGMA user_version;
2020-03-13 15:55:39,505 DEBUG: fetched: [(3,)]
2020-03-13 15:55:39,505 DEBUG: CREATE TABLE IF NOT EXISTS state (inode INTEGER PRIMARY KEY, mtime TEXT NOT NULL, size TEXT NOT NULL, md5 TEXT NOT NULL, timestamp TEXT NOT NULL)
2020-03-13 15:55:39,506 DEBUG: CREATE TABLE IF NOT EXISTS state_info (count INTEGER)
2020-03-13 15:55:39,506 DEBUG: CREATE TABLE IF NOT EXISTS link_state (path TEXT PRIMARY KEY, inode INTEGER NOT NULL, mtime TEXT NOT NULL)
2020-03-13 15:55:39,506 DEBUG: INSERT OR IGNORE INTO state_info (count) SELECT 0 WHERE NOT EXISTS (SELECT * FROM state_info)
2020-03-13 15:55:39,506 DEBUG: PRAGMA user_version = 3;
2020-03-13 15:56:03,337 DEBUG: cache 's3://dvc-temp/saugat/tmp-1/cache/d3/b07384d113edec49eaa6238ad5ff00' expected 'd3b07384d113edec49eaa6238ad5ff00' actual 'd3b07384d113edec49eaa6238ad5ff00'
2020-03-13 15:56:18,814 DEBUG: Saving 's3://dvc-temp/saugat/storage/foo' to 's3://dvc-temp/saugat/tmp-1/cache/d3/b07384d113edec49eaa6238ad5ff00'.
2020-03-13 15:56:22,760 DEBUG: cache 's3://dvc-temp/saugat/tmp-1/cache/d3/b07384d113edec49eaa6238ad5ff00' expected 'd3b07384d113edec49eaa6238ad5ff00' actual 'd3b07384d113edec49eaa6238ad5ff00'
2020-03-13 15:56:22,761 DEBUG: Removing s3://dvc-temp/saugat/storage/foo
2020-03-13 15:56:30,530 DEBUG: Created 'copy': s3://dvc-temp/saugat/tmp-1/cache/d3/b07384d113edec49eaa6238ad5ff00 -> s3://dvc-temp/saugat/storage/foo
Stage is cached, skipping.
  0% Add|                                                                                                                                       |0/0 [00:51,     ?file/s]
2020-03-13 15:56:30,953 DEBUG: SELECT count from state_info WHERE rowid=?
2020-03-13 15:56:30,953 DEBUG: fetched: [(0,)]
2020-03-13 15:56:30,953 DEBUG: UPDATE state_info SET count = ? WHERE rowid = ?

Log for folder uploads

➜ dvc add s3://dvc-temp/saugat/storage/data -v
2020-03-13 15:52:45,052 DEBUG: PRAGMA user_version;
2020-03-13 15:52:45,052 DEBUG: fetched: [(3,)]
2020-03-13 15:52:45,053 DEBUG: CREATE TABLE IF NOT EXISTS state (inode INTEGER PRIMARY KEY, mtime TEXT NOT NULL, size TEXT NOT NULL, md5 TEXT NOT NULL, timestamp TEXT NOT NULL)
2020-03-13 15:52:45,053 DEBUG: CREATE TABLE IF NOT EXISTS state_info (count INTEGER)
2020-03-13 15:52:45,053 DEBUG: CREATE TABLE IF NOT EXISTS link_state (path TEXT PRIMARY KEY, inode INTEGER NOT NULL, mtime TEXT NOT NULL)
2020-03-13 15:52:45,054 DEBUG: INSERT OR IGNORE INTO state_info (count) SELECT 0 WHERE NOT EXISTS (SELECT * FROM state_info)
2020-03-13 15:52:45,054 DEBUG: PRAGMA user_version = 3;
2020-03-13 15:52:51,487 DEBUG: cache 's3://dvc-temp/saugat/tmp-1/cache/63/4763d85ae67510ffd8dfad284fa239.dir' expected '634763d85ae67510ffd8dfad284fa239.dir' actual 'None'
2020-03-13 15:52:51,488 WARNING: Output 's3://dvc-temp/saugat/storage/data' of 'data.dvc' changed because it is 'not in cache'
2020-03-13 15:53:04,810 DEBUG: Uploading '../../../../tmp/tmpj44owyz_' to 's3://dvc-temp/saugat/tmp-1/cache/.VPNaHQGJEdCKTyPxbbZ4gp.tmp'
2020-03-13 15:53:08,628 DEBUG: cache 's3://dvc-temp/saugat/tmp-1/cache/63/4763d85ae67510ffd8dfad284fa239.dir' expected '634763d85ae67510ffd8dfad284fa239.dir' actual 'None'
2020-03-13 15:53:12,151 DEBUG: Removing s3://dvc-temp/saugat/tmp-1/cache/.VPNaHQGJEdCKTyPxbbZ4gp.tmp
2020-03-13 15:53:13,452 DEBUG: {'s3://dvc-temp/saugat/storage/data': 'modified'}
2020-03-13 15:53:18,024 DEBUG: Uploading '../../../../tmp/tmp2i8sptd7' to 's3://dvc-temp/saugat/tmp-1/cache/.BhpWSzEfnS49dm9AcrjzDg.tmp'
2020-03-13 15:53:20,826 DEBUG: cache 's3://dvc-temp/saugat/tmp-1/cache/63/4763d85ae67510ffd8dfad284fa239.dir' expected '634763d85ae67510ffd8dfad284fa239.dir' actual '634763d85ae67510ffd8dfad284fa239'
2020-03-13 15:53:20,827 DEBUG: Computed stage 'data.dvc' md5: 'd09e4e34bb5c2ad7858adb8e2436bff2'
2020-03-13 15:53:21,186 DEBUG: Saving 's3://dvc-temp/saugat/storage/data' to 's3://dvc-temp/saugat/tmp-1/cache/63/4763d85ae67510ffd8dfad284fa239.dir'.
2020-03-13 15:53:24,127 DEBUG: cache 's3://dvc-temp/saugat/tmp-1/cache/d3/b07384d113edec49eaa6238ad5ff00' expected 'd3b07384d113edec49eaa6238ad5ff00' actual 'None'
2020-03-13 15:53:25,864 DEBUG: Removing s3://dvc-temp/saugat/storage/data/foo
2020-03-13 15:53:30,758 DEBUG: Created 'copy': s3://dvc-temp/saugat/tmp-1/cache/d3/b07384d113edec49eaa6238ad5ff00 -> s3://dvc-temp/saugat/storage/data/foo
2020-03-13 15:53:31,008 DEBUG: Saving information to 'data.dvc'.
100% Add|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|1/1 [00:45, 45.97s/file]

To track the changes with git, run:

	git add data.dvc
2020-03-13 15:53:31,026 DEBUG: SELECT count from state_info WHERE rowid=?
2020-03-13 15:53:31,026 DEBUG: fetched: [(0,)]
2020-03-13 15:53:31,027 DEBUG: UPDATE state_info SET count = ? WHERE rowid = ?

@shcheklein Yes, I think eventual consistency is to blame for that too. Our S3 tests have been flakey for quite a while now (probably not these days as we migrated to moto).

Another thing to note that the following is not guaranteed in S3 (emphasized), for which I think, we do quite often for external outputs:

Amazon S3 provides read-after-write consistency for PUTS of new objects in your S3 bucket in all Regions with one caveat. The caveat is that if you make a HEAD or GET request to a key name before the object is created, then create the object shortly after that, a subsequent GET might not return the object due to eventual consistency.

Thanks @skshetry , great research!!