dvc: dvc push: when not all files are in the local cache, it still pushes the hash.dir file, breaking the remote repository.
EDIT: Solved. The issue was pushing incomplete datasets from my GPU server to the new storage remotes. It pushed all the files present in the local cache (which is what I wanted), but then it also pushed the hash.dir listing, blocking my local machine from uploading the rest of the files that were not present on the cloud server. This is a pretty serious bug IMO (even though my usage was accidental and poor practice!)
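The failure mode is easier to see in miniature. The sketch below is a hypothetical model of the push logic, not DVC's actual code (`push`, `dir_listing`, and the file names are invented for illustration): once the `.dir` listing exists on the remote, a later push treats the directory as complete and never uploads the missing files.

```python
# Hypothetical model of the failure mode described above (not DVC's code):
# a remote is "complete" for a directory once its .dir listing is present,
# so a later push from a machine that *does* have every file bails early.

def push(local_cache: set, remote: set, dir_hash: str, dir_listing: set) -> None:
    """Push whatever is in the local cache, then the .dir listing."""
    if dir_hash in remote:
        # The .dir listing is already on the remote, so the directory is
        # (wrongly) assumed complete and nothing more is uploaded.
        return
    remote.update(local_cache & dir_listing)  # upload the files we have
    remote.add(dir_hash)                      # push .dir even if incomplete

listing = {"f1", "f2", "f3"}
remote: set = set()

# GPU server holds only part of the dataset but still pushes the .dir file.
push({"f1", "f2"}, remote, "hash.dir", listing)
# Local machine has everything, but the .dir file on the remote blocks it.
push(listing, remote, "hash.dir", listing)

print(sorted(remote))  # → ['f1', 'f2', 'hash.dir'] — "f3" is never uploaded
```

This is exactly the state the S3 remote ended up in: `dvc push` on other machines reported everything up to date while several cache objects were missing.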
Bug Report
Please provide information about your setup
Output of `dvc version`:

```
$ dvc version
DVC version: 1.3.1 (pip)
---------------------------------
Platform: Python 3.7.7 on Linux-5.3.0-1032-aws-x86_64-with-debian-buster-sid
Supports: http, https, s3
Cache types: hardlink, symlink
Repo: dvc, git
```
Additional Information (if any):
On this VPS I get the following errors repeatedly when trying to pull from my S3 storage:

```
2020-08-05 18:32:52,114 ERROR: failed to download 's3://[xxx]/repo.dvc/69/763f0cecd801483a1490a0b2a0b84d' to '.dvc/cache/69/763f0cecd801483a1490a0b2a0b84d' - An error occurred (404) when calling the HeadObject operation: Not Found
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/.venv-dvc/lib/python3.7/site-packages/dvc/cache/local.py", line 30, in wrapper
    func(from_info, to_info, *args, **kwargs)
  File "/home/ubuntu/.venv-dvc/lib/python3.7/site-packages/dvc/tree/base.py", line 420, in download
    from_info, to_info, name, no_progress_bar, file_mode, dir_mode
  File "/home/ubuntu/.venv-dvc/lib/python3.7/site-packages/dvc/tree/base.py", line 478, in _download_file
    from_info, tmp_file, name=name, no_progress_bar=no_progress_bar
  File "/home/ubuntu/.venv-dvc/lib/python3.7/site-packages/dvc/tree/s3.py", line 341, in _download
    Bucket=from_info.bucket, Key=from_info.path
  File "/home/ubuntu/.venv-dvc/lib/python3.7/site-packages/botocore/client.py", line 316, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/ubuntu/.venv-dvc/lib/python3.7/site-packages/botocore/client.py", line 635, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found
```
This happened on an ec2-linux VPS. I tried again on an Ubuntu Deep Learning AMI, then again with a fresh python3 virtualenv with only dvc installed. I have not been able to replicate this on any of my local workstations; they are able to clone the DVC directories just fine, even pushing from one and pulling on another.
Also, on any machine, `aws s3 ls ...` does not return anything for the hashes it is searching for on S3. But I am able to clone the .dvc on my other machines… I am stumped…
For the record, one local dvc version:

```
DVC version: 1.3.1 (pip)
---------------------------------
Platform: Python 3.6.9 on Linux-5.4.0-42-generic-x86_64-with-Ubuntu-18.04-bionic
Supports: http, https, s3, ssh
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sda2
Workspace directory: ext4 on /dev/sda2
Repo: dvc, git
```
If applicable, please also provide a `--verbose` output of the command, e.g. `dvc add --verbose`.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 21 (21 by maintainers)
Can reproduce with this test
Each file and dir hash to upload has its own executor, and for the dir hash we run:
https://github.com/iterative/dvc/blob/e2a574137430a6beacb86d4eb3ff8d7e4fca6734/dvc/cache/local.py#L382
which should wait on the list of executors for each file contained in the directory to finish, and then only push the final dir hash if all files were uploaded successfully.
Can you run `dvc push -v ...` from your GPU machine and post the output?

edit: actually, thinking about it now, we might not be accounting for the case where the list of files to push is incomplete from the start. So we upload the "incomplete" list without any errors, and then treat that as successfully uploading the full directory.
Sure, just had to get back to a terminal and boot it up.
There was a change made in DVC 1.0 where we started indexing remotes for performance reasons. It's possible that if something went wrong in the middle of a `dvc push` in < 1.0, our index could get into a state where DVC on your local machine mistakenly assumes files are already in your S3 remote (and as a result says everything is up to date when you run push).

Could you make sure that DVC on your local machine is up to date, and then run the following on that machine:
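The indexing shortcut can be sketched like this (a hypothetical model, not DVC's actual index code): a stale index entry short-circuits the push before the remote is ever consulted, which is why a broken earlier push leaves the remote permanently incomplete.

```python
# Hypothetical model of the remote-index shortcut described above (not
# DVC's actual code): once a dir hash is recorded in the local index,
# push trusts the index and never re-checks the remote, so a stale entry
# from an interrupted earlier push makes everything look up to date.

def push_with_index(files, dir_hash, remote: set, index: set) -> str:
    if dir_hash in index:
        return "Everything is up to date."  # remote never re-checked
    remote.update(files)
    remote.add(dir_hash)
    index.add(dir_hash)
    return "pushed"

remote: set = set()
index = {"hash.dir"}  # stale entry left by an interrupted earlier push
msg = push_with_index({"f1", "f2"}, "hash.dir", remote, index)
print(msg, remote)  # reports up to date, yet the remote is still empty
```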