dvc push: when not all files are in the local cache, still pushes the hash.dir file, breaking the remote repository.

EDIT: Solved. The issue was pushing incomplete datasets from my GPU server to the new storage remotes. It pushed all the files present in the local cache (which is what I wanted), but then it also pushed the hash.dir listing, blocking my local machine from uploading the rest of the files that were not present on the cloud server. This is a pretty serious bug IMO (even though triggering it was accidental and poor practice on my part!).
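For context, DVC stores a tracked directory in the cache as a small .dir object: a JSON listing of each contained file's hash and relative path, kept separately from the file objects themselves. A rough sketch of such a listing (the format is simplified and the values below are illustrative, not taken from my repo):

import json

# Illustrative .dir listing for a directory containing two files.
# Each entry records a file's md5 and its path relative to the directory.
dir_listing = [
    {"md5": "acbd18db4cc2f85cedef654fccc4a4d8", "relpath": "foo"},
    {"md5": "37b51d194a7513e45b56f6524f2d51f2", "relpath": "bar"},
]

# The remote stores this listing under the directory's own hash. Once the
# .dir object exists remotely, the directory can look fully pushed even if
# some of the files it lists never reached the remote -- which is what
# blocked my local machine from uploading the rest.
print(json.dumps(dir_listing, indent=2))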

Bug Report

Please provide information about your setup

Output of dvc version:

$ dvc version
DVC version: 1.3.1 (pip)
---------------------------------
Platform: Python 3.7.7 on Linux-5.3.0-1032-aws-x86_64-with-debian-buster-sid
Supports: http, https, s3
Cache types: hardlink, symlink
Repo: dvc, git

Additional Information (if any):

On this VPS I get the following errors repeatedly when trying to pull from my s3 storage:

2020-08-05 18:32:52,114 ERROR: failed to download 's3://[xxx]/repo.dvc/69/763f0cecd801483a1490a0b2a0b84d' to '.dvc/cache/69/763f0cecd801483a1490a0b2a0b84d' - An error occurred (404) when calling the HeadObject operation: Not Found
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/.venv-dvc/lib/python3.7/site-packages/dvc/cache/local.py", line 30, in wrapper
    func(from_info, to_info, *args, **kwargs)
  File "/home/ubuntu/.venv-dvc/lib/python3.7/site-packages/dvc/tree/base.py", line 420, in download
    from_info, to_info, name, no_progress_bar, file_mode, dir_mode
  File "/home/ubuntu/.venv-dvc/lib/python3.7/site-packages/dvc/tree/base.py", line 478, in _download_file
    from_info, tmp_file, name=name, no_progress_bar=no_progress_bar
  File "/home/ubuntu/.venv-dvc/lib/python3.7/site-packages/dvc/tree/s3.py", line 341, in _download
    Bucket=from_info.bucket, Key=from_info.path
  File "/home/ubuntu/.venv-dvc/lib/python3.7/site-packages/botocore/client.py", line 316, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/ubuntu/.venv-dvc/lib/python3.7/site-packages/botocore/client.py", line 635, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found

This happened on an ec2-linux VPS. I tried again on an Ubuntu Deep Learning AMI, then again with a fresh python3 virtualenv with only dvc installed. I have not been able to replicate this on any of my local workstations; they are able to clone the dvc directories just fine, even pushing from one and pulling on another.

Also, on any machine, aws s3 ls ... does not return anything for the hashes it is searching for on s3. But I am able to clone the .dvc on my other machines… I am stumped…

For the record, here is one local dvc version:

DVC version: 1.3.1 (pip)
---------------------------------
Platform: Python 3.6.9 on Linux-5.4.0-42-generic-x86_64-with-Ubuntu-18.04-bionic
Supports: http, https, s3, ssh
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sda2
Workspace directory: ext4 on /dev/sda2
Repo: dvc, git


About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 21 (21 by maintainers)

Most upvoted comments

Can reproduce with this test

# assumed imports: `first` from funcy, `remove` from dvc.utils.fs
from funcy import first
from dvc.utils.fs import remove


def test_push_incomplete_dir(tmp_dir, dvc, mocker, local_remote):
    (stage,) = tmp_dir.dvc_gen({"dir": {"foo": "foo", "bar": "bar"}})
    remote = dvc.cloud.get_remote("upstream")

    cache = dvc.cache.local
    dir_hash = stage.outs[0].checksum
    used = stage.get_used_cache(remote=remote)

    # remove one of the local cache files for directory
    file_hash = first(used.child_keys(cache.tree.scheme, dir_hash))
    remove(cache.tree.hash_to_path_info(file_hash))

    dvc.push()
    assert not remote.tree.exists(remote.tree.hash_to_path_info(dir_hash))

Each file and dir hash to upload has its own executor, and for the dir hash we run:

https://github.com/iterative/dvc/blob/e2a574137430a6beacb86d4eb3ff8d7e4fca6734/dvc/cache/local.py#L382

which should wait for the executors of each file contained in the directory to finish, and then push the final dir hash only if all files were uploaded successfully.
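Roughly, the intended flow is the pattern below; this is only a sketch of the description above, not DVC's actual code, and push_directory/upload are hypothetical names (upload is assumed to return True on success):

from concurrent.futures import ThreadPoolExecutor

def push_directory(upload, file_hashes, dir_hash):
    # One upload task per file hash contained in the directory.
    with ThreadPoolExecutor() as executor:
        futures = [executor.submit(upload, h) for h in file_hashes]
        results = [f.result() for f in futures]  # wait on every file upload

    # Push the .dir object only if every listed file reached the remote.
    if all(results):
        upload(dir_hash)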

Can you run dvc push -v ... from your GPU machine and post the output?

edit: actually, thinking about it now, we might not be accounting for the case where the list of files to push is incomplete from the start. In that case we upload the “incomplete” list without any errors, and then treat that as having successfully uploaded the full directory.
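So the missing guard would be something along these lines: compare what the push is actually able to upload against the full listing in the .dir object, and hold back the dir hash when anything is missing. A sketch only, with hypothetical names, not the actual patch:

def dir_push_is_safe(listed_file_hashes, local_file_hashes):
    # listed_file_hashes: every file hash recorded in the .dir object.
    # local_file_hashes: file hashes actually present in the local cache,
    # i.e. what this push is able to upload.
    missing = set(listed_file_hashes) - set(local_file_hashes)
    # If anything listed in the directory is missing locally, pushing the
    # .dir object would make the remote look complete when it is not.
    return not missing

With a check like that, an incomplete push would still upload whatever files it has, but leave the .dir object off the remote so another machine could finish the upload later.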

Sure, just had to get back to a terminal and boot it up.

$ dvc version
DVC version: 1.3.1 (pip)
---------------------------------
Platform: Python 3.7.7 on Linux-4.14.181-142.260.amzn2.x86_64-x86_64-with-glibc2.10
Supports: http, https, s3, ssh

It is possible I had a (daily) version mix-up, but I just don’t see how that alone could have produced this?

I’m just really confused that it is checking the remote cache when pushing and not actually sending the missing hashes.

There was a change made in DVC 1.0 where we started indexing remotes for performance reasons. It’s possible that if something went wrong in the middle of a dvc push in < 1.0, our index could get into a state where DVC on your local machine mistakenly assumes files are already in your S3 remote (and as a result says everything is up-to-date when you run push).
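Loosely, the failure mode is a local cache of “what’s on the remote” that never gets invalidated; a rough illustration (not DVC’s actual data structures, and the hash is just the one from the traceback above):

# Hashes the local index believes are already on the S3 remote, e.g.
# recorded during a push that was interrupted partway through.
indexed_on_remote = {"69763f0cecd801483a1490a0b2a0b84d"}

# Hashes this push would otherwise need to upload.
to_push = {"69763f0cecd801483a1490a0b2a0b84d"}

# "Known" hashes are skipped without re-checking S3, so a hash that never
# actually reached the remote is never retried and push reports everything
# as up to date. Deleting .dvc/tmp/index drops this cache and forces a
# fresh comparison against the remote itself.
print(to_push - indexed_on_remote)  # set() -> nothing would be uploaded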

Could you make sure that DVC on your local machine is up to date, and then run the following on that machine:

rm -rf .dvc/tmp/index
dvc push -r s3