dvc: import: still creates .dvc file even when operation fails

Bug Report

Description

Even when dvc import fails, it still creates the .dvc file.

Reproduce

Without AWS sandbox credentials set, try dvc import -v git@github.com:dberenbaum/dataset.git cats-dogs. A cats-dogs.dvc file is generated.

$ dvc import -v git@github.com:dberenbaum/dataset.git cats-dogs
2023-07-31 09:11:29,074 DEBUG: v3.10.2.dev2+g5c2fd67e6, CPython 3.11.4 on macOS-13.4.1-arm64-arm-64bit
2023-07-31 09:11:29,074 DEBUG: command: /Users/dave/micromamba/envs/dvc/bin/dvc import -v git@github.com:dberenbaum/dataset.git cats-dogs
2023-07-31 09:11:29,287 DEBUG: Removing output 'cats-dogs' of stage: 'cats-dogs.dvc'.
2023-07-31 09:11:29,287 DEBUG: Removing '/Users/dave/repo/cats-dogs'
Importing 'cats-dogs (git@github.com:dberenbaum/dataset.git)' -> 'cats-dogs'
2023-07-31 09:11:29,288 DEBUG: Computed stage: 'cats-dogs.dvc' md5: '9ce0eabbbaa6ff8015aef3312fd8d213'
2023-07-31 09:11:29,288 DEBUG: 'md5' of stage: 'cats-dogs.dvc' changed.
2023-07-31 09:11:29,288 DEBUG: Creating external repo git@github.com:dberenbaum/dataset.git@None
2023-07-31 09:11:29,289 DEBUG: erepo: git clone 'git@github.com:dberenbaum/dataset.git' to a temporary dir
2023-07-31 09:11:33,155 ERROR: failed to load ('cats-dogs',) - Unable to locate credentials
Traceback (most recent call last):
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/index/index.py", line 541, in _load_from_storage
    _load_from_object_storage(trie, entry, storage)
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/index/index.py", line 477, in _load_from_object_storage
    obj = Tree.load(
          ^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/hashfile/tree.py", line 191, in load
    with obj.fs.open(obj.path, "r") as fobj:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_objects/fs/base.py", line 222, in open
    return self.fs.open(path, mode=mode, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/spec.py", line 1229, in open
    self.open(
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/spec.py", line 1241, in open
    f = self._open(
        ^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/s3fs/core.py", line 659, in _open
    return S3File(
           ^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/s3fs/core.py", line 2066, in __init__
    super().__init__(
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/spec.py", line 1597, in __init__
    self.size = self.details["size"]
                ^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/spec.py", line 1610, in details
    self._details = self.fs.info(self.path)
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/asyn.py", line 121, in wrapper
    return sync(self.loop, func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/asyn.py", line 106, in sync
    raise return_result
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/asyn.py", line 61, in _runner
    result[0] = await coro
                ^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/s3fs/core.py", line 1271, in _info
    out = await self._call_s3(
          ^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/s3fs/core.py", line 342, in _call_s3
    s3 = await self.get_s3(kwargs.get("Bucket"))
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/s3fs/core.py", line 336, in get_s3
    return await self._s3creator.get_bucket_client(bucket)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/s3fs/utils.py", line 39, in get_bucket_client
    response = await general_client.head_bucket(Bucket=bucket_name)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/client.py", line 354, in _make_api_call
    http, parsed_response = await self._make_request(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/client.py", line 379, in _make_request
    return await self._endpoint.make_request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/endpoint.py", line 96, in _send_request
    request = await self.create_request(request_dict, operation_model)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/endpoint.py", line 84, in create_request
    await self._event_emitter.emit(
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/hooks.py", line 66, in _emit
    response = await resolve_awaitable(handler(**kwargs))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/_helpers.py", line 15, in resolve_awaitable
    return await obj
           ^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/signers.py", line 24, in handler
    return await self.sign(operation_name, request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/signers.py", line 82, in sign
    auth.add_auth(request)
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/botocore/auth.py", line 418, in add_auth
    raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials

2023-07-31 09:11:33,163 WARNING: 'cats-dogs' is empty.
2023-07-31 09:11:33,163 DEBUG: Added '/Users/dave/repo/cats-dogs' to gitignore file.
2023-07-31 09:11:33,165 DEBUG: built tree 'object d751713988987e9331980363e24189ce.dir'
2023-07-31 09:11:33,165 DEBUG: Computed stage: 'cats-dogs.dvc' md5: 'd1c1ff68c486e7407aa917addbde4316'
2023-07-31 09:11:33,166 DEBUG: built tree 'object d751713988987e9331980363e24189ce.dir'
2023-07-31 09:11:33,167 DEBUG: Preparing to transfer data from 'memory://dvc-staging-md5/e491ff18e115f87446131d48702971fec1cf49467aabcca1cbce150e09fb0192' to '/Users/dave/repo/.dvc/cache/files/md5'
2023-07-31 09:11:33,167 DEBUG: Preparing to collect status from '/Users/dave/repo/.dvc/cache/files/md5'
2023-07-31 09:11:33,167 DEBUG: Collecting status from '/Users/dave/repo/.dvc/cache/files/md5'
2023-07-31 09:11:33,168 DEBUG: Preparing to collect status from 'memory://dvc-staging-md5/e491ff18e115f87446131d48702971fec1cf49467aabcca1cbce150e09fb0192'
2023-07-31 09:11:33,168 DEBUG: transfer dir: md5: d751713988987e9331980363e24189ce.dir with 0 files
2023-07-31 09:11:33,169 DEBUG: built tree 'object d751713988987e9331980363e24189ce.dir'
2023-07-31 09:11:33,170 DEBUG: Removing '/Users/dave/repo/.2XQmKjhqscixAiPLGzL6ve.tmp'
2023-07-31 09:11:33,170 DEBUG: Removing '/Users/dave/repo/.2XQmKjhqscixAiPLGzL6ve.tmp'
2023-07-31 09:11:33,170 DEBUG: Removing '/Users/dave/repo/.dvc/cache/files/md5/.LHZ5pSSNtj4Nty4zXX7cSe.tmp'
2023-07-31 09:11:33,172 DEBUG: Saving information to 'cats-dogs.dvc'.
2023-07-31 09:11:33,173 DEBUG: Staging files: {'.gitignore', 'cats-dogs.dvc'}
2023-07-31 09:11:33,176 DEBUG: Analytics is disabled.

Expected

No cats-dogs.dvc file to be created.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 25 (23 by maintainers)

Commits related to this issue

Most upvoted comments

Update: it’s a bigger problem that’s not really about error handling at all. We are corrupting the cache, which is probably at the root of a lot of the recent issues.

This isn’t really corrupting cache. d751713988987e9331980363e24189ce.dir is the correct hash for an empty directory. The issue is that right now the underlying dvcfs.find() for the import source returns an empty directory instead of erroring out. So DVC is generating the correct cache entry for what it thinks is supposed to be an empty directory.

(DVC is not generating a .dir file that is supposed to contain files and then skipping files the in that .dir)

Edit: so we aren’t getting any s3 errors here because dvc thinks it’s an empty directory.

This is backwards. The issue is that we are getting s3 errors (in dvc-data), but those errors were being suppressed and dvc-data was just returning that the directory was empty (instead of re-raising the errors to DVC).


The error handling has already been fixed in dvc-data by @efiop, and once the corresponding changes are made in DVC all of the import-related symptoms for this underlying issue should be resolved without the user needing to do anything. After the error handling is fixed, fs.find() will return actual files (assuming the user has fixed their credentials), so dvc status will show that the import has changed (and running dvc update would generate a new dvc file that contains the actual hash for the imported dir).

This isn’t really corrupting cache. d751713988987e9331980363e24189ce.dir is the correct hash for an empty directory.

We were corrupting the “index cache” in site_cache_dir

not really about error handling at all

I guess it’s still related since the workflow is:

  1. Collect info from .dir file
  2. .dir file not found but no error raised because we catch FileNotFoundError
  3. Assume the dir is empty since no paths are found
  4. Cache gets corrupted with empty .dir file

Update: it’s a bigger problem that’s not really about error handling at all. We are corrupting the cache, which is probably at the root of a lot of the recent issues.

$ dvc import git@github.com:dberenbaum/dataset2.git cats-dogs
Importing 'cats-dogs (git@github.com:dberenbaum/dataset2.git)' -> 'cats-dogs'
WARNING: 'cats-dogs' is empty.

$ cat cats-dogs.dvc
md5: 80b00298143632f1990d8979ea98a437
frozen: true
deps:
- path: cats-dogs
  repo:
    url: git@github.com:dberenbaum/dataset2.git
    rev_lock: 00db5587e8b4ee2f78421523c7db7434727e4774
outs:
- md5: d751713988987e9331980363e24189ce.dir
  size: 0
  nfiles: 0
  hash: md5
  path: cats-dogs

$ cat .dvc/cache/files/md5/d7/51713988987e9331980363e24189ce.dir
[]%

Edit: so we aren’t getting any s3 errors here because dvc thinks it’s an empty directory.

Thanks @efiop, this seems to fix it if the credentials are missing since that will throw an exception for being unable to locate credentials, but I think I was too hasty saying that would fully fix the problem.

If the credentials are wrong, it generates a FileNotFoundError, and that still doesn’t get raised because of this block.

I’m not sure what the best solution here is and haven’t had a chance to look into why we don’t raise FileNotFoundError there, or why there isn’t something like a 403 error raised anywhere.

For testing, can we test in both dvc-data (for index methods and exception handling) and dvc (end-to-end download/import scenario)?

@dberenbaum Yeah, that’s what we’ve discussed today partially. I’ll introduce a mechanism for that shortly.

Looks like it’s ultimately coming from https://github.com/iterative/dvc-data/pull/409. Was this for studio @efiop? Can we handle these exceptions more carefully and raise them?

~Edit: You can skip this entire comment and go to the next one.~

It’s coming from here: https://github.com/iterative/dvc/blob/b235487eccb2a6c1d84bc0ef279b3d774003fee2/dvc/fs/__init__.py#L60-L64

That throws an error but it gets suppressed (~I can’t tell how or why~). We should be raising that error.

Edit: I found that it’s coming from here:

https://github.com/iterative/dvc/blob/b235487eccb2a6c1d84bc0ef279b3d774003fee2/dvc/fs/dvc.py#L291

Edit 2: Even if we remove that suppression, we still catch the exception here and fail to raise it 🤦 :

https://github.com/iterative/dvc/blob/b235487eccb2a6c1d84bc0ef279b3d774003fee2/dvc/fs/dvc.py#L367-L369

No, sorry for the confusion @farhanhubble. Currently, it’s working like this:

dvc import --no-download # -> Creates a .dvc with empty outs
dvc update    # -> Corrupts the .dvc outs on failed download

The previous expected behavior was:

dvc import --no-download # -> Creates a .dvc with empty outs
dvc update    # -> Leaves .dvc untouched on failed download