dvc: import: still creates .dvc file even when operation fails
Bug Report
Description
Even when dvc import
fails, it still creates the .dvc
file.
Reproduce
Without AWS sandbox credentials set, try dvc import -v git@github.com:dberenbaum/dataset.git cats-dogs
. A cats-dogs.dvc
file is generated.
$ dvc import -v git@github.com:dberenbaum/dataset.git cats-dogs
2023-07-31 09:11:29,074 DEBUG: v3.10.2.dev2+g5c2fd67e6, CPython 3.11.4 on macOS-13.4.1-arm64-arm-64bit
2023-07-31 09:11:29,074 DEBUG: command: /Users/dave/micromamba/envs/dvc/bin/dvc import -v git@github.com:dberenbaum/dataset.git cats-dogs
2023-07-31 09:11:29,287 DEBUG: Removing output 'cats-dogs' of stage: 'cats-dogs.dvc'.
2023-07-31 09:11:29,287 DEBUG: Removing '/Users/dave/repo/cats-dogs'
Importing 'cats-dogs (git@github.com:dberenbaum/dataset.git)' -> 'cats-dogs'
2023-07-31 09:11:29,288 DEBUG: Computed stage: 'cats-dogs.dvc' md5: '9ce0eabbbaa6ff8015aef3312fd8d213'
2023-07-31 09:11:29,288 DEBUG: 'md5' of stage: 'cats-dogs.dvc' changed.
2023-07-31 09:11:29,288 DEBUG: Creating external repo git@github.com:dberenbaum/dataset.git@None
2023-07-31 09:11:29,289 DEBUG: erepo: git clone 'git@github.com:dberenbaum/dataset.git' to a temporary dir
2023-07-31 09:11:33,155 ERROR: failed to load ('cats-dogs',) - Unable to locate credentials
Traceback (most recent call last):
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/index/index.py", line 541, in _load_from_storage
_load_from_object_storage(trie, entry, storage)
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/index/index.py", line 477, in _load_from_object_storage
obj = Tree.load(
^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/hashfile/tree.py", line 191, in load
with obj.fs.open(obj.path, "r") as fobj:
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_objects/fs/base.py", line 222, in open
return self.fs.open(path, mode=mode, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/spec.py", line 1229, in open
self.open(
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/spec.py", line 1241, in open
f = self._open(
^^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/s3fs/core.py", line 659, in _open
return S3File(
^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/s3fs/core.py", line 2066, in __init__
super().__init__(
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/spec.py", line 1597, in __init__
self.size = self.details["size"]
^^^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/spec.py", line 1610, in details
self._details = self.fs.info(self.path)
^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/asyn.py", line 121, in wrapper
return sync(self.loop, func, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/asyn.py", line 106, in sync
raise return_result
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/asyn.py", line 61, in _runner
result[0] = await coro
^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/s3fs/core.py", line 1271, in _info
out = await self._call_s3(
^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/s3fs/core.py", line 342, in _call_s3
s3 = await self.get_s3(kwargs.get("Bucket"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/s3fs/core.py", line 336, in get_s3
return await self._s3creator.get_bucket_client(bucket)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/s3fs/utils.py", line 39, in get_bucket_client
response = await general_client.head_bucket(Bucket=bucket_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/client.py", line 354, in _make_api_call
http, parsed_response = await self._make_request(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/client.py", line 379, in _make_request
return await self._endpoint.make_request(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/endpoint.py", line 96, in _send_request
request = await self.create_request(request_dict, operation_model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/endpoint.py", line 84, in create_request
await self._event_emitter.emit(
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/hooks.py", line 66, in _emit
response = await resolve_awaitable(handler(**kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/_helpers.py", line 15, in resolve_awaitable
return await obj
^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/signers.py", line 24, in handler
return await self.sign(operation_name, request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/signers.py", line 82, in sign
auth.add_auth(request)
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/botocore/auth.py", line 418, in add_auth
raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials
2023-07-31 09:11:33,163 WARNING: 'cats-dogs' is empty.
2023-07-31 09:11:33,163 DEBUG: Added '/Users/dave/repo/cats-dogs' to gitignore file.
2023-07-31 09:11:33,165 DEBUG: built tree 'object d751713988987e9331980363e24189ce.dir'
2023-07-31 09:11:33,165 DEBUG: Computed stage: 'cats-dogs.dvc' md5: 'd1c1ff68c486e7407aa917addbde4316'
2023-07-31 09:11:33,166 DEBUG: built tree 'object d751713988987e9331980363e24189ce.dir'
2023-07-31 09:11:33,167 DEBUG: Preparing to transfer data from 'memory://dvc-staging-md5/e491ff18e115f87446131d48702971fec1cf49467aabcca1cbce150e09fb0192' to '/Users/dave/repo/.dvc/cache/files/md5'
2023-07-31 09:11:33,167 DEBUG: Preparing to collect status from '/Users/dave/repo/.dvc/cache/files/md5'
2023-07-31 09:11:33,167 DEBUG: Collecting status from '/Users/dave/repo/.dvc/cache/files/md5'
2023-07-31 09:11:33,168 DEBUG: Preparing to collect status from 'memory://dvc-staging-md5/e491ff18e115f87446131d48702971fec1cf49467aabcca1cbce150e09fb0192'
2023-07-31 09:11:33,168 DEBUG: transfer dir: md5: d751713988987e9331980363e24189ce.dir with 0 files
2023-07-31 09:11:33,169 DEBUG: built tree 'object d751713988987e9331980363e24189ce.dir'
2023-07-31 09:11:33,170 DEBUG: Removing '/Users/dave/repo/.2XQmKjhqscixAiPLGzL6ve.tmp'
2023-07-31 09:11:33,170 DEBUG: Removing '/Users/dave/repo/.2XQmKjhqscixAiPLGzL6ve.tmp'
2023-07-31 09:11:33,170 DEBUG: Removing '/Users/dave/repo/.dvc/cache/files/md5/.LHZ5pSSNtj4Nty4zXX7cSe.tmp'
2023-07-31 09:11:33,172 DEBUG: Saving information to 'cats-dogs.dvc'.
2023-07-31 09:11:33,173 DEBUG: Staging files: {'.gitignore', 'cats-dogs.dvc'}
2023-07-31 09:11:33,176 DEBUG: Analytics is disabled.
Expected
No cats-dogs.dvc
file to be created.
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 25 (23 by maintainers)
Commits related to this issue
- index: introduce basic onerror Also fixes legacy TreeError/DataIndexError mess and doesn't mark directories that we've failed to load as loaded in cache, preventing cache corruption. Kudos @dberenba... — committed to efiop/dvc-data by efiop a year ago
- index: introduce basic onerror Also fixes legacy TreeError/DataIndexError mess and doesn't mark directories that we've failed to load as loaded in cache, preventing cache corruption. Kudos @dberenba... — committed to efiop/dvc-data by efiop a year ago
- index: introduce basic onerror Also fixes legacy TreeError/DataIndexError mess and doesn't mark directories that we've failed to load as loaded in cache, preventing cache corruption. Kudos @dberenba... — committed to efiop/dvc-data by efiop a year ago
- index: introduce basic onerror Also fixes legacy TreeError/DataIndexError mess and doesn't mark directories that we've failed to load as loaded in cache, preventing cache corruption. Kudos @dberenba... — committed to efiop/dvc-data by efiop a year ago
- index: introduce basic onerror Also fixes legacy TreeError/DataIndexError mess and doesn't mark directories that we've failed to load as loaded in cache, preventing cache corruption. Kudos @dberenba... — committed to efiop/dvc-data by efiop a year ago
- index: introduce basic onerror Also fixes legacy TreeError/DataIndexError mess and doesn't mark directories that we've failed to load as loaded in cache, preventing cache corruption. Kudos @dberenba... — committed to efiop/dvc-data by efiop a year ago
- index: introduce basic onerror Also fixes legacy TreeError/DataIndexError mess and doesn't mark directories that we've failed to load as loaded in cache, preventing cache corruption. Kudos @dberenba... — committed to efiop/dvc-data by efiop a year ago
- index: introduce basic onerror Also fixes legacy TreeError/DataIndexError mess and doesn't mark directories that we've failed to load as loaded in cache, preventing cache corruption. Kudos @dberenba... — committed to efiop/dvc-data by efiop a year ago
- index: introduce basic onerror Also fixes legacy TreeError/DataIndexError mess and doesn't mark directories that we've failed to load as loaded in cache, preventing cache corruption. Kudos @dberenba... — committed to efiop/dvc-data by efiop a year ago
- index: introduce basic onerror Also fixes legacy TreeError/DataIndexError mess and doesn't mark directories that we've failed to load as loaded in cache, preventing cache corruption. Kudos @dberenba... — committed to iterative/dvc-data by efiop a year ago
This isn’t really corrupting cache.
d751713988987e9331980363e24189ce.dir
is the correct hash for an empty directory. The issue is that right now the underlyingdvcfs.find()
for the import source returns an empty directory instead of erroring out. So DVC is generating the correct cache entry for what it thinks is supposed to be an empty directory.(DVC is not generating a .dir file that is supposed to contain files and then skipping files the in that .dir)
This is backwards. The issue is that we are getting s3 errors (in dvc-data), but those errors were being suppressed and dvc-data was just returning that the directory was empty (instead of re-raising the errors to DVC).
The error handling has already been fixed in dvc-data by @efiop, and once the corresponding changes are made in DVC all of the import-related symptoms for this underlying issue should be resolved without the user needing to do anything. After the error handling is fixed,
fs.find()
will return actual files (assuming the user has fixed their credentials), sodvc status
will show that the import has changed (and runningdvc update
would generate a new dvc file that contains the actual hash for the imported dir).We were corrupting the “index cache” in
site_cache_dir
I guess it’s still related since the workflow is:
.dir
file.dir
file not found but no error raised because we catchFileNotFoundError
.dir
fileUpdate: it’s a bigger problem that’s not really about error handling at all. We are corrupting the cache, which is probably at the root of a lot of the recent issues.
Edit: so we aren’t getting any s3 errors here because dvc thinks it’s an empty directory.
Thanks @efiop, this seems to fix it if the credentials are missing since that will throw an exception for being unable to locate credentials, but I think I was too hasty saying that would fully fix the problem.
If the credentials are wrong, it generates a
FileNotFoundError
, and that still doesn’t get raised because of this block.I’m not sure what the best solution here is and haven’t had a chance to look into why we don’t raise
FileNotFoundError
there, or why there isn’t something like a 403 error raised anywhere.For testing, can we test in both dvc-data (for index methods and exception handling) and dvc (end-to-end download/import scenario)?
@dberenbaum Yeah, that’s what we’ve discussed today partially. I’ll introduce a mechanism for that shortly.
Looks like it’s ultimately coming from https://github.com/iterative/dvc-data/pull/409. Was this for studio @efiop? Can we handle these exceptions more carefully and raise them?
~Edit: You can skip this entire comment and go to the next one.~
It’s coming from here: https://github.com/iterative/dvc/blob/b235487eccb2a6c1d84bc0ef279b3d774003fee2/dvc/fs/__init__.py#L60-L64
That throws an error but it gets suppressed (~I can’t tell how or why~). We should be raising that error.
Edit: I found that it’s coming from here:
https://github.com/iterative/dvc/blob/b235487eccb2a6c1d84bc0ef279b3d774003fee2/dvc/fs/dvc.py#L291
Edit 2: Even if we remove that suppression, we still catch the exception here and fail to raise it 🤦 :
https://github.com/iterative/dvc/blob/b235487eccb2a6c1d84bc0ef279b3d774003fee2/dvc/fs/dvc.py#L367-L369
No, sorry for the confusion @farhanhubble. Currently, it’s working like this:
The previous expected behavior was: