dvc: stage add: StageExternalOutputsError on symlinked cache
Bug Report
Description
os.path.realpath resolves the symlink here: https://github.com/iterative/dvc/blob/82c5caee27d4b5591d4ab0b07fd1a73064ba8bff/dvc/output.py#L393 As a result, if you add a stage whose output already exists in the workspace as a symlink into the cache, DVC treats that output as external.
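The path behavior described above can be shown with the standard library alone. This is a minimal sketch, not DVC code: the directory names (repo, external_cache), the file names, and the startswith containment check are illustrative assumptions. It only demonstrates that resolving the link moves the path outside the repo directory, while the unresolved path stays inside it.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Normalise the temp dir itself, since it may sit behind a symlink.
    tmp = os.path.realpath(tmp)

    # Hypothetical layout: a repo directory and a cache directory outside it.
    repo = os.path.join(tmp, "repo")
    cache = os.path.join(tmp, "external_cache")
    os.makedirs(repo)
    os.makedirs(cache)

    # A cached file, and a workspace output that is a symlink to it.
    cached_file = os.path.join(cache, "blob")
    with open(cached_file, "w") as f:
        f.write("data")
    out = os.path.join(repo, "something")
    os.symlink(cached_file, out)

    # realpath follows the link into the cache; abspath does not.
    resolved = os.path.realpath(out)
    unresolved = os.path.abspath(out)

    resolved_in_repo = resolved.startswith(repo + os.sep)
    unresolved_in_repo = unresolved.startswith(repo + os.sep)
    print(resolved_in_repo, unresolved_in_repo)  # prints: False True
```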
Reproduce
- dvc init
- dvc config cache.dir /some/external/dir
- dvc config cache.type symlink
- run and cache something
- dvc stage add with --out set to something that has already been cached
Expected
out.is_in_repo on the Output should not be False for a symlink in the repo that points to the cache.
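One way to get the expected result is to resolve only the parent directory and leave the final path component unresolved. The sketch below is an assumption for illustration: the helper name is_in_repo, its signature, and the demo layout are hypothetical, and it is neither DVC's implementation nor the change made in the linked pull request.

```python
import os
import tempfile


def is_in_repo(path, repo_root):
    """Hypothetical containment check that does not follow a final symlink."""
    # Resolve symlinks in the parent directories only; keep the last
    # component as written, so a link inside the repo counts as in-repo.
    parent, name = os.path.split(os.path.abspath(path))
    candidate = os.path.join(os.path.realpath(parent), name)
    root = os.path.realpath(repo_root)
    return os.path.commonpath([candidate, root]) == root


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        repo = os.path.join(tmp, "repo")
        cache = os.path.join(tmp, "external_cache")
        os.makedirs(repo)
        os.makedirs(cache)

        cached_file = os.path.join(cache, "blob")
        with open(cached_file, "w") as f:
            f.write("data")

        # Workspace output that is a symlink into the external cache.
        out = os.path.join(repo, "something")
        os.symlink(cached_file, out)

        return is_in_repo(out, repo), is_in_repo(cached_file, repo)


link_in_repo, cache_file_in_repo = demo()
print(link_in_repo, cache_file_in_repo)  # prints: True False
```

The link inside the repo is reported as in-repo, while the cache file it points to is still reported as outside.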
Environment information
$ dvc doctor
DVC version: 2.9.1 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.11.0-41-generic-x86_64-with-glibc2.29
Supports:
hdfs (fsspec = 2021.10.1, pyarrow = 4.0.0),
webhdfs (fsspec = 2021.10.1),
http (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.6),
https (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.6),
s3 (s3fs = 2021.10.1, boto3 = 1.17.106),
ssh (sshfs = 2021.8.1)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sda2
Caches: local
Remotes: local, local
Workspace directory: ext4 on /dev/sda2
Repo: dvc, git
About this issue
- State: closed
- Created 3 years ago
- Comments: 15 (15 by maintainers)
This is also reproducible with dvc add, and it affects any symlink that points outside the repo directory (whether or not it is linked to the cache dir should be irrelevant in this scenario).
This breaks the documented use case where you can add a symlinked file to make DVC copy that file (from its symlinked location outside the repo) into the cache.
This use case has been superseded by add --to-cache, but if we intend to drop or deprecate the “add symlinked file” behavior, we need to document that properly (and account for the specific case where the symlink points to the DVC cache directory).
This was fixed by https://github.com/iterative/dvc/pull/9626