dvc: stage add: StageExternalOutputsError on symlinked cache

Bug Report

Description

os.path.realpath resolves the symlink here: https://github.com/iterative/dvc/blob/82c5caee27d4b5591d4ab0b07fd1a73064ba8bff/dvc/output.py#L393 so if you add a stage whose output already exists and is a symlink into the cache, DVC treats that output as external.
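
The effect can be shown without DVC. The sketch below is not DVC's code: is_inside, the temporary directory layout, and the follow_final_symlink flag are all hypothetical names made up for illustration. It only shows that a containment check which calls os.path.realpath on the full path sees the link's target, while a check that leaves the final path component unresolved sees the link's own location.

```python
import os
import tempfile


def is_inside(path, root, follow_final_symlink):
    """Hypothetical containment check (illustration only, not DVC's code)."""
    root = os.path.realpath(root)
    if follow_final_symlink:
        # realpath follows the link, so we test where the *target* lives.
        full = os.path.realpath(path)
    else:
        # Resolve only the parent directory and keep the link itself as-is,
        # so we test where the *link* lives.
        parent, name = os.path.split(os.path.abspath(path))
        full = os.path.join(os.path.realpath(parent), name)
    return os.path.commonpath([full, root]) == root


def demo_containment():
    with tempfile.TemporaryDirectory() as tmp:
        repo = os.path.join(tmp, "repo")
        cache = os.path.join(tmp, "external_cache")  # stands in for cache.dir
        os.makedirs(repo)
        os.makedirs(cache)

        target = os.path.join(cache, "data.bin")
        with open(target, "w") as f:
            f.write("payload")

        # The workspace file is a symlink into the external cache,
        # as with cache.type symlink.
        link = os.path.join(repo, "data.bin")
        os.symlink(target, link)

        return (
            is_inside(link, repo, follow_final_symlink=True),
            is_inside(link, repo, follow_final_symlink=False),
        )


if __name__ == "__main__":
    print(demo_containment())
```

On a POSIX system the first check reports the file as outside the repo and the second reports it as inside, which matches the behavior described above.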

Reproduce

  1. dvc init
  2. dvc config cache.dir /some/external/dir
  3. dvc config cache.type symlink
  4. run and cache something
  5. dvc stage add with --out pointing at something that has already been cached

Expected

Output.is_in_repo should not be False for a symlink in the repo that points to the cache.

Environment information

$ dvc doctor
DVC version: 2.9.1 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.11.0-41-generic-x86_64-with-glibc2.29
Supports:
        hdfs (fsspec = 2021.10.1, pyarrow = 4.0.0),
        webhdfs (fsspec = 2021.10.1),
        http (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2021.10.1, boto3 = 1.17.106),
        ssh (sshfs = 2021.8.1)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sda2
Caches: local
Remotes: local, local
Workspace directory: ext4 on /dev/sda2
Repo: dvc, git

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

This is also reproducible in dvc add and it affects any symlink that points outside the repo directory (whether or not it’s linked to the cache dir should be irrelevant in this scenario).

test-stage-add git:master ✓ ✩  py:dvc ❯ la
total 8.0K
drwxr-xr-x  6 pmrowla staff 192 Dec 15 11:01 .dvc
-rw-r--r--  1 pmrowla staff 139 Dec 15 10:51 .dvcignore
drwxr-xr-x 11 pmrowla staff 352 Dec 15 11:09 .git
-rw-r--r--  1 pmrowla staff   5 Dec 15 10:54 .gitignore
lrwxr-xr-x  1 pmrowla staff   6 Dec 15 11:02 foo -> ../foo

test-stage-add git:master ✓ ✩  py:dvc ❯ dvc add -v foo
2021-12-15 11:09:21,483 DEBUG: Adding '/Users/pmrowla/git/scratch/test-stage-add/.dvc/config.local' to gitignore file.
2021-12-15 11:09:21,490 DEBUG: Adding '/Users/pmrowla/git/scratch/test-stage-add/.dvc/tmp' to gitignore file.
2021-12-15 11:09:21,490 DEBUG: Adding '/Users/pmrowla/git/scratch/test-stage-add/.dvc/cache' to gitignore file.
2021-12-15 11:09:21,505 ERROR: Output(s) outside of DVC project: foo. See <https://dvc.org/doc/user-guide/managing-external-data> for more info.
------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pmrowla/git/dvc/dvc/command/add.py", line 21, in run
    self.repo.add(
  File "/Users/pmrowla/git/dvc/dvc/utils/collections.py", line 163, in inner
    result = func(*ba.args, **ba.kwargs)
  File "/Users/pmrowla/git/dvc/dvc/repo/__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "/Users/pmrowla/git/dvc/dvc/repo/scm_context.py", line 152, in run
    return method(repo, *args, **kw)
  File "/Users/pmrowla/git/dvc/dvc/repo/add.py", line 171, in add
    stages = list(ui.progress(stages_it, desc=desc, unit="file"))
  File "/Users/pmrowla/.virtualenvs/dvc/lib/python3.9/site-packages/tqdm/std.py", line 1180, in __iter__
    for obj in iterable:
  File "/Users/pmrowla/git/dvc/dvc/repo/add.py", line 259, in create_stages
    stage = repo.stage.create(
  File "/Users/pmrowla/git/dvc/dvc/repo/stage.py", line 181, in create
    stage = create_stage(
  File "/Users/pmrowla/git/dvc/dvc/stage/__init__.py", line 88, in create_stage
    check_no_externals(stage)
  File "/Users/pmrowla/git/dvc/dvc/stage/utils.py", line 136, in check_no_externals
    raise StageExternalOutputsError(
dvc.stage.exceptions.StageExternalOutputsError: Output(s) outside of DVC project: foo. See <https://dvc.org/doc/user-guide/managing-external-data> for more info.
------------------------------------------------------------

This breaks the documented use case where you can add a symlinked file to make DVC copy that file (from the outside-of-repo symlinked location) into the cache.

This use case has been superseded by add --to-cache, but if we intend to drop/deprecate the “add symlinked file” behavior, we need to document it properly (and account for the specific case where the symlink points to the DVC cache directory).
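
To make the two cases concrete, here is a minimal sketch of how a symlink into the cache directory could be told apart from a symlink to some other outside location. It is not a proposed patch and does not use any DVC API: classify_output, the category strings, and the directory layout are hypothetical, and the sketch assumes POSIX paths on a single filesystem root.

```python
import os
import tempfile


def _under(child, parent):
    # True if child is parent itself or lives somewhere below it.
    return os.path.commonpath([child, parent]) == parent


def classify_output(path, repo_root, cache_dir):
    """Hypothetical helper; names and categories are illustrative only."""
    repo_root = os.path.realpath(repo_root)
    cache_dir = os.path.realpath(cache_dir)
    resolved = os.path.realpath(path)

    if _under(resolved, repo_root):
        return "in-repo"
    if os.path.islink(path):
        # The link target is outside the repo: is it the cache or elsewhere?
        if _under(resolved, cache_dir):
            return "cache-link"
        return "external-link"
    return "external"


def demo_classify():
    with tempfile.TemporaryDirectory() as tmp:
        repo = os.path.join(tmp, "repo")
        cache = os.path.join(tmp, "cache")
        other = os.path.join(tmp, "other")
        for d in (repo, cache, other):
            os.makedirs(d)
        for d in (cache, other):
            with open(os.path.join(d, "blob"), "w") as f:
                f.write("x")
        with open(os.path.join(repo, "plain"), "w") as f:
            f.write("x")

        # "cached" mimics a cache.type symlink checkout,
        # "foo" mimics the foo -> ../foo link from the log above.
        os.symlink(os.path.join(cache, "blob"), os.path.join(repo, "cached"))
        os.symlink(os.path.join(other, "blob"), os.path.join(repo, "foo"))

        return {
            name: classify_output(os.path.join(repo, name), repo, cache)
            for name in ("cached", "foo", "plain")
        }


if __name__ == "__main__":
    print(demo_classify())
```

Whether the "external-link" case should still be copied into the cache or rejected is the policy question raised above; the sketch only separates the cases.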