dvc: `pull`: `-R` does not check immediate target

Bug Report

Description

Firstly, from the docs I realise that pull -R <target> is probably working exactly as advertised.

In the VS Code extension, we show a tracked tree which can be used to selectively pull files from the remote.

We currently use the output of dvc list . -R --show-json --dvc-only to generate this tree (we will shortly be using the output from the new data:status command). We mark everything provided by the list output as tracked.

When calling pull against these tracked paths we check to see if the path exists in the list output. If it does then we call dvc pull <target>. If it does not we call dvc pull -R <target>.
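The selection logic described above can be sketched roughly as follows. This is an illustrative Python sketch, not the extension's actual code: the helper names `listed_paths` and `build_pull_args` are hypothetical, and it assumes the JSON printed by dvc list is an array of objects that each carry a `path` field.

```python
from __future__ import annotations

import json


def listed_paths(list_json: str) -> set[str]:
    """Collect paths from `dvc list . -R --show-json --dvc-only` output.

    Assumption: the JSON is an array of objects with a "path" key.
    """
    return {entry["path"] for entry in json.loads(list_json)}


def build_pull_args(target: str, tracked: set[str]) -> list[str]:
    """Return the argv for the pull call described above (hypothetical helper)."""
    if target in tracked:
        # The path appears in the list output: pull it directly.
        return ["dvc", "pull", target]
    # The path is not in the list output (e.g. a directory): fall back to -R.
    return ["dvc", "pull", "-R", target]


# Made-up sample data shaped like the assumed list output.
sample = '[{"path": "model.pt"}, {"path": "training_metrics/scalars/acc.tsv"}]'
tracked = listed_paths(sample)
direct = build_pull_args("model.pt", tracked)
recursive = build_pull_args("training_metrics", tracked)
```

In this sketch a directory such as training_metrics never appears in the recursive list output, so it always takes the -R branch, which is the case this report is about.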

When calling dvc pull -R we get mixed results. Here is an example of -R reporting that everything is up to date even though the data clearly has not been pulled:

https://user-images.githubusercontent.com/37993418/168737919-52548709-2a98-4f30-8658-53bd16c2b709.mov

dvc.yaml for the above project is here. training_metrics is tracked but there is no way currently for us to easily/consistently tell this from the combined output of list, status & diff.

Reproduce

  1. Open demo project for the first time.
  2. Run dvc pull -R training_metrics from the root.
  3. The command returns “everything is up to date”.
  4. No data has been updated.

Expected

dvc pull -R target checks the target itself as well as searching inside the target.

We could take the alternative approach of including the appropriate information in the new data:status command. That is, training_metrics/ would be provided as part of the output to let us know that it is tracked.
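The expected selection rule can be sketched as below. This is a minimal Python sketch under stated assumptions and not DVC's implementation: `outputs_for_target` is a hypothetical name, and `tracked_outs` stands for a made-up list of output paths as declared in .dvc files and dvc.yaml.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def outputs_for_target(target: str, tracked_outs: list[str]) -> list[str]:
    """Select tracked outputs that are the target itself or live inside it."""
    root = PurePosixPath(target)
    selected = []
    for out in tracked_outs:
        path = PurePosixPath(out)
        # Match the target itself, or any output nested below the target.
        if path == root or root in path.parents:
            selected.append(out)
    return selected


# Hypothetical outputs modelled on the demo project in this report.
tracked_outs = ["data/MNIST/raw", "model.pt", "training_metrics"]
same = outputs_for_target("training_metrics", tracked_outs)
nested = outputs_for_target("data", tracked_outs)
```

Comparing path components rather than string prefixes keeps a target like "train" from matching training_metrics by accident.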

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.10.2 (pip)
---------------------------------
Platform: Python 3.9.9 on macOS-12.3.1-x86_64-i386-64bit
Supports:
        webhdfs (fsspec = 2022.3.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2022.3.0, boto3 = 1.21.21)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5s1
Caches: local
Remotes: s3
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc (subdir), git

Additional Information (if any):

Please let me know if you need anything else from me. Thanks

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 34 (34 by maintainers)

Most upvoted comments

@skshetry yep, usually I’m testing on the dev version, in this case dvc was coming from a different project. It is improved to 0.21! 🎉

Yep, I right-click on data, ask to pull, and expect it to bring me data/MNIST/raw inside, without me going two levels down (I don’t even know which one is tracked). Or in example-get-started I’d like to do dvc pull -R data to download data.xml and some intermediate things inside … again, it’s all about a simple and intuitive interface to manipulate data.

Current behavior is not useful at all to my mind and comes from some legacy (when dvc pull was taking only .dvc files as targets).

Yup, makes sense. We need to move all our commands towards operating on all DVC-tracked data within a path without the users worrying about where the paths are specified in .dvc, dvc.yaml, etc. I think the current dvc pull -R logic is how most DVC commands work, so I would like to have a more systematic effort to change it across commands rather than have inconsistent behavior.

I think it’s somewhat related to the goal to “auto manage directories,” which is currently planned for Q3, and dvc data status is planned with this in mind.

@shcheklein @mattseddon What is the priority for VS Code (when do you need it)?

@efiop Any thoughts?

Hm, not sure. For me it’s <0.6 seconds. Still not sure if that’s fast enough for bigger repos, but 10s seems odd.

~Hmm … do you have a virtualenv setup with all the deps installed? I have it in the .venv in the root of the project.~ https://github.com/iterative/dvc/issues/7756#issuecomment-1133169297

Can you explain more how dvc pull -R is being used by VS Code? In what circumstances is it called?

Yep, I right-click on data, ask to pull, and expect it to bring me data/MNIST/raw inside, without me going two levels down (I don’t even know which one is tracked). Or in example-get-started I’d like to do dvc pull -R data to download data.xml and some intermediate things inside … again, it’s all about a simple and intuitive interface to manipulate data.

Current behavior is not useful at all to my mind and comes from some legacy (when dvc pull was taking only .dvc files as targets).

@efiop looks like the change that introduced the renamed issue was this one: https://github.com/iterative/dvc/commit/a80a85e6bd144189bf63df535483ae628136ce14.

Was there anything in there that would make a subrepo exhibit this diff no matter what (~/vscode-dvc is the git repo, ~/vscode-dvc/demo is the dvc project):

~/vscode-dvc/demo ❯  dvc diff
Renamed:                                                              
    demo/data/MNIST/raw/ -> data/MNIST/raw/
    demo/data/MNIST/raw/t10k-images-idx3-ubyte -> data/MNIST/raw/t10k-images-idx3-ubyte
    demo/data/MNIST/raw/t10k-images-idx3-ubyte.gz -> data/MNIST/raw/t10k-images-idx3-ubyte.gz
    demo/data/MNIST/raw/t10k-labels-idx1-ubyte -> data/MNIST/raw/t10k-labels-idx1-ubyte
    demo/data/MNIST/raw/t10k-labels-idx1-ubyte.gz -> data/MNIST/raw/t10k-labels-idx1-ubyte.gz
    demo/data/MNIST/raw/train-images-idx3-ubyte -> data/MNIST/raw/train-images-idx3-ubyte
    demo/data/MNIST/raw/train-images-idx3-ubyte.gz -> data/MNIST/raw/train-images-idx3-ubyte.gz
    demo/data/MNIST/raw/train-labels-idx1-ubyte -> data/MNIST/raw/train-labels-idx1-ubyte
    demo/data/MNIST/raw/train-labels-idx1-ubyte.gz -> data/MNIST/raw/train-labels-idx1-ubyte.gz
    demo/misclassified.jpg -> misclassified.jpg
    demo/model.pt -> model.pt
    demo/predictions.json -> predictions.json
    demo/training_metrics.json -> training_metrics.json
    demo/training_metrics/ -> training_metrics/
    demo/training_metrics/scalars/acc.tsv -> training_metrics/scalars/acc.tsv
    demo/training_metrics/scalars/loss.tsv -> training_metrics/scalars/loss.tsv

I can see that scmrepo got bumped from 0.0.19 to 0.0.22 in that commit.

LMK if you want a separate issue for this.

Thanks.

@shcheklein, --dvc-only was optimized in https://github.com/iterative/dvc/pull/7659. It is not released yet.

@dberenbaum

I’m still not following the original issue since it seems that pull -R is working as expected for me. Can you clarify?

It’s not a bug, but the behavior might be confusing, to be honest. But even that doesn’t matter - we need to find a way to pull all tracked files inside a directory. -R gives something completely different (good question how useful it is, but it’s a separate topic). Question - do we have some workaround for the behavior we are looking for?

@mattseddon yes, that’s expected, it’s not a bug. This is the current behavior of -R. I’ve tried to explain it here. Now we need to find a workaround with the DVC team.

~~Related: https://github.com/iterative/dvc/issues/5326~~

EDIT: not related, it’s a bit different.