dvc: Make `dvc repro --pull` pull all missing files.

According to help for dvc repro:

  --pull                Try automatically pulling missing cache for outputs
                        restored from the run-cache.

and that’s what it does: it pulls missing files that are outputs restored from the run-cache. But if outputs are missing from “sources” (i.e. .dvc files that have only an output and no command at all), it won’t pull them. These must be pulled separately with a pull command.

Why not change it into “pull whatever is missing and necessary for this repro”? Is there any use case where a user wants to automatically download some missing files, but not all of them?

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 14
  • Comments: 15 (11 by maintainers)

Most upvoted comments

From the perspective of running a stage in a “clean” container:

  1. You want to run a stage only if its output will differ from one already computed; if it won’t, don’t download the needed inputs, to save resources.
  2. Pulling the output only to overwrite it is just a waste of resources.

Hi, I am leaving my comment as a user as suggested by @dberenbaum, I hope it can be helpful.

Let’s take the example A -> B -> C mentioned above. After defining the pipeline I want to run each stage in a dedicated container using an orchestrator like Argo (or Kubeflow or Airflow or …). So, for example, I will have a step in Argo where I want to execute the stage B. Then I would like to be able to do something along these lines:

git clone git-repo-containing-the-dvc-files
cd git-repo-containing-the-dvc-files
dvc repro B --new-flag
dvc add -R .
dvc push -R .
git add -A .
git commit -m "executed stage B"
git push

(--new-flag could be something like --pull-direct-dependency-only)

The command dvc repro B --new-flag should ideally pull (download) a.json (output of stage A) only if it changed, and it should run the stage B again only if there was any change.

Note from a design perspective: “There should be one-- and preferably only one --obvious way to do it”. I can imagine that a user may want to just pull the data as described without the repro part. I am wondering if it is wise to put everything in one command, or if it makes more sense to split what I described into two: pull and repro (already existing). So, something like:

dvc pull B --new-flag && dvc repro B

(--new-flag could be something like --only-changed-direct-dependency) This is up to your judgment.

After #5369, we should have the ability to determine whether a stage needs to be run without pulling the data locally. At that point, solving this issue seems straightforward (although still more complex than a typical CLI flag):

  1. Check the stage status without pulling data locally.
  2. If the status isn’t clean, pull the dependencies and run the stage.

In the initial comment for this issue, there is a good summary: “pull whatever is missing and necessary for this repro.”

I think it’s related to https://github.com/iterative/dvc/issues/5369, and that may be a prerequisite for the best-case scenario. A good solution for #5369 would determine whether there are any modified stages, taking into account what can be checked out from the cache/run-cache or pulled from remote storage.

Ideally, there should be an equivalent command that does this but also has the pipeline checkout/pull what’s needed and run the necessary stages. For each stage, the command should do something like:

  1. Check if dependencies have changed.
  2. If dependencies are missing, check to see if they are available anywhere locally or remotely. Don’t checkout/pull yet.
  3. Check the run-cache locally and remotely to determine if the stage needs to be run.
  4. If the stage needs to be run, checkout/pull as needed and run the stage.
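The four steps above could be sketched roughly as follows. This is only an illustration and not DVC's implementation: `plan_stage`, its parameters, and the sample hashes are all hypothetical. It assumes dependency hashes are known from metadata alone, and that the local cache, the remote, and the run-cache can be queried as plain sets.

```python
def plan_stage(cmd, locked_deps, current_deps, local, remote, run_cache):
    """Decide what to do with one stage without transferring any data.

    locked_deps / current_deps: path -> hash (last run vs. expected now).
    local / remote: sets of hashes available in each location (assumed).
    run_cache: set of (cmd, frozenset of dep items) keys with known results.
    Returns (action, paths_to_pull).
    """
    # Step 1: detect changed dependencies by comparing recorded hashes only.
    changed = current_deps != locked_deps

    # Step 2: locate missing dependencies, but do not checkout/pull yet.
    missing = sorted(p for p, h in current_deps.items() if h not in local)
    unavailable = [p for p in missing if current_deps[p] not in remote]

    # Step 3: an unchanged stage, or one with a run-cache hit, need not run.
    if not changed:
        return "skip", []
    key = (cmd, frozenset(current_deps.items()))
    if key in run_cache:
        return "restore", []

    # Step 4: the stage must run, so only now are the pulls scheduled.
    if unavailable:
        raise FileNotFoundError(f"not in cache or remote: {unavailable}")
    return "run", missing


locked = {"a.json": "old000", "b.py": "ccc333"}
current = {"a.json": "new111", "b.py": "ccc333"}
plan = plan_stage("python b.py", locked, current,
                  local={"ccc333"}, remote={"new111"}, run_cache=set())
print(plan)  # ('run', ['a.json'])
```

Only `a.json` is scheduled for download here, and only because the stage actually has to run; with a run-cache hit the same call would return `("restore", [])` and pull no dependencies.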

I don’t think dvc pull is sufficient because:

  1. I don’t want to pull unnecessary data.
     a. Pulling dependencies from unmodified or upstream stages doesn’t help. Dependencies only need to be pulled if other dependencies for that stage are modified.
     b. Pulling outputs doesn’t help. Intermediate outputs that are needed as dependencies of subsequent stages can be pulled later only if that subsequent stage needs to be run.
  2. I don’t want to have to determine which stages require me to pull. Having dvc pull flags in addition could be helpful for those who don’t want to automatically kick off repro, but I don’t think they are enough by themselves.

Another request in discord: https://discord.com/channels/485586884165107732/485596304961962003/945713992825991188

As noted there, this is really important for remote execution, so I think we need to prioritize this soon. cc @pmrowla @karajan1001

Determining whether or not the upstream stages are modified requires having the dependencies available (to see whether or not they have changed).

I would like to have an option to use the .dvc files and dvc.lock outputs from previous stages to pull missing dependencies. For example, I change and repro A, push the changes, then git clone elsewhere to repro B. Using the updated dvc.lock, DVC should be able to tell that B dependencies have changed, pull the necessary output from stage A, and run stage B.
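As a rough illustration of this idea, the comparison can be done on recorded hashes alone. The sketch below assumes a simplified dict that loosely mirrors the `stages` / `deps` / `outs` layout of dvc.lock; the function `stale_deps` and the hash values are hypothetical, and this is not how DVC currently behaves.

```python
def stale_deps(lock, stage):
    """Return paths of `stage` deps whose recorded hash differs from the
    hash recorded by the upstream stage that produces them."""
    # Map every output path to the hash its producing stage recorded.
    produced = {
        out["path"]: out["md5"]
        for entry in lock["stages"].values()
        for out in entry.get("outs", [])
    }
    return [
        dep["path"]
        for dep in lock["stages"][stage].get("deps", [])
        if dep["path"] in produced and produced[dep["path"]] != dep["md5"]
    ]


# Hypothetical lock data: A was re-run elsewhere, so a.json has a new hash,
# while B still records the hash it last ran against.
lock = {
    "stages": {
        "A": {
            "cmd": "python a.py",
            "outs": [{"path": "a.json", "md5": "new111"}],
        },
        "B": {
            "cmd": "python b.py",
            "deps": [{"path": "a.json", "md5": "old000"}],
            "outs": [{"path": "b.json", "md5": "bbb222"}],
        },
    }
}

to_pull = stale_deps(lock, "B")
print(to_pull)  # ['a.json']
```

No data is read to reach this answer: the mismatch between A's recorded output and B's recorded dependency is enough to know that B is stale and that `a.json` is the one file worth pulling.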

DVC doesn’t work this way right now, and supporting this would be a bigger change than just repro --pull.

You’re right, it’s a bigger change than a typical command option.

In this case, it sounds like you are talking about where I have a pipeline like A -> B -> C, and I want to repro C, so DVC would only check the dependencies for A, since the intermediate outputs for A and B are not actually “required” here, just the deps for A and outputs for C. Is that correct?

If I want to repro C:

  1. Check the dependencies for A. If any are missing but have a corresponding .dvc file, assume that’s what should be used in A.
  2. If there are changes and no corresponding entry in the run-cache, pull the missing dependencies and run A. Otherwise, skip A and don’t pull dependencies (update dvc.lock to reflect any changes retrieved from the run-cache).
  3. Check the dependencies for B. If A was skipped, its outputs will be missing. Use its dvc.lock outputs to determine the missing dependencies for B.
  4. Repeat this process for the rest of B and C.
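To make the walk-through above concrete, here is a minimal Python sketch of the same loop over A -> B -> C. Everything in it is an assumption for illustration: the `repro` function, the dict-based `lock`, `sources` (hashes taken from .dvc files), and `run_cache`, and the injected `pull` and `run` callables. It is not DVC's API.

```python
def repro(stages, lock, sources, run_cache, pull, run):
    """Walk stages in (assumed) topological order, pulling only what a
    stage that really has to run needs. Mutates and returns `lock`."""
    known = dict(sources)  # path -> expected hash, resolved without data
    present = set()        # paths whose data is actually in the workspace
    for name, stage in stages.items():
        current = {p: known[p] for p in stage["deps"]}
        entry = lock.get(name)
        key = (stage["cmd"], frozenset(current.items()))
        if entry and entry["deps"] == current:
            outs = entry["outs"]       # unchanged: skip, reuse lock outputs
        elif key in run_cache:
            outs = run_cache[key]      # run-cache hit: skip, no pulls
        else:
            for path in stage["deps"]:
                if path not in present:
                    pull(path, current[path])
                    present.add(path)
            outs = run(name)           # returns path -> new hash
            present.update(outs)
        lock[name] = {"deps": current, "outs": dict(outs)}
        known.update(outs)
    return lock


stages = {
    "A": {"cmd": "python a.py", "deps": ["raw.csv"]},
    "B": {"cmd": "python b.py", "deps": ["a.json"]},
    "C": {"cmd": "python c.py", "deps": ["b.json"]},
}
# A was already re-run elsewhere; B and C still record stale inputs.
lock = {
    "A": {"deps": {"raw.csv": "r1"}, "outs": {"a.json": "a1"}},
    "B": {"deps": {"a.json": "a0"}, "outs": {"b.json": "b0"}},
    "C": {"deps": {"b.json": "b0"}, "outs": {"c.json": "c0"}},
}
new_outs = {"B": {"b.json": "b1"}, "C": {"c.json": "c1"}}
pulled, ran = [], []


def fake_pull(path, md5):
    pulled.append((path, md5))  # stand-in for a real download


def fake_run(name):
    ran.append(name)            # stand-in for executing the stage command
    return new_outs[name]


repro(stages, lock, {"raw.csv": "r1"}, {}, fake_pull, fake_run)
print(pulled, ran)  # [('a.json', 'a1')] ['B', 'C']
```

In this run `raw.csv` is never downloaded because A is skipped, and `b.json` is never downloaded because B has just produced it locally. Only `a.json` is pulled, which is what "pull whatever is missing and necessary for this repro" would mean for this pipeline.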