dvc: Make `dvc repro --pull` pull all missing files.
According to the help for `dvc repro`:
    --pull  Try automatically pulling missing cache for outputs restored from the run-cache.
and that’s what it does. Pulls missing files that are outputs restored from the run-cache. But if there are outputs missing from “sources” (i.e. dvc files having only output and no command at all), it won’t pull them. These must be pulled separately with a `pull` command.
Why not change it into “pull whatever is missing and necessary for this repro”? Is there any use case where a user wants to automatically download some missing files, but not all of them?
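To make the proposal concrete, here is a minimal sketch of what “missing and necessary for this repro” could mean for source files. It is a toy model, not DVC's implementation: the `Stage` class, `upstream`, `missing_sources`, and all file names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Stage:
    """Toy stand-in for a pipeline stage or a source file (not DVC's real class)."""
    outs: List[str]
    deps: List[str] = field(default_factory=list)
    cmd: Optional[str] = None  # None models a "source": outputs only, no command


def upstream(stages: Dict[str, Stage], target: str) -> Set[str]:
    """Return the target plus every stage it transitively depends on."""
    producer = {out: name for name, s in stages.items() for out in s.outs}
    seen: Set[str] = set()
    todo = [target]
    while todo:
        name = todo.pop()
        if name in seen:
            continue
        seen.add(name)
        todo.extend(producer[d] for d in stages[name].deps if d in producer)
    return seen


def missing_sources(stages, target, workspace):
    """List source outputs that the repro of `target` needs but that are absent locally."""
    needed = set()
    for name in upstream(stages, target):
        if stages[name].cmd is None:  # sources cannot be regenerated, only pulled
            needed.update(stages[name].outs)
    return sorted(needed - set(workspace))


# Hypothetical project: two source files, two pipeline stages.
stages = {
    "raw.dvc": Stage(outs=["raw.csv"]),
    "unrelated.dvc": Stage(outs=["other.csv"]),
    "prepare": Stage(outs=["clean.csv"], deps=["raw.csv"], cmd="python prepare.py"),
    "train": Stage(outs=["model.pkl"], deps=["clean.csv"], cmd="python train.py"),
}
print(missing_sources(stages, "train", workspace=[]))  # ['raw.csv']
```

In this model `other.csv` is never requested, because nothing upstream of `train` depends on it. That is the distinction between “pull everything” and “pull what this repro needs”.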
About this issue
- State: closed
- Created 4 years ago
- Reactions: 14
- Comments: 15 (11 by maintainers)
Commits related to this issue
- reproduce: Make `pull` flag to also pull `run cache`, `data_sources` and `imports`. Closes #4742 — committed to iterative/dvc by daavoo a year ago
- reproduce: Update `pull` flag to also pull `run cache`, `data_sources` and `imports`. Closes #4742 — committed to iterative/dvc by daavoo a year ago
- reproduce: Update pull flag to also pull run cache, data_sources and imports. Closes #4742 — committed to iterative/dvc by daavoo a year ago
From the perspective of running a stage in a “clean” container, you want to run a stage only if:
Hi, I am leaving my comment as a user as suggested by @dberenbaum, I hope it can be helpful.
Let’s take the example A -> B -> C mentioned above. After defining the pipeline I want to run each stage in a dedicated container using an orchestrator like Argo (or Kubeflow or Airflow or …). So, for example, I will have a step in Argo where I want to execute the stage B. Then I would like to be able to do something along these lines:
(`--new-flag` could be something like `--pull-direct-dependency-only`)
The command `dvc repro B --new-flag` should ideally pull (download) `a.json` (output of stage A) only if it changed, and it should run the stage B again only if there was any change.
Note from a design perspective: “There should be one-- and preferably only one --obvious way to do it”. I can imagine that a user may want to just pull the data as described without the repro part. I am wondering if it is wise to put everything in one command, or if it rather makes more sense to split what I described in two: pull and repro (already existing). So, something like:
(`--new-flag` could be something like `--only-changed-direct-dependency`) This is up to your judgment.
After #5369, we should have the ability to determine whether a stage needs to be run without pulling the data locally. At that point, solving this issue seems straightforward (although still more complex than a typical CLI flag):
In the initial comment for this issue, there is a good summary: “pull whatever is missing and necessary for this repro.”
I think it’s related to https://github.com/iterative/dvc/issues/5369, and that may be a prerequisite for the best-case scenario. A good solution for #5369 would determine whether there are any modified stages, taking into account what can be checked out from the cache/run-cache or pulled from remote storage.
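As a rough illustration of the kind of check described here, the sketch below classifies a stage using hashes only, without touching any data. The function name `classify_stage`, the simplified lock entry (plain `path -> hash` mappings rather than the real `dvc.lock` layout), and the three result labels are assumptions made for this example. It also ignores the run-cache for stages whose dependencies changed.

```python
def classify_stage(entry, current_deps, local_cache, remote):
    """Decide what a stage needs, comparing hashes only (no file contents).

    entry        -- simplified lock entry: {"deps": {path: hash}, "outs": {path: hash}}
    current_deps -- {path: hash} of the dependencies as they are known now
    local_cache  -- set of hashes available in the local cache
    remote       -- set of hashes available in remote storage
    """
    # A dependency whose hash differs from the recorded one means the stage is modified.
    for path, recorded in entry["deps"].items():
        if current_deps.get(path) != recorded:
            return "run"

    out_hashes = set(entry["outs"].values())
    if out_hashes <= local_cache:
        return "checkout"  # everything can be restored locally
    if out_hashes <= (local_cache | remote):
        return "pull"      # unchanged, but some outputs must be downloaded
    return "run"           # outputs are not retrievable anywhere, so regenerate


entry = {"deps": {"a.json": "h1"}, "outs": {"b.json": "h2"}}
print(classify_stage(entry, {"a.json": "h1"}, set(), {"h2"}))  # pull
```

Under these assumptions a stage only counts as modified when its inputs changed or its recorded outputs cannot be obtained from any storage.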
Ideally, there should be an equivalent to do this but have the pipeline checkout/pull what’s needed and run the necessary stages. For each stage, the command should do something like:
I don’t think `dvc pull` is sufficient because:
`dvc pull` flags in addition could be helpful for those who don’t want to automatically kick off `repro`, but I don’t think they are enough by themselves.
Another request in discord: https://discord.com/channels/485586884165107732/485596304961962003/945713992825991188
As noted there, this is really important for remote execution, so I think we need to prioritize this soon. cc @pmrowla @karajan1001
I would like to have an option to use the `.dvc` files and `dvc.lock` outputs from previous stages to pull missing dependencies. For example, I change and repro A, push the changes, then git clone elsewhere to repro B. Using the updated `dvc.lock`, DVC should be able to tell that B dependencies have changed, pull the necessary output from stage A, and run stage B.
You’re right, it’s a bigger change than a typical command option.
If I want to repro `C`:
- […] `A`. If any are missing but have a corresponding `.dvc` file, assume that’s what should be used in `A`.
- […] `A`. Otherwise, skip `A` and don’t pull dependencies (update `dvc.lock` to reflect any changes retrieved from the run-cache).
- […] `B`. If `A` was skipped, its outputs will be missing. Use its `dvc.lock` outputs to determine the missing dependencies for `B`.
- […] `B` and `C`.
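The walk over A -> B -> C outlined above can be sketched as a small planner. This is one possible reading of the steps, not DVC's code: `plan_repro`, the simplified lock structure, the `dvc_files` mapping, and the `changed` set (which stages need to run, assumed to be decided elsewhere) are all hypothetical.

```python
def plan_repro(order, lock, dvc_files, workspace, changed):
    """Return the ordered actions for reproducing the last stage in `order`.

    order     -- stage names in dependency order, e.g. ["A", "B", "C"]
    lock      -- simplified lock data: {stage: {"deps": [paths], "outs": {path: hash}}}
    dvc_files -- {path: hash} for source data tracked by `.dvc` files
    workspace -- paths already present locally
    changed   -- names of stages that must run (assumed to be computed elsewhere)
    """
    present = set(workspace)
    # Hashes known without any data: `.dvc` files plus the outputs recorded in the lock.
    known = dict(dvc_files)
    for name in order:
        known.update(lock[name]["outs"])

    actions = []
    for name in order:
        if name not in changed:
            actions.append(("skip", name))  # skipped: do not pull its dependencies
            continue
        for path in lock[name]["deps"]:
            if path not in present:
                # A missing input is located via a `.dvc` file or an upstream lock entry.
                actions.append(("pull", path, known[path]))
                present.add(path)
        actions.append(("run", name))
        present.update(lock[name]["outs"])  # freshly produced outputs are now local
    return actions


lock = {
    "A": {"deps": ["raw.csv"], "outs": {"a.json": "h-a"}},
    "B": {"deps": ["a.json"], "outs": {"b.json": "h-b"}},
    "C": {"deps": ["b.json"], "outs": {"c.json": "h-c"}},
}
plan = plan_repro(["A", "B", "C"], lock, {"raw.csv": "h-raw"}, set(), {"B", "C"})
for action in plan:
    print(action)
# ('skip', 'A')
# ('pull', 'a.json', 'h-a')
# ('run', 'B')
# ('run', 'C')
```

In this example `A` is skipped, so `raw.csv` is never downloaded. Only `a.json`, identified through the outputs recorded for `A`, is pulled before `B` runs.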