dvc: DVC Repro incorrectly saving directory to cache
Bug Report
UPDATE: Skip to https://github.com/iterative/dvc/issues/4428#issuecomment-778259232
As the title states. I run dvc repro extract
for my extract
pipeline stage. This takes two pretty large zip files and extracts them into specified folders. These folders should not be added to the dvc cache, since they can be easily reproduced by extracting the archives, but I declare them as dependencies so that the DAG looks nicer.
My dvc.yaml
looks like this. The preprocess
stage should only indicate, that the not cached folders should be used as dependencies in later stages.
stages:
extract:
cmd: tar -xzvf data/thingy10k/10k_tetmesh.tar.gz -C data/thingy10k/ && tar -xzvf data/thingy10k/10k_surface.tar.gz -C data/thingy10k/
deps:
- data/thingy10k/10k_surface.tar.gz
- data/thingy10k/10k_tetmesh.tar.gz
outs:
- data/thingy10k/10k_surface:
cache: false
- data/thingy10k/10k_tetmesh:
cache: false
preprocess:
cmd: some preprocessing command
deps:
- data/thingy10k/10k_tetmesh
- data/thingy10k/10k_surface
After running dvc repro extract
, the hashes of all the files are computed and then the files saved to cache. This is exactly the thing that I was trying to prevent with the cache: false
option.
I confirmed that the contents of the output folders were indeed added to the cache by using du -sh .dvc/cache
, which went up by exactly the size of the two folders after running the command.
Interestingly, after running dvc gc -a
the cache is freed again. Also running dvc push
(without first running dvc gc -a
) to push the cache to my remote storage also says everything is up to date, which leads me to believe that dvc recognizes that these directories should in fact not be cached.
I have reproduced this in a local git and dvc repo by first adding the two archives using dvc add
and then running the above mentioned extract stage. The archives can be downloaded from this repo https://github.com/Yixin-Hu/TetWild#dataset under Output
.
Please provide information about your setup
Output of dvc version
:
$ dvc version
DVC version: 1.6.0 (pip)
---------------------------------
Platform: Python 3.7.6 on Linux-5.4.0-42-generic-x86_64-with-debian-bullseye-sid
Supports: http, https, ssh, webdav, webdavs
Cache types: hardlink, symlink
Repo: dvc, git
Additional Information (if any):
If applicable, please also provide a --verbose
output of the command, eg: dvc add --verbose
.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 3
- Comments: 33 (22 by maintainers)
Commits related to this issue
- repro: Disable `run-cache` if `--no-commit`. As described in https://github.com/iterative/dvc/issues/4428#issuecomment-778263005. Add `run_cache` argument to `Stage.save`. Set it to `False` if `no_c... — committed to iterative/dvc by daavoo 2 years ago
- stage: Skip `run-cache` if any out has `cache: false`. As described in https://github.com/iterative/dvc/issues/4428#issuecomment-778259232 — committed to iterative/dvc by daavoo 2 years ago
- stage: Skip `run-cache` if any out has `cache: false`. As described in https://github.com/iterative/dvc/issues/4428#issuecomment-778259232 — committed to iterative/dvc by daavoo 2 years ago
- stage: Skip `run-cache` if any out has `cache: false`. As described in https://github.com/iterative/dvc/issues/4428#issuecomment-778259232 — committed to iterative/dvc by daavoo 2 years ago
- stage: Skip `run-cache` if any out has `cache: false`. As described in https://github.com/iterative/dvc/issues/4428#issuecomment-778259232 — committed to iterative/dvc by daavoo 2 years ago
- stage: Skip `run-cache` if any out has `cache: false`. As described in https://github.com/iterative/dvc/issues/4428#issuecomment-778259232 — committed to iterative/dvc by daavoo 2 years ago
- repro: Disable `run-cache` if `--no-commit`. As described in https://github.com/iterative/dvc/issues/4428#issuecomment-778263005. Add `run_cache` argument to `Stage.save`. Set it to `False` if `no_c... — committed to iterative/dvc by daavoo 2 years ago
- repro: Disable `run-cache` if `--no-commit`. As described in https://github.com/iterative/dvc/issues/4428#issuecomment-778263005. Add `run_cache` argument to `Stage.save`. Set it to `False` if `no_c... — committed to iterative/dvc by daavoo 2 years ago
- repro: Disable `run-cache` if `--no-commit`. As described in https://github.com/iterative/dvc/issues/4428#issuecomment-778263005. Add `run_cache` argument to `Stage.save`. Set it to `False` if `no_c... — committed to iterative/dvc by daavoo 2 years ago
- repro: Disable `run-cache` if `--no-commit`. As described in https://github.com/iterative/dvc/issues/4428#issuecomment-778263005. Add `run_cache` argument to `Stage.save`. Set it to `False` if `no_c... — committed to iterative/dvc by daavoo a year ago
- stage: Skip `run-cache` if any out has `cache: false`. Closes https://github.com/iterative/dvc/issues/4428 — committed to iterative/dvc by daavoo a year ago
- stage: Skip `run-cache` if any out has `cache: false`. Closes https://github.com/iterative/dvc/issues/4428 — committed to iterative/dvc by daavoo a year ago
- repro: Disable `run-cache` if `--no-commit`. As described in https://github.com/iterative/dvc/issues/4428#issuecomment-778263005. Add `run_cache` argument to `Stage.save`. Set it to `False` if `no_c... — committed to iterative/dvc by daavoo a year ago
- stage: Skip `run-cache` if any out has `cache: false`. Closes https://github.com/iterative/dvc/issues/4428 — committed to iterative/dvc by daavoo a year ago
- stage: Skip `run-cache` if any out has `cache: false`. Closes https://github.com/iterative/dvc/issues/4428 — committed to iterative/dvc by daavoo a year ago
- stage: Skip `run-cache` if any out has `cache: false`. Closes https://github.com/iterative/dvc/issues/4428 — committed to iterative/dvc by daavoo a year ago
- stage: Skip `run-cache` if any out has `cache: false`. Closes https://github.com/iterative/dvc/issues/4428 — committed to iterative/dvc by daavoo a year ago
- stage: Skip `run-cache` if any out has `cache: false`. Closes https://github.com/iterative/dvc/issues/4428 — committed to iterative/dvc by daavoo a year ago
- stage: Skip `run-cache` if any out has `cache: false`. Closes https://github.com/iterative/dvc/issues/4428 — committed to iterative/dvc by daavoo a year ago
So to summarize all changes being proposed, the above comment explains the suggested behavior for the
cache
option indvc.yaml
.The other suggested changes are:
If
cache: false
is set because some data is sensitive, current behavior could be dangerous. If someone doesdvc push --run-cache
, for example, they will push thecache: false
data to the remote.Great discussion! Sorry for being so late in replying, but there’s a lot to sort through here.
Current state recap
Seems like the operations of interest here are:
And currently the way these are handled in dvc repro is:
dvc repro --no-run-cache
is executed, 4 is disabled.Let me know if any of that is incorrect.
Thoughts/suggestions
dvc repro options
There could be circumstances where a user wants to run a pipeline one time and only do one of restore/save. In those cases, my intuition would be:
dvc repro -f
should disable restoring the run or outputs.dvc repro --no-commit
should disable saving the run or outputs.Neither of those seem to make sense as a consistent setup for a pipeline (there’s no point in consistently performing only one of save/restore), so I think having them available only as
dvc repro
flags makes sense.dvc.yaml options
cache
field for outputs currently is designed with small metrics-like outputs in mind. It seems that the objective is not to avoid saving or restoring these outputs (since they are small and are currently being cached anyway), but instead to keep them tracked with git. What if there were agit
orgitignore
flag that determined whether those files are put into.gitignore
but not whether they are cached?run-cache: false
(or better named alternative) field indvc.yaml
makes sense to disable both saving and restoring runs.cache
just like the field for individual outputs? This would set the defaultcache
value for all outputs under that stage and control whether the runs get cached. Then, users could still opt in/out of caching individual outputs for that stage. If the stage iscache: true
and the output iscache: false
, dvc could warn that the stage will not be able to be restored (or even prompt users before proceeding like suggested above).Maybe a good name would be something more verbose? Since iirc the
run_cache
caches versions of the output without commiting them, a name likeoutput_cache
,cache_output
orcache_output_history
would be suitable?So if i understand correctly, in essence the
config.yaml
would gain therun_cache
in addition to the “normal”cache
flag, which would indicate whether to save the outputs to run cache even if they are not saved to normal cache? To me this would make sense and be fine, though I would almost suggest making the run cache opt in instead of opt out. So ifcache: False
thenrun_cache: False
would mimic this. For more granular control you could then specifyrun_cache: True
if you were to want the run cache. Or requiring the user to do more work, you could always require specifying bothcache
andrun_cache
in the config.yaml.Sorry if this comment is a bit confusing, just trying to throw some ideas around. Also when I first encountered the bug I was completely confused by the default behavior of dvc saving my 10GB tar file to cache even though I obviously specified
cache: False
to prevent this, which I think could happen to other new users as well.I think we should bump the priority on this issue. It seems like
no-cache
effectively does not exist since 1.0:Results in saving
hello.txt
to cache.Hi @digitalillusions ! Thank you for reporting this issue! This is a result of our run-cache feature, that started caching even unchaged outputs to be able to restore stages when needed. We definitely had small git-tracked files in mind, but clearly it is not the case here. We could consider somehow explicitly disabling the run-cache for the specified stages. The first idea that comes to my mind is to have some flag for
dvc run
that would add the flag to the generated stage spec and so dvc would know not to try to save this stage to the run-cache. We already havedvc run --no-run-cache
that tells dvc not to use existing run-cache, but it still tries to save the results of this run 🙁HI everybody, I just found out this behavior while trying to work on a pipeline that should handle sensitive data, and it has become a mild headache.
In my use case. I run a dvc pipeline where the first step that fetches data from a db, and the next steps run some analysis on the fetched data. The results of the analysis can be cached and then pushed to dvc remotes, but the source data should under no circumstance exit the local machine in which they have been queried. I thought that setting
cache: false
for the output of the fetching step was enough to prevent dvc from caching, but I keep finding them in.dvc/cache
after runningdvc repro
. This makes me very uneasy in runningdvc push
as I might unintentionally be contributing to a data leak. Do you happen to have a way to handle such scenario?PS: I thought to remove the queried data from the outputs of the fetching step, but that would make my pipeline a bunch of unconnected steps. Feasible but sub-optimal given dvc usefulness in handling data flows.
@dberenbaum
Yes, I noticed this while creating an example for #8582.
Unfortunately, we haven’t gotten to this one for a long time. I’m lowering the priority, and hopefully we will get back to this when we start refactoring pipeline/stage code.
Yes, the default would be to disable the run cache if
cache: false
for any output. I don’t know that a stage-levelrun-cache
is needed. Ifcache: true
for all outputs, I don’t think there’s any reason not to cache the run, and ifcache: false
for any output, I don’t think there’s any reason to cache the run.For anyone using
cache: false
on metrics-like outputs, they likely assume that those metrics outputs are not being cached by DVC, so I don’t think there’s much harm in disabling the run cache. It might be nice to give the option to have DVC cache those outputs and runs yet still have them tracked by git, but it seems like the consensus is that the added complexity of doing so may not be worthwhile.@dberenbaum I’m still here and paying close attention! I would be very happy with a solution which disables the
run-cache
ifcache: False
, since this addresses my original issue and is is my opinion the most intuitive default behavior. I guess the--no-commit
thing is a different issue?Especially when using large datasets which can be downloaded and dont want to be kept in dvc cache, the
-O
option should not suddenly start saving these files torun-cache
. Of course you could argue that these dont need to be part of pipeline inputs and outputs, but the dvc mechanism still works great for versioning these large datasets even if they are not present in the cache.Would the current consensus be to add a stage level
run-cache
option which saves all outputs torun-cache
ifrun-cache: True
even if there arecache: False
in thedvc.yaml
file? The default behavior would berun-cache: False
, as mentioned above, as soon ascache: False
for any output? With this the choice for run cache would be all outputs or no outputs. This would also be the behavior that makes the most sense, given that therun-cache
can be used to restore the entire state of a specific stage, instead of allowing the user to only save some outputs torun-cache
. Did I get this right?I like this.
I would think so, but that’s not how it behaves now. Try the example in https://github.com/iterative/dvc/issues/4428#issuecomment-741618372 and add
--no-commit
, and you will still seehello.txt
saved to the cache!You seem to be suggesting what @digitalillusions suggested (disable run-cache if any output has
cache: false
), which seems like a sensible default if wanting to keep the interface as simple as possible but provides a little less flexibility.For metrics, users may want to cache the results with dvc so that they are tied to a particular dvc run, but they may also want to keep them tracked with git so that it’s easy to see results without dvc (for example, browsing in Github). For use cases like the original one in this issue, users may want dvc to visualize their entire process without actually caching certain outputs, but they may also want to ignore those outputs in git since they would be way too big for git. These could both probably be accomplished by manually updating
.gitignore
, but that seems clunky.Seems like there are at least 2 viable options, both of which are likely preferable to the current state. @efiop Interested in your thoughts when you have a chance.
Had a very good discussion with @dberenbaum yesterday, and there are a few clear points:
the
run-cache
name adds to the confusion with our regular local cache. If there are any good alternative names, we’ll be happy to consider them. IIRC initially we called itstage cache
, but it was before we’ve acknowledged the “stage” terminology before 1.0 release.we def need stage-level flag in dvc.yaml to tell dvc not to save or restore from run-cache. E.g.
this way the changes will be permanent.
--force
in run/repro should probably imply--no-run-cache
--no-run-cache
is actually--no-restore-run-cache
and might need to be renamed because of stage-levelrun_cache: true/false
flag.Overall, we could definitely introduce
run_cache: true/false
for stages in dvc.yaml right now, even without corresponding CLI flag, so people could start using it. But in CLI, we could remove--no-run-cache
flag in both run/repro, make--force
in run/repro ignore run-cache and introduce--no-run-cache
todvc stage add
, that will setrun_cache: true/false
flag.CC @dberenbaum @jorgeorpinel ^
@jimmytudeski Unfortunately there is no LRU functionality in our cache yet, so we would need to figure it out first. A faster, and arguably more proper fix for this is to pass
run_cache
argument tostage.save
and ifnot run_cache
not callstage_cache.save
https://github.com/iterative/dvc/blob/d3acbfe2b1aae2d05cb9b8732167e27f7c5f740e/dvc/stage/__init__.py#L439 .Stage.run
already receives it https://github.com/iterative/dvc/blob/d3acbfe2b1aae2d05cb9b8732167e27f7c5f740e/dvc/stage/__init__.py#L488 , but only passes it torun_stage
, but it should also pass it tostage.save()
. That will makedvc run --no-run-cache
anddvc repro --no-run-cache
not save the run cache for particular stages. Though granularity still requiresrun_cache: true/false
per-output, as described in my other comment above 😦Would it be an option to provide keyword arguments? So instead of having
--no-run-cache
, you could have--run-cache (ignore | disable)
whereignore
would be the old--no-run-cache
anddisable
would be to completely disable the run cache for that specific stage.Thanks @efiop for the response! Ah thats right, I remember reading about this in the 1.0 update. I also never noticed before, because I set up my pipelines before 1.0 came out and only now made a clean clone of my repo somewhere else and used the repro functionality to reproduce the state.
I think that would be a good idea. Though to me it would still be somewhat confusing having to, e.g. set both
cache: false, run-cache: false
in mydvc.yaml
. Or would you specify therun-cache: false
for the entire pipeline stage? Or only specifyrun-cache: false
which would implycache: false
?To me it was also just super unintuitive and frustrating to see 12Gb of data being saved to
.dvc/cache
, when I obviously set thecache
flag to false.