dvc: DVC Repro incorrectly saving directory to cache
Bug Report
UPDATE: Skip to https://github.com/iterative/dvc/issues/4428#issuecomment-778259232
As the title states. I run dvc repro extract for my extract pipeline stage. This takes two pretty large zip files and extracts them into specified folders. These folders should not be added to the dvc cache, since they can be easily reproduced by extracting the archives, but I declare them as dependencies so that the DAG looks nicer.
My dvc.yaml looks like this. The preprocess stage should only indicate, that the not cached folders should be used as dependencies in later stages.
stages:
extract:
cmd: tar -xzvf data/thingy10k/10k_tetmesh.tar.gz -C data/thingy10k/ && tar -xzvf data/thingy10k/10k_surface.tar.gz -C data/thingy10k/
deps:
- data/thingy10k/10k_surface.tar.gz
- data/thingy10k/10k_tetmesh.tar.gz
outs:
- data/thingy10k/10k_surface:
cache: false
- data/thingy10k/10k_tetmesh:
cache: false
preprocess:
cmd: some preprocessing command
deps:
- data/thingy10k/10k_tetmesh
- data/thingy10k/10k_surface
After running dvc repro extract, the hashes of all the files are computed and then the files saved to cache. This is exactly the thing that I was trying to prevent with the cache: false option.
I confirmed that the contents of the output folders were indeed added to the cache by using du -sh .dvc/cache, which went up by exactly the size of the two folders after running the command.
Interestingly, after running dvc gc -a the cache is freed again. Also running dvc push (without first running dvc gc -a) to push the cache to my remote storage also says everything is up to date, which leads me to believe that dvc recognizes that these directories should in fact not be cached.
I have reproduced this in a local git and dvc repo by first adding the two archives using dvc add and then running the above mentioned extract stage. The archives can be downloaded from this repo https://github.com/Yixin-Hu/TetWild#dataset under Output.
Please provide information about your setup
Output of dvc version:
$ dvc version
DVC version: 1.6.0 (pip)
---------------------------------
Platform: Python 3.7.6 on Linux-5.4.0-42-generic-x86_64-with-debian-bullseye-sid
Supports: http, https, ssh, webdav, webdavs
Cache types: hardlink, symlink
Repo: dvc, git
Additional Information (if any):
If applicable, please also provide a --verbose output of the command, eg: dvc add --verbose.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 3
- Comments: 33 (22 by maintainers)
Commits related to this issue
- repro: Disable `run-cache` if `--no-commit`. As described in https://github.com/iterative/dvc/issues/4428#issuecomment-778263005. Add `run_cache` argument to `Stage.save`. Set it to `False` if `no_c... — committed to iterative/dvc by daavoo 2 years ago
- stage: Skip `run-cache` if any out has `cache: false`. As described in https://github.com/iterative/dvc/issues/4428#issuecomment-778259232 — committed to iterative/dvc by daavoo 2 years ago
- stage: Skip `run-cache` if any out has `cache: false`. As described in https://github.com/iterative/dvc/issues/4428#issuecomment-778259232 — committed to iterative/dvc by daavoo 2 years ago
- stage: Skip `run-cache` if any out has `cache: false`. As described in https://github.com/iterative/dvc/issues/4428#issuecomment-778259232 — committed to iterative/dvc by daavoo 2 years ago
- stage: Skip `run-cache` if any out has `cache: false`. As described in https://github.com/iterative/dvc/issues/4428#issuecomment-778259232 — committed to iterative/dvc by daavoo 2 years ago
- stage: Skip `run-cache` if any out has `cache: false`. As described in https://github.com/iterative/dvc/issues/4428#issuecomment-778259232 — committed to iterative/dvc by daavoo 2 years ago
- repro: Disable `run-cache` if `--no-commit`. As described in https://github.com/iterative/dvc/issues/4428#issuecomment-778263005. Add `run_cache` argument to `Stage.save`. Set it to `False` if `no_c... — committed to iterative/dvc by daavoo 2 years ago
- repro: Disable `run-cache` if `--no-commit`. As described in https://github.com/iterative/dvc/issues/4428#issuecomment-778263005. Add `run_cache` argument to `Stage.save`. Set it to `False` if `no_c... — committed to iterative/dvc by daavoo 2 years ago
- repro: Disable `run-cache` if `--no-commit`. As described in https://github.com/iterative/dvc/issues/4428#issuecomment-778263005. Add `run_cache` argument to `Stage.save`. Set it to `False` if `no_c... — committed to iterative/dvc by daavoo 2 years ago
- repro: Disable `run-cache` if `--no-commit`. As described in https://github.com/iterative/dvc/issues/4428#issuecomment-778263005. Add `run_cache` argument to `Stage.save`. Set it to `False` if `no_c... — committed to iterative/dvc by daavoo a year ago
- stage: Skip `run-cache` if any out has `cache: false`. Closes https://github.com/iterative/dvc/issues/4428 — committed to iterative/dvc by daavoo a year ago
- stage: Skip `run-cache` if any out has `cache: false`. Closes https://github.com/iterative/dvc/issues/4428 — committed to iterative/dvc by daavoo a year ago
- repro: Disable `run-cache` if `--no-commit`. As described in https://github.com/iterative/dvc/issues/4428#issuecomment-778263005. Add `run_cache` argument to `Stage.save`. Set it to `False` if `no_c... — committed to iterative/dvc by daavoo a year ago
- stage: Skip `run-cache` if any out has `cache: false`. Closes https://github.com/iterative/dvc/issues/4428 — committed to iterative/dvc by daavoo a year ago
- stage: Skip `run-cache` if any out has `cache: false`. Closes https://github.com/iterative/dvc/issues/4428 — committed to iterative/dvc by daavoo a year ago
- stage: Skip `run-cache` if any out has `cache: false`. Closes https://github.com/iterative/dvc/issues/4428 — committed to iterative/dvc by daavoo a year ago
- stage: Skip `run-cache` if any out has `cache: false`. Closes https://github.com/iterative/dvc/issues/4428 — committed to iterative/dvc by daavoo a year ago
- stage: Skip `run-cache` if any out has `cache: false`. Closes https://github.com/iterative/dvc/issues/4428 — committed to iterative/dvc by daavoo a year ago
- stage: Skip `run-cache` if any out has `cache: false`. Closes https://github.com/iterative/dvc/issues/4428 — committed to iterative/dvc by daavoo a year ago
So to summarize all changes being proposed, the above comment explains the suggested behavior for the
cacheoption indvc.yaml.The other suggested changes are:
If
cache: falseis set because some data is sensitive, current behavior could be dangerous. If someone doesdvc push --run-cache, for example, they will push thecache: falsedata to the remote.Great discussion! Sorry for being so late in replying, but there’s a lot to sort through here.
Current state recap
Seems like the operations of interest here are:
And currently the way these are handled in dvc repro is:
dvc repro --no-run-cacheis executed, 4 is disabled.Let me know if any of that is incorrect.
Thoughts/suggestions
dvc repro options
There could be circumstances where a user wants to run a pipeline one time and only do one of restore/save. In those cases, my intuition would be:
dvc repro -fshould disable restoring the run or outputs.dvc repro --no-commitshould disable saving the run or outputs.Neither of those seem to make sense as a consistent setup for a pipeline (there’s no point in consistently performing only one of save/restore), so I think having them available only as
dvc reproflags makes sense.dvc.yaml options
cachefield for outputs currently is designed with small metrics-like outputs in mind. It seems that the objective is not to avoid saving or restoring these outputs (since they are small and are currently being cached anyway), but instead to keep them tracked with git. What if there were agitorgitignoreflag that determined whether those files are put into.gitignorebut not whether they are cached?run-cache: false(or better named alternative) field indvc.yamlmakes sense to disable both saving and restoring runs.cachejust like the field for individual outputs? This would set the defaultcachevalue for all outputs under that stage and control whether the runs get cached. Then, users could still opt in/out of caching individual outputs for that stage. If the stage iscache: trueand the output iscache: false, dvc could warn that the stage will not be able to be restored (or even prompt users before proceeding like suggested above).Maybe a good name would be something more verbose? Since iirc the
run_cachecaches versions of the output without commiting them, a name likeoutput_cache,cache_outputorcache_output_historywould be suitable?So if i understand correctly, in essence the
config.yamlwould gain therun_cachein addition to the “normal”cacheflag, which would indicate whether to save the outputs to run cache even if they are not saved to normal cache? To me this would make sense and be fine, though I would almost suggest making the run cache opt in instead of opt out. So ifcache: Falsethenrun_cache: Falsewould mimic this. For more granular control you could then specifyrun_cache: Trueif you were to want the run cache. Or requiring the user to do more work, you could always require specifying bothcacheandrun_cachein the config.yaml.Sorry if this comment is a bit confusing, just trying to throw some ideas around. Also when I first encountered the bug I was completely confused by the default behavior of dvc saving my 10GB tar file to cache even though I obviously specified
cache: Falseto prevent this, which I think could happen to other new users as well.I think we should bump the priority on this issue. It seems like
no-cacheeffectively does not exist since 1.0:Results in saving
hello.txtto cache.Hi @digitalillusions ! Thank you for reporting this issue! This is a result of our run-cache feature, that started caching even unchaged outputs to be able to restore stages when needed. We definitely had small git-tracked files in mind, but clearly it is not the case here. We could consider somehow explicitly disabling the run-cache for the specified stages. The first idea that comes to my mind is to have some flag for
dvc runthat would add the flag to the generated stage spec and so dvc would know not to try to save this stage to the run-cache. We already havedvc run --no-run-cachethat tells dvc not to use existing run-cache, but it still tries to save the results of this run 🙁HI everybody, I just found out this behavior while trying to work on a pipeline that should handle sensitive data, and it has become a mild headache.
In my use case. I run a dvc pipeline where the first step that fetches data from a db, and the next steps run some analysis on the fetched data. The results of the analysis can be cached and then pushed to dvc remotes, but the source data should under no circumstance exit the local machine in which they have been queried. I thought that setting
cache: falsefor the output of the fetching step was enough to prevent dvc from caching, but I keep finding them in.dvc/cacheafter runningdvc repro. This makes me very uneasy in runningdvc pushas I might unintentionally be contributing to a data leak. Do you happen to have a way to handle such scenario?PS: I thought to remove the queried data from the outputs of the fetching step, but that would make my pipeline a bunch of unconnected steps. Feasible but sub-optimal given dvc usefulness in handling data flows.
@dberenbaum
Yes, I noticed this while creating an example for #8582.
Unfortunately, we haven’t gotten to this one for a long time. I’m lowering the priority, and hopefully we will get back to this when we start refactoring pipeline/stage code.
Yes, the default would be to disable the run cache if
cache: falsefor any output. I don’t know that a stage-levelrun-cacheis needed. Ifcache: truefor all outputs, I don’t think there’s any reason not to cache the run, and ifcache: falsefor any output, I don’t think there’s any reason to cache the run.For anyone using
cache: falseon metrics-like outputs, they likely assume that those metrics outputs are not being cached by DVC, so I don’t think there’s much harm in disabling the run cache. It might be nice to give the option to have DVC cache those outputs and runs yet still have them tracked by git, but it seems like the consensus is that the added complexity of doing so may not be worthwhile.@dberenbaum I’m still here and paying close attention! I would be very happy with a solution which disables the
run-cacheifcache: False, since this addresses my original issue and is is my opinion the most intuitive default behavior. I guess the--no-committhing is a different issue?Especially when using large datasets which can be downloaded and dont want to be kept in dvc cache, the
-Ooption should not suddenly start saving these files torun-cache. Of course you could argue that these dont need to be part of pipeline inputs and outputs, but the dvc mechanism still works great for versioning these large datasets even if they are not present in the cache.Would the current consensus be to add a stage level
run-cacheoption which saves all outputs torun-cacheifrun-cache: Trueeven if there arecache: Falsein thedvc.yamlfile? The default behavior would berun-cache: False, as mentioned above, as soon ascache: Falsefor any output? With this the choice for run cache would be all outputs or no outputs. This would also be the behavior that makes the most sense, given that therun-cachecan be used to restore the entire state of a specific stage, instead of allowing the user to only save some outputs torun-cache. Did I get this right?I like this.
I would think so, but that’s not how it behaves now. Try the example in https://github.com/iterative/dvc/issues/4428#issuecomment-741618372 and add
--no-commit, and you will still seehello.txtsaved to the cache!You seem to be suggesting what @digitalillusions suggested (disable run-cache if any output has
cache: false), which seems like a sensible default if wanting to keep the interface as simple as possible but provides a little less flexibility.For metrics, users may want to cache the results with dvc so that they are tied to a particular dvc run, but they may also want to keep them tracked with git so that it’s easy to see results without dvc (for example, browsing in Github). For use cases like the original one in this issue, users may want dvc to visualize their entire process without actually caching certain outputs, but they may also want to ignore those outputs in git since they would be way too big for git. These could both probably be accomplished by manually updating
.gitignore, but that seems clunky.Seems like there are at least 2 viable options, both of which are likely preferable to the current state. @efiop Interested in your thoughts when you have a chance.
Had a very good discussion with @dberenbaum yesterday, and there are a few clear points:
the
run-cachename adds to the confusion with our regular local cache. If there are any good alternative names, we’ll be happy to consider them. IIRC initially we called itstage cache, but it was before we’ve acknowledged the “stage” terminology before 1.0 release.we def need stage-level flag in dvc.yaml to tell dvc not to save or restore from run-cache. E.g.
this way the changes will be permanent.
--forcein run/repro should probably imply--no-run-cache--no-run-cacheis actually--no-restore-run-cacheand might need to be renamed because of stage-levelrun_cache: true/falseflag.Overall, we could definitely introduce
run_cache: true/falsefor stages in dvc.yaml right now, even without corresponding CLI flag, so people could start using it. But in CLI, we could remove--no-run-cacheflag in both run/repro, make--forcein run/repro ignore run-cache and introduce--no-run-cachetodvc stage add, that will setrun_cache: true/falseflag.CC @dberenbaum @jorgeorpinel ^
@jimmytudeski Unfortunately there is no LRU functionality in our cache yet, so we would need to figure it out first. A faster, and arguably more proper fix for this is to pass
run_cacheargument tostage.saveand ifnot run_cachenot callstage_cache.savehttps://github.com/iterative/dvc/blob/d3acbfe2b1aae2d05cb9b8732167e27f7c5f740e/dvc/stage/__init__.py#L439 .Stage.runalready receives it https://github.com/iterative/dvc/blob/d3acbfe2b1aae2d05cb9b8732167e27f7c5f740e/dvc/stage/__init__.py#L488 , but only passes it torun_stage, but it should also pass it tostage.save(). That will makedvc run --no-run-cacheanddvc repro --no-run-cachenot save the run cache for particular stages. Though granularity still requiresrun_cache: true/falseper-output, as described in my other comment above 😦Would it be an option to provide keyword arguments? So instead of having
--no-run-cache, you could have--run-cache (ignore | disable)whereignorewould be the old--no-run-cacheanddisablewould be to completely disable the run cache for that specific stage.Thanks @efiop for the response! Ah thats right, I remember reading about this in the 1.0 update. I also never noticed before, because I set up my pipelines before 1.0 came out and only now made a clean clone of my repo somewhere else and used the repro functionality to reproduce the state.
I think that would be a good idea. Though to me it would still be somewhat confusing having to, e.g. set both
cache: false, run-cache: falsein mydvc.yaml. Or would you specify therun-cache: falsefor the entire pipeline stage? Or only specifyrun-cache: falsewhich would implycache: false?To me it was also just super unintuitive and frustrating to see 12Gb of data being saved to
.dvc/cache, when I obviously set thecacheflag to false.