dvc: DVC Repro incorrectly saving directory to cache

Bug Report

UPDATE: Skip to https://github.com/iterative/dvc/issues/4428#issuecomment-778259232

As the title states. I run dvc repro extract for my extract pipeline stage. This takes two pretty large zip files and extracts them into specified folders. These folders should not be added to the dvc cache, since they can be easily reproduced by extracting the archives, but I declare them as dependencies so that the DAG looks nicer.

My dvc.yaml looks like this. The preprocess stage should only indicate, that the not cached folders should be used as dependencies in later stages.

stages:
  extract:
    cmd: tar -xzvf data/thingy10k/10k_tetmesh.tar.gz -C data/thingy10k/ && tar -xzvf data/thingy10k/10k_surface.tar.gz -C data/thingy10k/
    deps:
    - data/thingy10k/10k_surface.tar.gz
    - data/thingy10k/10k_tetmesh.tar.gz
    outs:
    - data/thingy10k/10k_surface:
        cache: false
    - data/thingy10k/10k_tetmesh:
        cache: false
  preprocess:
    cmd: some preprocessing command
    deps:
    - data/thingy10k/10k_tetmesh
    - data/thingy10k/10k_surface

After running dvc repro extract, the hashes of all the files are computed and then the files saved to cache. This is exactly the thing that I was trying to prevent with the cache: false option.

I confirmed that the contents of the output folders were indeed added to the cache by using du -sh .dvc/cache, which went up by exactly the size of the two folders after running the command.

Interestingly, after running dvc gc -a the cache is freed again. Also running dvc push (without first running dvc gc -a) to push the cache to my remote storage also says everything is up to date, which leads me to believe that dvc recognizes that these directories should in fact not be cached.

I have reproduced this in a local git and dvc repo by first adding the two archives using dvc add and then running the above mentioned extract stage. The archives can be downloaded from this repo https://github.com/Yixin-Hu/TetWild#dataset under Output.

Please provide information about your setup

Output of dvc version:

$ dvc version
DVC version: 1.6.0 (pip)
---------------------------------
Platform: Python 3.7.6 on Linux-5.4.0-42-generic-x86_64-with-debian-bullseye-sid
Supports: http, https, ssh, webdav, webdavs
Cache types: hardlink, symlink
Repo: dvc, git

Additional Information (if any):

If applicable, please also provide a --verbose output of the command, eg: dvc add --verbose.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments: 33 (22 by maintainers)

Commits related to this issue

Most upvoted comments

So to summarize all changes being proposed, the above comment explains the suggested behavior for the cache option in dvc.yaml.

The other suggested changes are:

  • dvc repro -f should disable restoring the run or outputs.

  • dvc repro --no-commit should disable saving the run or outputs.

If cache: false is set because some data is sensitive, current behavior could be dangerous. If someone does dvc push --run-cache, for example, they will push the cache: false data to the remote.

Great discussion! Sorry for being so late in replying, but there’s a lot to sort through here.

Current state recap

Seems like the operations of interest here are:

  1. Saving outputs to cache
  2. Restoring outputs from cache
  3. Saving run info to cache
  4. Restoring run info from cache

And currently the way these are handled in dvc repro is:

  1. By default, all 4 operations are enabled on repro.
  2. If dvc repro --no-run-cache is executed, 4 is disabled.
  3. There is no way to disable 1, 2, or 3.

Let me know if any of that is incorrect.

Thoughts/suggestions

dvc repro options

There could be circumstances where a user wants to run a pipeline one time and only do one of restore/save. In those cases, my intuition would be:

  1. dvc repro -f should disable restoring the run or outputs.
  2. dvc repro --no-commit should disable saving the run or outputs.

Neither of those seem to make sense as a consistent setup for a pipeline (there’s no point in consistently performing only one of save/restore), so I think having them available only as dvc repro flags makes sense.

dvc.yaml options

  1. The cache field for outputs currently is designed with small metrics-like outputs in mind. It seems that the objective is not to avoid saving or restoring these outputs (since they are small and are currently being cached anyway), but instead to keep them tracked with git. What if there were a git or gitignore flag that determined whether those files are put into .gitignore but not whether they are cached?
  2. For a use case like a non-deterministic stage, users may want to consistently not save/restore runs, so a stage-level field like run-cache: false (or better named alternative) field in dvc.yaml makes sense to disable both saving and restoring runs.
  3. What if the stage-level field was cache just like the field for individual outputs? This would set the default cache value for all outputs under that stage and control whether the runs get cached. Then, users could still opt in/out of caching individual outputs for that stage. If the stage is cache: true and the output is cache: false, dvc could warn that the stage will not be able to be restored (or even prompt users before proceeding like suggested above).

the run-cache name adds to the confusion with our regular local cache. If there are any good alternative names, we’ll be happy to consider them. IIRC initially we called it stage cache, but it was before we’ve acknowledged the “stage” terminology before 1.0 release.

Maybe a good name would be something more verbose? Since iirc the run_cache caches versions of the output without commiting them, a name like output_cache, cache_output or cache_output_history would be suitable?

Overall, we could definitely introduce run_cache: true/false for stages in dvc.yaml right now, even without corresponding CLI flag, so people could start using it. But in CLI, we could remove --no-run-cache flag in both run/repro, make --force in run/repro ignore run-cache and introduce --no-run-cache to dvc stage add, that will set run_cache: true/false flag.

So if i understand correctly, in essence the config.yaml would gain the run_cache in addition to the “normal” cache flag, which would indicate whether to save the outputs to run cache even if they are not saved to normal cache? To me this would make sense and be fine, though I would almost suggest making the run cache opt in instead of opt out. So if cache: False then run_cache: False would mimic this. For more granular control you could then specify run_cache: True if you were to want the run cache. Or requiring the user to do more work, you could always require specifying both cache and run_cache in the config.yaml.

Sorry if this comment is a bit confusing, just trying to throw some ideas around. Also when I first encountered the bug I was completely confused by the default behavior of dvc saving my 10GB tar file to cache even though I obviously specified cache: False to prevent this, which I think could happen to other new users as well.

I think we should bump the priority on this issue. It seems like no-cache effectively does not exist since 1.0:

#!/bin/bash

rm -rf repo
mkdir repo
pushd repo

git init --quiet
dvc init --quiet

echo hello >> hello
dvc add hello
git add -A
git commit -am "init"

dvc run -n run -d hello -O hello.txt --no-run-cache "cat hello >> hello.txt && echo 'modification' >> hello.txt"

Results in saving hello.txt to cache.

Hi @digitalillusions ! Thank you for reporting this issue! This is a result of our run-cache feature, that started caching even unchaged outputs to be able to restore stages when needed. We definitely had small git-tracked files in mind, but clearly it is not the case here. We could consider somehow explicitly disabling the run-cache for the specified stages. The first idea that comes to my mind is to have some flag for dvc run that would add the flag to the generated stage spec and so dvc would know not to try to save this stage to the run-cache. We already have dvc run --no-run-cache that tells dvc not to use existing run-cache, but it still tries to save the results of this run 🙁

HI everybody, I just found out this behavior while trying to work on a pipeline that should handle sensitive data, and it has become a mild headache.

In my use case. I run a dvc pipeline where the first step that fetches data from a db, and the next steps run some analysis on the fetched data. The results of the analysis can be cached and then pushed to dvc remotes, but the source data should under no circumstance exit the local machine in which they have been queried. I thought that setting cache: false for the output of the fetching step was enough to prevent dvc from caching, but I keep finding them in .dvc/cache after running dvc repro. This makes me very uneasy in running dvc push as I might unintentionally be contributing to a data leak. Do you happen to have a way to handle such scenario?

PS: I thought to remove the queried data from the outputs of the fetching step, but that would make my pipeline a bunch of unconnected steps. Feasible but sub-optimal given dvc usefulness in handling data flows.

@dberenbaum

@pared For the first stage above, is the run cache not triggered because there are no dependencies?

Yes, I noticed this while creating an example for #8582.

Unfortunately, we haven’t gotten to this one for a long time. I’m lowering the priority, and hopefully we will get back to this when we start refactoring pipeline/stage code.

Yes, the default would be to disable the run cache if cache: false for any output. I don’t know that a stage-level run-cache is needed. If cache: true for all outputs, I don’t think there’s any reason not to cache the run, and if cache: false for any output, I don’t think there’s any reason to cache the run.

For anyone using cache: false on metrics-like outputs, they likely assume that those metrics outputs are not being cached by DVC, so I don’t think there’s much harm in disabling the run cache. It might be nice to give the option to have DVC cache those outputs and runs yet still have them tracked by git, but it seems like the consensus is that the added complexity of doing so may not be worthwhile.

@dberenbaum I’m still here and paying close attention! I would be very happy with a solution which disables the run-cache if cache: False, since this addresses my original issue and is is my opinion the most intuitive default behavior. I guess the --no-commit thing is a different issue?

Especially when using large datasets which can be downloaded and dont want to be kept in dvc cache, the -O option should not suddenly start saving these files to run-cache. Of course you could argue that these dont need to be part of pipeline inputs and outputs, but the dvc mechanism still works great for versioning these large datasets even if they are not present in the cache.

Would the current consensus be to add a stage level run-cache option which saves all outputs to run-cache if run-cache: True even if there are cache: False in the dvc.yaml file? The default behavior would be run-cache: False, as mentioned above, as soon as cache: False for any output? With this the choice for run cache would be all outputs or no outputs. This would also be the behavior that makes the most sense, given that the run-cache can be used to restore the entire state of a specific stage, instead of allowing the user to only save some outputs to run-cache. Did I get this right?

I’ve been explaining it as “log of stage runs”. For ex. see https://dvc.org/doc/user-guide/project-structure/internal-files#run-cache. It seems like more of a registry or log to me (part of the project cache, but not a cache on it’s own).

I like this.

@dberenbaum there’s --no-commit for 1 in (as you later mention).

I would think so, but that’s not how it behaves now. Try the example in https://github.com/iterative/dvc/issues/4428#issuecomment-741618372 and add --no-commit, and you will still see hello.txt saved to the cache!

  • I think cache: false in dvc.yaml has a more general purpose but I’m not sure what the original intention was. Interesting idea about a git: false field, although I incline towards keeping the schema simpler if possible. But maybe rename cache to commit (as in run --no-commit) — somewhat unrelated here.

  • Good point. I still think cache/commit :false is enough, but also see no problem in adding it to the stage level (3. whatever name is used).

You seem to be suggesting what @digitalillusions suggested (disable run-cache if any output has cache: false), which seems like a sensible default if wanting to keep the interface as simple as possible but provides a little less flexibility.

For metrics, users may want to cache the results with dvc so that they are tied to a particular dvc run, but they may also want to keep them tracked with git so that it’s easy to see results without dvc (for example, browsing in Github). For use cases like the original one in this issue, users may want dvc to visualize their entire process without actually caching certain outputs, but they may also want to ignore those outputs in git since they would be way too big for git. These could both probably be accomplished by manually updating .gitignore, but that seems clunky.

Seems like there are at least 2 viable options, both of which are likely preferable to the current state. @efiop Interested in your thoughts when you have a chance.

Had a very good discussion with @dberenbaum yesterday, and there are a few clear points:

  1. the run-cache name adds to the confusion with our regular local cache. If there are any good alternative names, we’ll be happy to consider them. IIRC initially we called it stage cache, but it was before we’ve acknowledged the “stage” terminology before 1.0 release.

  2. we def need stage-level flag in dvc.yaml to tell dvc not to save or restore from run-cache. E.g.

stages:
  ...
  - mystage:
      cmd: mycmd
      run_cache: false

this way the changes will be permanent.

  1. the cli flags could be improved. E.g.
    • --force in run/repro should probably imply --no-run-cache
    • current --no-run-cache is actually --no-restore-run-cache and might need to be renamed because of stage-level run_cache: true/false flag.

Overall, we could definitely introduce run_cache: true/false for stages in dvc.yaml right now, even without corresponding CLI flag, so people could start using it. But in CLI, we could remove --no-run-cache flag in both run/repro, make --force in run/repro ignore run-cache and introduce --no-run-cache to dvc stage add, that will set run_cache: true/false flag.

CC @dberenbaum @jorgeorpinel ^

@jimmytudeski Unfortunately there is no LRU functionality in our cache yet, so we would need to figure it out first. A faster, and arguably more proper fix for this is to pass run_cache argument to stage.save and if not run_cache not call stage_cache.save https://github.com/iterative/dvc/blob/d3acbfe2b1aae2d05cb9b8732167e27f7c5f740e/dvc/stage/__init__.py#L439 . Stage.run already receives it https://github.com/iterative/dvc/blob/d3acbfe2b1aae2d05cb9b8732167e27f7c5f740e/dvc/stage/__init__.py#L488 , but only passes it to run_stage, but it should also pass it to stage.save(). That will make dvc run --no-run-cache and dvc repro --no-run-cache not save the run cache for particular stages. Though granularity still requires run_cache: true/false per-output, as described in my other comment above 😦

Would it be an option to provide keyword arguments? So instead of having --no-run-cache, you could have --run-cache (ignore | disable) where ignore would be the old --no-run-cache and disable would be to completely disable the run cache for that specific stage.

Thanks @efiop for the response! Ah thats right, I remember reading about this in the 1.0 update. I also never noticed before, because I set up my pipelines before 1.0 came out and only now made a clean clone of my repo somewhere else and used the repro functionality to reproduce the state.

We could consider somehow explicitly disabling the run-cache for the specified stages. The first idea that comes to my mind is to have some flag for dvc run that would add the flag to the generated stage spec and so dvc would know not to try to save this stage to the run-cache.

I think that would be a good idea. Though to me it would still be somewhat confusing having to, e.g. set both cache: false, run-cache: false in my dvc.yaml. Or would you specify the run-cache: false for the entire pipeline stage? Or only specify run-cache: false which would imply cache: false?

To me it was also just super unintuitive and frustrating to see 12Gb of data being saved to .dvc/cache, when I obviously set the cache flag to false.