dvc: Bug: `dvc add` fails with a modified file (or directory) at the end of a list of files
Using DVC 0.71.0 on RHEL6, installed via conda
, configured for system-wide hard links (not sure if it’s relevant):
$ dvc --version
0.71.0
$ cat ~/.config/dvc/config
[cache]
protected = true
type = hardlink
I have come across what appears to be a bug, where attempting to dvc add
a directory previously under DVC version control whose contents have changed, results in an error, but only when adding along with a list of files, and doesn’t occur if the command is repeated (i.e. after all the other files have been added). I have reproduced this in several project directories and via the following minimal example (courtesy of @mroutis):
dvc init --no-scm
mkdir data
echo "foo" > data/foo
dvc add data
echo "bar" > bar.txt
dvc unprotect data
echo "change" > data/foo
dvc add bar.txt data
This results in the following output:
WARNING: Output 'data' of 'data.dvc' changed because it is 'modified'
100%|██████████|Add 2.00/2.00 [00:02<00:00, 1.08s/file]
ERROR: file/directory 'data' is specified as an output in more than one stage: data.dvc
data.dvc
But if the command is re-run:
$ dvc add bar.txt data
Stage is cached, skipping.
100% Add 1.00/1.00 [00:01<00:00, 1.96s/file]
So it appears dvc
is somehow mishandling the list of files. Of course, the expected behavior is that it will add the directory successfully on the first try.
Thanks for any effort to track down the source of the bug.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 16 (10 by maintainers)
Commits related to this issue
- add: do not delete stage files before add Because of the way we collect stages and cache them, we were not able to collect them for the `add` without removing them from the workspace. As doing so, we... — committed to skshetry/dvc by skshetry 3 years ago
- add: do not delete stage files before add (#6239) * add: do not delete stage files before add Because of the way we collect stages and cache them, we were not able to collect them for the `add` w... — committed to iterative/dvc by skshetry 3 years ago
- add: do not delete stage files before add Because of the way we collect stages and cache them, we were not able to collect them for the `add` without removing them from the workspace. As doing so, we... — committed to skshetry/dvc by skshetry 3 years ago
@iterative/engineering it seems like an important bug. Should we make it p0?
can confirm that problem still exists on master:
Terrific. Thanks for the fix!
@pared @jaredsampson Released in 0.86.4, please upgrade and give it a try 🙂
@jaredsampson the fix is done and should be included in next release
Ok, so little investigation helped me to narrow down the issue. The bug shows up when we are re-
add
-ing existing DVC-tracked output, but only after callingdvc.stages
first. For example: Latest version i checked with: 0.86.1We will get the same OutputDuplicationError.
repo.stages
is cached property, so repo is storingfile.dvc
. The problem is that when we are re-add
-ing some file, the chain of calls isrepo.add
->repo._create_stages
->Stage.create
which detects thatfile.dvc
exists and removes it, so it can be created again.So now we don’t have
file.dvc
because it was removed byStage.create
, we havefile.dvc
in memory as “new” stage that we are creating right now, and becauserepo
’s stages were not invalidated, we also haverepo.stages
==[file.dvc
]. That is why we try to check modified graph here duringadd
we haveself.stages==[
file.dvc]
andnew_stages==[
file.dvc]
. While, actually self.stages should not return it (since it was removed byStage.create
.TLDR When re-
add
ing we remove stage file, but we don’t invalidate cachedrepo.stages
and end up withrepo._collect_graph(['file.dvc', 'file.dvc'])
which has to fail.Solution: invalidate
repo.stages
when removing the stage file.