dvc: Bug: `dvc add` fails with a modified file (or directory) at the end of a list of files

Using DVC 0.71.0 on RHEL6, installed via conda, configured for system-wide hard links (not sure if it’s relevant):

$ dvc --version
0.71.0

$ cat ~/.config/dvc/config
[cache]
protected = true
type = hardlink

I have come across what appears to be a bug, where attempting to dvc add a directory previously under DVC version control whose contents have changed, results in an error, but only when adding along with a list of files, and doesn’t occur if the command is repeated (i.e. after all the other files have been added). I have reproduced this in several project directories and via the following minimal example (courtesy of @mroutis):

dvc init --no-scm
mkdir data
echo "foo" > data/foo
dvc add data
echo "bar" > bar.txt
dvc unprotect data
echo "change" > data/foo
dvc add bar.txt data

This results in the following output:

WARNING: Output 'data' of 'data.dvc' changed because it is 'modified'
100%|██████████|Add                                       2.00/2.00 [00:02<00:00,  1.08s/file]
ERROR: file/directory 'data' is specified as an output in more than one stage: data.dvc
    data.dvc

But if the command is re-run:

$ dvc add bar.txt data
Stage is cached, skipping.
100% Add                                                  1.00/1.00 [00:01<00:00,  1.96s/file]

So it appears dvc is somehow mishandling the list of files. Of course, the expected behavior is that it will add the directory successfully on the first try.

Thanks for any effort to track down the source of the bug.

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 16 (10 by maintainers)

Commits related to this issue

Most upvoted comments

@iterative/engineering it seems like an important bug. Should we make it p0?

can confirm that problem still exists on master:

#!/bin/bash

rm -rf repo
mkdir repo

pushd repo
git init --quiet
dvc init -q

set -ex

mkdir data

echo foo>>data/foo
dvc add -q data

echo bar>>bar
echo change>>data/foo

dvc add bar data

Terrific. Thanks for the fix!

@pared @jaredsampson Released in 0.86.4, please upgrade and give it a try 🙂

@jaredsampson the fix is done and should be included in next release

Ok, so little investigation helped me to narrow down the issue. The bug shows up when we are re-add-ing existing DVC-tracked output, but only after calling dvc.stages first. For example: Latest version i checked with: 0.86.1

def test_re_add(tmp_dir, scm, dvc):
    tmp_dir.dvc_gen( {"file": "file content"})

    tmp_dir.gen(
            {"file": "modified content"}
    )
    dvc.stages
    dvc.add(["file"])

We will get the same OutputDuplicationError.

  1. Why this is happening? repo.stages is cached property, so repo is storing file.dvc. The problem is that when we are re-add-ing some file, the chain of calls is repo.add -> repo._create_stages -> Stage.create which detects that file.dvc exists and removes it, so it can be created again.

So now we don’t have file.dvc because it was removed by Stage.create, we have file.dvc in memory as “new” stage that we are creating right now, and because repo’s stages were not invalidated, we also have repo.stages==[file.dvc]. That is why we try to check modified graph here during add we have self.stages==[file.dvc] and new_stages==[file.dvc]. While, actually self.stages should not return it (since it was removed by Stage.create.

TLDR When re-adding we remove stage file, but we don’t invalidate cached repo.stages and end up with repo._collect_graph(['file.dvc', 'file.dvc']) which has to fail.

Solution: invalidate repo.stages when removing the stage file.