dvc: dvc and git do not behave the same with "!" and "**"

Consider the following project structure:

  • data
    • data1
      • file1
      • file1.dvc
    • data2
      • file2
      • file2.dvc
  • .gitignore

.gitignore is as follows:

data/**
!data/*/
!*.dvc

git status gives: image

while dvc push gives: image

I expect git and dvc to behave the same with respect to .gitignore.

  • dvc: 2.8.3
  • python: 3.7
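For anyone who wants to reproduce this, here is a minimal sketch that recreates the layout and the .gitignore above in a temporary directory. The helper name make_project is made up for illustration, and the .dvc files it writes are empty placeholders rather than real output of dvc add.

```python
import tempfile
from pathlib import Path

# The three rules from the .gitignore in this report.
GITIGNORE = "data/**\n!data/*/\n!*.dvc\n"


def make_project(root: Path) -> Path:
    """Create data/data1 and data/data2 with one file and one .dvc stub each."""
    for sub, name in (("data1", "file1"), ("data2", "file2")):
        directory = root / "data" / sub
        directory.mkdir(parents=True, exist_ok=True)
        (directory / name).write_text("payload\n")
        # Placeholder only: a real .dvc file would be produced by `dvc add`.
        (directory / f"{name}.dvc").write_text("# placeholder\n")
    (root / ".gitignore").write_text(GITIGNORE)
    return root


if __name__ == "__main__":
    project = make_project(Path(tempfile.mkdtemp()))
    for entry in sorted(project.rglob("*")):
        print(entry.relative_to(project))
```

After running it, git init followed by git status in the printed directory should show how git treats the rules. Comparing with dvc additionally needs dvc init and real tracked files.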

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 18 (8 by maintainers)

Most upvoted comments

@pmrowla Do we need to open a dulwich issue for this?

Sorry for the late reply.

In our test case, if we re-include the ‘data1’ directory with !data/*/, dvc ignores the .dvc files inside data1. I added some debug code to dvc and tried two examples. With dvc push:

$ dvc push
/Users/gao/Code/test/ignore/.dvc/config.local ignore status is True
/Users/gao/Code/test/ignore/.dvc/tmp ignore status is True
/Users/gao/Code/test/ignore/.dvc/cache ignore status is True
/Users/gao/Code/test/ignore/data/ ignore status is True
Everything is up to date.

While with dvc add data/data2/b:

$ dvc add data/data2/b
/Users/gao/Code/test/ignore/.dvc/config.local ignore status is True
/Users/gao/Code/test/ignore/.dvc/tmp ignore status is True
/Users/gao/Code/test/ignore/.dvc/cache ignore status is True
/Users/gao/Code/test/ignore/data/data2/b.dvc ignore status is False
/Users/gao/Code/test/ignore/data/ ignore status is True
Adding...
/Users/gao/Code/test/ignore/data/data2/b ignore status is True
/Users/gao/Code/test/ignore/data/data2/b.dvc ignore status is False
100% Adding...|████████████████████████████████|1/1 [00:00, 31.69file/s]

To track the changes with git, run:

	git add data/data2/b.dvc

To enable auto staging, run:

	dvc config core.autostage true

And if we change !data/*/ to !data/**/:

$ dvc push
/Users/gao/Code/test/ignore/.dvc/config.local ignore status is True
/Users/gao/Code/test/ignore/.dvc/tmp ignore status is True
/Users/gao/Code/test/ignore/.dvc/cache ignore status is True
/Users/gao/Code/test/ignore/data/ ignore status is False
/Users/gao/Code/test/ignore/data/data1/ ignore status is False
/Users/gao/Code/test/ignore/data/data2/ ignore status is False
/Users/gao/Code/test/ignore/data/data1/c.dvc ignore status is False
/Users/gao/Code/test/ignore/data/data1/c.dvc ignore status is False
/Users/gao/Code/test/ignore/data/data1/c.dvc ignore status is False
/Users/gao/Code/test/ignore/data/data1/a.dvc ignore status is False
/Users/gao/Code/test/ignore/data/data1/a.dvc ignore status is False
/Users/gao/Code/test/ignore/data/data1/a.dvc ignore status is False
/Users/gao/Code/test/ignore/data/data2/b.dvc ignore status is False
/Users/gao/Code/test/ignore/data/data2/b.dvc ignore status is False
/Users/gao/Code/test/ignore/data/data2/b.dvc ignore status is False
3 files pushed

So I guess there are two problems:

  1. Our backend, Dulwich, gives different results for the two patterns:
# with `!data/*/`
$ dulwich check-ignore data/
data/
# with `!data/**/`
$ dulwich check-ignore data/

While for Git:

# with `!data/*/`
$ git check-ignore data/
$
# with `!data/**/`
$ git check-ignore data/
$

They give the same result.

  2. DVC uses different logic in different commands (add works properly, while push and commit do not).

As for the logic of gitignore, I think the following from the thread is quite clear:

  • Git opens and reads the working tree directory. For each file or directory that is actually present here, Git checks it against the ignore rules. Some rules match only directories and others match both directories and files. Some rules say “do ignore” and some say “do not ignore”.

  • The last applicable rule wins.

  • If this is a file and the file is ignored, it’s ignored. Unless, that is, it’s in the index already, because then it’s tracked and can’t be ignored.

  • If this is a directory and the directory is ignored, it’s not even opened and read. It’s not in the index because directories are never in the index (at least nominally). If it is opened and read, the entire set of rules here apply recursively.
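The rules above can be sketched as a small model. This is only an illustration under simplifying assumptions, not git's or dulwich's actual matcher: the working tree is a nested dict, only "*", "**", a leading "!" and a trailing "/" are handled, and the index is left out entirely. All function names are made up.

```python
import re

# Nested dict standing in for the working tree: dict = directory, None = file.
TREE = {
    ".gitignore": None,
    "data": {
        "data1": {"file1": None, "file1.dvc": None},
        "data2": {"file2": None, "file2.dvc": None},
    },
}


def compile_rule(line):
    """Turn one .gitignore line into (negated, dir_only, regex). Simplified."""
    negated = line.startswith("!")
    pat = line[1:] if negated else line
    dir_only = pat.endswith("/")  # trailing slash: rule applies to directories only
    pat = pat.rstrip("/")
    anchored = "/" in pat  # a remaining slash pins the pattern to the root
    pat = pat.lstrip("/")
    regex, i = "", 0
    while i < len(pat):
        if pat.startswith("**", i):
            regex += ".*"  # "**" may cross directory separators
            i += 2
        elif pat[i] == "*":
            regex += "[^/]*"  # "*" stays within one path component
            i += 1
        else:
            regex += re.escape(pat[i])
            i += 1
    if not anchored:
        regex = "(?:.*/)?" + regex  # unanchored patterns match at any depth
    return negated, dir_only, re.compile(regex)


def is_ignored(path, is_dir, rules):
    ignored = False
    for negated, dir_only, regex in rules:
        if dir_only and not is_dir:
            continue
        if regex.fullmatch(path):
            ignored = not negated  # the last applicable rule wins
    return ignored


def visible_files(tree, rules, prefix=""):
    found = []
    for name, child in sorted(tree.items()):
        path = prefix + name
        is_dir = isinstance(child, dict)
        if is_ignored(path, is_dir, rules):
            continue  # an ignored directory is never opened
        if is_dir:
            found.extend(visible_files(child, rules, path + "/"))
        else:
            found.append(path)
    return found


if __name__ == "__main__":
    rules = [compile_rule(r) for r in ("data/**", "!data/*/", "!*.dvc")]
    print(visible_files(TREE, rules))
```

In this model, dropping the !data/*/ rule hides the .dvc files as well, because data/data1 and data/data2 stay ignored and are never opened, so !*.dvc is never consulted for anything inside them.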

Whether it was a bug or a bug fix, some commits were reverted and a test case was added 😃

By the way, I think there is a separate issue with dvc.

In our test case, if we re-include the ‘data1’ directory with !data/*/, dvc ignores the .dvc files inside data1; but if it is done with !data/**/, dvc behaves as expected.

In either case, the .dvc files inside the data1 directory are not ignored by git, and the check-ignore output is as follows:

$ git check-ignore -v data/data1/file1.dvc
.gitignore:3:!/data/**/*.dvc   data/data1/file1.dvc
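Note that with -v, check-ignore also reports a matching negated pattern, so a line whose pattern starts with "!" means the path is not ignored. A small sketch for reading such lines follows. The names IgnoreMatch and parse_check_ignore are made up, and the parser assumes the path contains no whitespace and the source file name contains no colon.

```python
from typing import NamedTuple


class IgnoreMatch(NamedTuple):
    source: str   # file the rule came from, e.g. ".gitignore"
    line: int     # line number of the rule inside that file
    pattern: str  # the rule as written, including a leading "!"
    path: str     # the path that was queried

    @property
    def negated(self) -> bool:
        # A leading "!" means the last matching rule re-includes the path.
        return self.pattern.startswith("!")


def parse_check_ignore(line: str) -> IgnoreMatch:
    """Parse one `<source>:<line>:<pattern> <path>` line of verbose output."""
    rule, path = line.rstrip("\n").rsplit(None, 1)
    source, lineno, pattern = rule.split(":", 2)
    return IgnoreMatch(source, int(lineno), pattern, path)


if __name__ == "__main__":
    match = parse_check_ignore(
        ".gitignore:3:!/data/**/*.dvc   data/data1/file1.dvc"
    )
    print(match, "ignored:", not match.negated)
```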

I used another git version, 2.17.1.

If I have understood correctly, the .dvc files are ignored under neither my format nor dulwich’s, but I don’t understand why dvc does not see the .dvc files with my format.

Here is the thread.

Not yet. However, I’ve just sent an email to git mailing list, describing the issue.