dvc: `dvc repro --dry --allow-missing`: fails on missing data
I tried to update our DVC CI pipeline.
Currently we run the following commands (among others):
dvc pull
to check that everything is pushed, and
dvc status
to check that the DVC status is clean; in other words, that no stage would run if one ran dvc repro.
But pulling takes a long time, and with the new --allow-missing feature I thought I could skip it with:
dvc data status --not-in-remote --json | grep -v not_in_remote
dvc repro --allow-missing --dry
The first works as expected: it fails if someone forgot to push data and succeeds if everything was pushed. But the latter just fails on missing data.
Reproduce
Example: the outcome on Machine Two and Machine Three should be the same, since both are synced.
Machine One:
- dvc repro -f
- git add . && git commit -m "repro" && dvc push && git push
- dvc repro --allow-missing --dry → doesn't fail, nothing changed (as expected)
Machine Two:
4. dvc data status --not-in-remote --json | grep -v not_in_remote → does not fail; everything is pushed and would be pulled
5. dvc repro --allow-missing --dry → fails on missing data (unexpected)
Machine Three:
4. dvc pull
5. dvc status → succeeds
Expected
On a machine where I did not run dvc pull, given a clean git state and a clean
dvc data status --not-in-remote --json | grep -v not_in_remote
I would expect that dvc repro --allow-missing --dry
would succeed and show me that no stage had to run.
Environment information
Linux
Output of `dvc doctor`:

```
$ dvc doctor
DVC version: 3.13.2 (pip)
-------------------------
Platform: Python 3.10.11 on Linux-5.9.0-0.bpo.5-amd64-x86_64-with-glibc2.35
Subprojects:
    dvc_data = 2.12.1
    dvc_objects = 0.24.1
    dvc_render = 0.5.3
    dvc_task = 0.3.0
    scmrepo = 1.1.0
Supports:
    azure (adlfs = 2023.4.0, knack = 0.11.0, azure-identity = 1.13.0),
    gdrive (pydrive2 = 1.16.1),
    gs (gcsfs = 2023.6.0),
    hdfs (fsspec = 2023.6.0, pyarrow = 12.0.1),
    http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
    oss (ossfs = 2021.8.0),
    s3 (s3fs = 2023.6.0, boto3 = 1.28.17),
    ssh (sshfs = 2023.7.0),
    webdav (webdav4 = 0.9.8),
    webdavs (webdav4 = 0.9.8),
    webhdfs (fsspec = 2023.6.0)
Config:
    Global: /home/runner/.config/dvc
    System: /etc/xdg/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: ssh
Workspace directory: ext4 on /dev/nvme0n1p2
Repo: dvc, git
```
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 24 (17 by maintainers)
Commits related to this issue
- repo: allow_missing: Compare hash_info.value instead of hash_info. Closes #9818 — committed to iterative/dvc by daavoo a year ago
So, to give context: the problem appears if there is a
.dvc
file in 2.x format: https://github.com/iterative/example-get-started-experiments/blob/9dba21cbffb0caad939c63db427eea7f16f3c269/data/pool_data.dvc#L1-L5
that is referenced in a
dvc.lock
in 3.x format as a dependency: https://github.com/iterative/example-get-started-experiments/blob/9dba21cbffb0caad939c63db427eea7f16f3c269/dvc.lock#L6-L10
As soon as the contents associated with the
.dvc
file are updated, it will be rewritten in 3.x format, so the problem would disappear.
Strictly speaking, I guess there could be a collision where we would misidentify two different things as being the same 🤷
I mean that we should not consider it modified in the example-get-started-experiments scenario.
Can't say off the top of my head. I would need to take a closer look to see what makes sense.
Yes, sorry for the confusion @Otterpatsch. I initially thought
dvc commit -f
would achieve that, but it doesn't do that today. We are looking into changing that, but for now you would need to do this yourself.

So I fixed the issue (I think) on our side. I basically ran
dvc repro --allow-missing --dry
a couple of times, each run surfacing one of the datasets that was still in DVC 2.x format. Then I re-added those, and it no longer crashes. But now the pipeline succeeds even though I get the following lines in the output, which makes sense because I changed a lot of .dvc files that are also in that path.
How can I fix this? It seems that I'm not using the correct command for my pipeline. The command succeeds, but in a pipeline sense it should fail, because a repro would run if I just used
dvc repro
. I believe I'm missing something similar to the dvc data status one,
dvc data status --not-in-remote --json | grep -v not_in_remote
which has the grep, but I'm not sure how to do the same for dvc repro --allow-missing --dry so it fails for all kinds of dependencies.
So I tried:
dvc repro --dry --allow-missing | grep -v "Running stage "
But it still succeeds, even though if I just use grep "Running stage " I get some output.

Seems it does.
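The `grep -v` pipe cannot work as a CI guard: `grep -v` exits 0 as long as at least one line does *not* match, and `dvc repro --dry` always prints other lines, so the pipeline almost always succeeds. A minimal sketch of a guard that fails when any stage would run; the dvc output is simulated with printf here so the snippet is self-contained:

```shell
# Simulated `dvc repro --allow-missing --dry` output; in a real CI job
# you would capture the actual command's output instead.
repro_output=$(printf 'Running stage "train":\nUpdating lock file dvc.lock\n')

# grep -q exits 0 iff "Running stage" occurs anywhere; invert that to
# get a pass/fail flag for the pipeline.
if printf '%s\n' "$repro_output" | grep -q "Running stage"; then
    dirty=1   # in a real CI job: echo "repro needed" >&2; exit 1
else
    dirty=0
fi
echo "dirty=$dirty"
```

With the real command this collapses to `! dvc repro --allow-missing --dry | grep -q "Running stage"`, which exits non-zero exactly when a stage would run.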
After deleting /var/tmp/dvc (which did exist), the error persists.
Looks like it is failing in my example because
data/pool_data.dvc
is in legacy 2.x format, so the hash info doesn't match the stage dependency here: https://github.com/iterative/dvc/blob/04e891cef929567794ade4e0c2a1bf399666f66e/dvc/stage/__init__.py#L315-L321
The hashes are the same, but debugging shows that the differing hash names make the comparison fail:
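The name mismatch (and the fix in the commit referenced above) can be illustrated without DVC's internals. This sketch assumes, as in DVC 3.x, that legacy hashes are tagged `md5-dos2unix` internally while new ones are plain `md5`; the digest value is made up:

```shell
# Legacy 2.x .dvc entry vs. 3.x dvc.lock entry: same digest, different
# internal hash name (values are illustrative, not taken from the issue).
old_name="md5-dos2unix"; old_value="14d187e749ee5614e105741c719fa185"
new_name="md5";          new_value="14d187e749ee5614e105741c719fa185"

# Pre-fix style comparison (name and value together) reports "modified":
if [ "$old_name:$old_value" = "$new_name:$new_value" ]; then
    full_match=yes
else
    full_match=no
fi

# Post-fix style comparison (value only) matches:
if [ "$old_value" = "$new_value" ]; then value_match=yes; else value_match=no; fi

echo "full_match=$full_match value_match=$value_match"
# prints: full_match=no value_match=yes
```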
@Otterpatsch Does
datasets/benchmark-sets/customer0/2020_11_02.dvc
contain the line
hash: md5
(that line is only present in 3.x files)? Also, could you try to delete the site cache dir?

I don't hit any "error", just the notification (due to --dry) that stages would run, plus a further notification that some (DVC-tracked) files are missing. But maybe my assumption is wrong that
dvc repro --allow-missing --dry
should not fail, and should report everything as fine and up to date, when the repro was done and pushed successfully from some other machine. I'm very much confused by now. Just to clarify: if I run dvc pull and then dvc status, everything is reported as fine.
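The 2.x-vs-3.x question above can be checked mechanically: per the maintainer's hint, 3.x .dvc files carry a `hash: md5` field in their entries, while legacy 2.x files do not. A sketch, with a fabricated 3.x-style .dvc file written to a temp path so the snippet is self-contained:

```shell
# A made-up 3.x-style .dvc file (note the `hash: md5` line):
tmpdvc=$(mktemp)
cat > "$tmpdvc" <<'EOF'
outs:
- md5: 14d187e749ee5614e105741c719fa185.dir
  hash: md5
  path: pool_data
EOF

# Legacy files lack this field, so its absence flags a candidate for
# the hash-name mismatch described in this thread.
if grep -q '^[[:space:]]*hash: md5$' "$tmpdvc"; then
    echo "3.x format"
else
    echo "legacy 2.x format"
fi
rm -f "$tmpdvc"
```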