dvc: `dvc repro --dry --allow-missing`: fails on missing data

I am trying to update our DVC CI pipeline.

Currently we run the following commands (among others):

dvc pull to check that everything is pushed, then dvc status to check that the DVC status is clean, i.e. that no stage would be run if one ran dvc repro.
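
For context, a rough sketch of that gate as a CI step (the grep for the "up to date" message is my own addition; the exact message text may differ between DVC versions):

set -eu

# pull everything, then require a clean status
dvc pull
dvc status | grep "up to date" || { echo "dvc status is not clean" >&2; exit 1; }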

But pulling takes a long time, and with the new --allow-missing feature I thought I could skip it with:

dvc data status --not-in-remote --json | grep -v not_in_remote
dvc repro --allow-missing --dry

The first works as expected: it fails if someone forgot to push data and succeeds otherwise. But the latter just fails on the missing data.

Reproduce

Example: Machine Two and Machine Three should give the same result (failure/success should be in sync).

Machine One:

  1. dvc repro -f
  2. git add . && git commit -m "repro" && dvc push && git push
  3. dvc repro --allow-missing --dry -> doesn't fail, nothing changed (as expected)

Machine Two:

  4. dvc data status --not-in-remote --json | grep -v not_in_remote -> does not fail, everything is pushed and would be pulled
  5. dvc repro --allow-missing --dry -> fails on missing data (unexpected)

Machine Three:

  4. dvc pull
  5. dvc status -> succeeds

Expected

On a machine where I did not run dvc pull, given a clean git state and a clean dvc data status --not-in-remote --json | grep -v not_in_remote, I would expect dvc repro --allow-missing --dry to succeed and show me that no stage has to run.
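
Roughly, the gate I am trying to build looks like the following sketch (the two commands are the ones from above; the behaviour of the second one is exactly what this issue is about):

set -eu

# 1) fail if any tracked data was never pushed to the remote
dvc data status --not-in-remote --json | grep -v not_in_remote

# 2) expected: succeed and report that no stage has to run,
#    even though the data itself was never pulled to this machine
dvc repro --allow-missing --dry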

Environment information

Linux

Output of dvc doctor:

$ dvc doctor
09:16:47  DVC version: 3.13.2 (pip)
09:16:47  -------------------------
09:16:47  Platform: Python 3.10.11 on Linux-5.9.0-0.bpo.5-amd64-x86_64-with-glibc2.35
09:16:47  Subprojects:
09:16:47  	dvc_data = 2.12.1
09:16:47  	dvc_objects = 0.24.1
09:16:47  	dvc_render = 0.5.3
09:16:47  	dvc_task = 0.3.0
09:16:47  	scmrepo = 1.1.0
09:16:47  Supports:
09:16:47  	azure (adlfs = 2023.4.0, knack = 0.11.0, azure-identity = 1.13.0),
09:16:47  	gdrive (pydrive2 = 1.16.1),
09:16:47  	gs (gcsfs = 2023.6.0),
09:16:47  	hdfs (fsspec = 2023.6.0, pyarrow = 12.0.1),
09:16:47  	http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
09:16:47  	https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
09:16:47  	oss (ossfs = 2021.8.0),
09:16:47  	s3 (s3fs = 2023.6.0, boto3 = 1.28.17),
09:16:47  	ssh (sshfs = 2023.7.0),
09:16:47  	webdav (webdav4 = 0.9.8),
09:16:47  	webdavs (webdav4 = 0.9.8),
09:16:47  	webhdfs (fsspec = 2023.6.0)
09:16:47  Config:
09:16:47  	Global: /home/runner/.config/dvc
09:16:47  	System: /etc/xdg/dvc
09:16:47  Cache types: <https://error.dvc.org/no-dvc-cache>
09:16:47  Caches: local
09:16:47  Remotes: ssh
09:16:47  Workspace directory: ext4 on /dev/nvme0n1p2
09:16:47  Repo: dvc, git

Most upvoted comments

So, to give context, the problem appears if there is a .dvc file in 2.X format:

https://github.com/iterative/example-get-started-experiments/blob/9dba21cbffb0caad939c63db427eea7f16f3c269/data/pool_data.dvc#L1-L5

That is referenced in a dvc.lock in 3.X format as a dependency:

https://github.com/iterative/example-get-started-experiments/blob/9dba21cbffb0caad939c63db427eea7f16f3c269/dvc.lock#L6-L10

As soon as the contents associated with the .dvc are updated, the file will be updated to 3.X format so the problem would disappear.

Do you think we should only compare the hash value and not all hash info? I can't say off the top of my head; I would need to take a closer look to see what makes sense.

Strictly speaking, I guess there could be a collision where we would be misidentifying two different things as the same 🤷

@daavoo What does that mean?

I mean that we should not consider it modified in the example-get-started-experiments scenario.

So how do I get rid of the DVC 2.x .dvc files and replace them with their DVC 3.x counterparts?

Yes, sorry for the confusion @Otterpatsch. I initially thought dvc commit -f would achieve that, but it doesn’t do that today. We are looking into changing that, but for now you would need to do this yourself.
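
One possible way to do that by hand (a sketch, not an official migration command; paths are illustrative and the data must exist locally so DVC can recompute the hashes):

# list tracked .dvc files that lack a "hash: md5" line, i.e. are still in 2.x format
grep -rL "hash: md5" --include="*.dvc" .

# re-add one of them so DVC 3.x rewrites its .dvc file in the new format
dvc add datasets/training-sets/some-dataset        # illustrative path
git add datasets/training-sets/some-dataset.dvc
git commit -m "migrate .dvc file to 3.x format"
dvc push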

So I fixed the issue (I think) on our side. I basically ran dvc repro --allow-missing --dry a couple of times; each run surfaced one of the datasets that was still in DVC 2.x format. Then I re-added those, and it no longer crashes.

But now the pipeline succeeds even though I get the following lines in the output. This makes sense, because I changed a lot of .dvc files that are under that path.

13:57:33  2023-08-21 11:57:24,369 DEBUG: built tree 'object 880a0f10a0350a3ed636a6a395a7cd4a.dir'
13:57:33  2023-08-21 11:57:24,370 DEBUG: Dependency 'datasets/training-sets' of stage: 'training' changed because it is 'modified'.
13:57:33  2023-08-21 11:57:24,371 DEBUG: stage: 'training' changed.
13:57:33  2023-08-21 11:57:24,384 DEBUG: built tree 'object 880a0f10a0350a3ed636a6a395a7cd4a.dir'
13:57:33  2023-08-21 11:57:24,386 DEBUG: built tree 'object 2ead35ca4cf9b96e0f4ad3cc696e78d7.dir'
13:57:33  2023-08-21 11:57:24,397 DEBUG: built tree 'object 880a0f10a0350a3ed636a6a395a7cd4a.dir'
13:57:33  2023-08-21 11:57:24,397 DEBUG: {'datasets/training-sets': 'modified'}
13:57:33  2023-08-21 11:57:24,408 DEBUG: built tree 'object 880a0f10a0350a3ed636a6a395a7cd4a.dir'
13:57:33  2023-08-21 11:57:24,409 DEBUG: built tree 'object 2ead35ca4cf9b96e0f4ad3cc696e78d7.dir'
13:57:33  Running stage 'training':
13:57:33  > conda env export --prefix .conda-envs/training | grep -v "\(^prefix:\)\|\(^name:\)" > stages/training/exported-conda-env.yaml
13:57:33  > conda run --no-capture --prefix .conda-envs/training/ mmocr train --config_path stages/training/abinet_config_handwriting.py
13:57:33  > conda run --no-capture --prefix .conda-envs/training/ python dependencies/scripts/rename.py model/
13:57:33  > cp -r stages/training/charsets model/
13:57:33  2023-08-21 11:57:24,412 DEBUG: stage: 'training' was reproduced

How can I fix this? It seems I am not using the correct command for my pipeline. The command succeeds, but in a pipeline sense it should fail, because a repro would be run if I just used dvc repro.

I believe I am missing something similar to the grep in the dvc data status check (dvc data status --not-in-remote --json | grep -v not_in_remote), but I am not sure how to do the equivalent for dvc repro --allow-missing --dry so that it fails for all kinds of dependencies.

So I tried: dvc repro --dry --allow-missing | grep -v "Running stage ". But it still succeeds, even though if I just use grep "Running stage " I get some output:

> dvc repro --dry --allow-missing | grep "Running stage "
Running stage 'training':
Running stage 'collect_benchmarks':
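
The grep -v variant cannot fail here: grep -v exits 0 as long as it prints at least one line that does not match, which a dry run practically always produces. To fail the step when any stage would run, test for the match instead, e.g. (a sketch):

set -eu

# capture the dry run; a real dvc error still fails the job under "set -e"
output="$(dvc repro --dry --allow-missing)"
echo "$output"

# fail if any stage would be (re)run
if echo "$output" | grep -q "Running stage "; then
  echo "dvc repro would run stages; the pipeline is not up to date" >&2
  exit 1
fi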

Does datasets/benchmark-sets/customer0/2020_11_02.dvc contain the line hash: md5 (that line is only present in 3.x files)?

outs:
- md5: f4eb1691cb23a5160a958274b9b9fb41.dir
  size: 55860614
  nfiles: 5491
  path: '2020_11_02'

seems it does
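
For comparison, an out entry written by DVC 3.x additionally carries a hash: md5 line, e.g. (same values as above, with only that line added for illustration):

outs:
- md5: f4eb1691cb23a5160a958274b9b9fb41.dir
  size: 55860614
  nfiles: 5491
  hash: md5
  path: '2020_11_02'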

Also, could you try to delete the site cache dir?

After deleting /var/tmp/dvc (it did exist), the error persists.

Looks like it is failing in my example because data/pool_data.dvc is in legacy 2.x format, so the hash info doesn’t match the stage dep here:

https://github.com/iterative/dvc/blob/04e891cef929567794ade4e0c2a1bf399666f66e/dvc/stage/__init__.py#L315-L321

The hashes are the same, but debugging shows that the different hash names make it fail:

(Pdb) out.hash_info
HashInfo(name='md5-dos2unix', value='14d187e749ee5614e105741c719fa185.dir', obj_name=None)
(Pdb) dep.hash_info
HashInfo(name='md5', value='14d187e749ee5614e105741c719fa185.dir', obj_name=None)

I don't hit any "error", just the notification (due to --dry) that stages would run, and further notifications that some files (DVC-tracked) are missing. But maybe my assumption is wrong that dvc repro --allow-missing --dry should not fail and should report that everything is fine and up to date when I use those flags, given that the repro was done and pushed successfully from some other machine. I am very much confused by now.

Just to clarify: if I run dvc pull and then dvc status, everything is reported as fine.

dvc repro --allow-missing --dry
11:18:32  'datasets/benchmark-sets/customer0/2020_11_02.dvc' didn't change, skipping
...
11:18:32  'datasets/training-sets/customer/customerN/customerN_empty_consignment_field_faxified.dvc' didn't change, skipping
11:18:32  Running stage 'training':
11:18:32  > conda env export --prefix .conda-envs/training | grep -v "\(^prefix:\)\|\(^name:\)" > stages/training/exported-conda-env.yaml
11:18:32  > conda run --no-capture --prefix .conda-envs/training/ mmocr train --config_path stages/training/abinet_config_handwriting.py
11:18:32  > conda run --no-capture --prefix .conda-envs/training/ python dependencies/scripts/rename.py model/
11:18:32  > cp -r stages/training/charsets model/
11:18:32  
11:18:32  Stage 'extract@customer0/2020_11_02/Formularmerkmal_Ansprechpartner' didn't change, skipping
11:18:32  Stage 'extract@customer0/2020_11_02/Formularmerkmal_Beinstueck' didn't change, skipping
11:18:32  Stage 'extract@customer0/2020_11_02/Formularmerkmal_Kommission' didn't change, skipping
11:18:32  Stage 'extract@customer0/2020_11_02/Formularmerkmal_Kundenname' didn't change, skipping
11:18:32  'datasets/benchmark-sets/company/emails_2021-03-22.dvc' didn't change, skipping
11:18:32  ERROR: failed to reproduce 'extract@company/emails_2021-03-22': [Errno 2] No such file or directory: '/var/jenkins_home/workspace/repo_namecompany_MR-20/datasets/benchmark-sets/company/emails_2021-03-22'