dvc: dvc run : possible bug with deeply nested dependencies

Please provide information about your setup

ubuntu 18.04, dvc==0.59.2, pip install with miniconda python 3.7

For deeply nested dependencies, it looks like dvc is not tracking them properly in the .dvc files

The following script reproduces the issue:

#!/bin/bash

set -x
set -e

rm -rf dvc_test
mkdir dvc_test && cd dvc_test
mkdir scripts
mkdir -p data/recommended/dataset1/dataset1_proc
echo bar > data/recommended/dataset1/v1.txt
git init
dvc init
echo -e "import sys\
\nwith open(sys.argv[1], 'w') as f: f.write(sys.argv[2])" > scripts/script.py

# Run works
dvc run -y \
    -w $PWD/data/recommended/dataset1/dataset1_proc\
    -f ./data/recommended/dataset1/dataset1_proc/v1.dvc\
    -d ../../../../scripts/script.py \
    -o v1 \
    "mkdir v1 && python ../../../../scripts/script.py v1/v1.txt proc_data"

# Inspecting v1.dvc shows that the script.py dependency is missing one ../
cat data/recommended/dataset1/dataset1_proc/v1.dvc

# Because of this, repro does not work
dvc repro data/recommended/dataset1/dataset1_proc/v1.dvc

Inspecting the v1.dvc file shows:

md5: 4ae72e168a0a6a2f1aaadfb5628640f7
cmd: mkdir v1 && python ../../../../scripts/script.py v1/v1.txt proc_data
deps:
- md5: 791b9c74b1d9308a3226b93a36689dad
  path: ../../../scripts/script.py
outs:
- md5: 188ed6cb603658d01ef7ba8fb7c434fe.dir
  path: v1
  cache: true
  metric: false
  persist: false
+ dvc repro data/recommended/dataset1/dataset1_proc/v1.dvc

This indicates that the script.py dependency is indeed missing one ../

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 29 (18 by maintainers)

Most upvoted comments

@tdeboissiere Ok, I am able to reproduce this on docker with:

docker pull python
docker run --rm -v $(pwd):/test -w /test python ./test_2483.sh

Investigating. Thank you for your patience 🙂

EDIT: An interesting detail is that the dvcfile has even fewer ../ now:

+ cat data/recommended/dataset1/dataset1_proc/v1.dvc                                                                                 
md5: 3e62014bd6e65b4e9de50e642c13bd19                                                                                                
cmd: mkdir v1 && python ../../../../scripts/script.py v1/v1.txt proc_data                                                            
deps:                                                                                                                                
- md5: 791b9c74b1d9308a3226b93a36689dad                                                                                              
  path: ../../scripts/script.py                                                                                                      
outs:                                                                                                                                
- md5: 188ed6cb603658d01ef7ba8fb7c434fe.dir                                                                                          
  path: v1                                                                                                                           
  cache: true                                                                                                                        
  metric: false                                                                                                                      
  persist: false                                                                                                                     
+ dvc repro data/recommended/dataset1/dataset1_proc/v1.dvc                                                                           
WARNING: Dependency 'data/recommended/scripts/script.py' of 'data/recommended/dataset1/dataset1_proc/v1.dvc' changed because it is 'deleted'.
WARNING: Stage 'data/recommended/dataset1/dataset1_proc/v1.dvc' changed.                                                             
Running command:                                                                                                                     
        mkdir v1 && python ../../../../scripts/script.py v1/v1.txt proc_data                                                         
ERROR: failed to reproduce 'data/recommended/dataset1/dataset1_proc/v1.dvc': missing dependency: data/recommended/scripts/script.py  
                                                                                                                                     
Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!                                             
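
The path named in the warning can be reproduced by joining the directory of the stage file with the dependency path recorded in it. The sketch below assumes dependency paths are resolved against the directory that contains v1.dvc, which is what the warning suggests; the helper name resolve_dep is made up for illustration and is not part of DVC.

```python
import posixpath

# Directory that holds v1.dvc in the reproduction script.
STAGE_DIR = "data/recommended/dataset1/dataset1_proc"


def resolve_dep(stage_dir: str, dep_path: str) -> str:
    """Resolve a recorded dependency path against the stage file's directory.

    Hypothetical helper, not a DVC function.
    """
    return posixpath.normpath(posixpath.join(stage_dir, dep_path))


# Path as recorded in the dvcfile above (only two ../).
print(resolve_dep(STAGE_DIR, "../../scripts/script.py"))
# -> data/recommended/scripts/script.py

# Path as passed to `dvc run -d` (four ../).
print(resolve_dep(STAGE_DIR, "../../../../scripts/script.py"))
# -> scripts/script.py
```

With two ../ the result lands in data/recommended/scripts/, which does not exist, so repro reports the dependency as deleted.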

EDIT2: with --no-scm everything stays the same, so it is unlikely to be GitPython’s fault.

EDIT3: confirmed pretty old regression, investigating closer…

@tdeboissiere My understanding is that when you run a command with dvc run/dvc repro, it should keep your environment variables unchanged (an obvious example is $PATH, which is used to find python and other binaries). That’s not what I see in some cases (that link on Discord). I don’t have an explanation yet: it could be DVC, some zsh settings, or some machine-specific settings. But my thinking was: could that be the case here as well, i.e. some changes to the environment when you run commands with DVC?

@efiop My pleasure, it’s always a treat to get my problems solved here!

@tdeboissiere Merged a fix for this into master, will release a new dvc version with it ASAP. In the meanwhile, you could try installing from master to check if that works for you too. I.e.

pip uninstall -y dvc; pip install git+https://github.com/iterative/dvc

Thank you so much for reporting this issue and helping us investigate it! We really appreciate that 🙂

@tdeboissiere Ok, the patch is taking a bit longer, because the bug is quite deep and the proper solution temporarily breaks other parts of the code. Basically, the issue is os.path.relpath, which we use in PathInfo.__str__, which in turn is used by PathInfo.as_posix() when we dump the dvc file after dvc run. Depending on where you are located, it can resolve a relative path differently. E.g. if you are in /home/user and run os.path.relpath("../path") you’ll get ../path, but if you are in / then you’ll get path, because there is nothing above the filesystem root for ../ to climb to. That is where your ../ went missing. The difference between your machine and mine is that you were running from /home/user/subdir and I was running from /home/user/git/dvc/subdir. So a workaround would be to simply move your root directory a few levels deeper.
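
The root-clamping behaviour described above can be checked without changing directories. This is a minimal sketch using only the standard posixpath module; relpath_from is a made-up helper that mimics os.path.relpath(rel) for a given working directory, and the directory names are illustrative guesses at the layouts discussed in this thread, not DVC's actual code path.

```python
import posixpath


def relpath_from(cwd: str, rel: str) -> str:
    """Mimic os.path.relpath(rel) as if the process were running in `cwd`.

    Hypothetical helper for illustration only.
    """
    # Same steps relpath performs: make the path absolute, then relativize it.
    # normpath silently drops any ".." that would climb above "/".
    absolute = posixpath.normpath(posixpath.join(cwd, rel))
    return posixpath.relpath(absolute, start=cwd)


# The example from the comment above.
print(relpath_from("/home/user", "../path"))  # -> ../path
print(relpath_from("/", "../path"))           # -> path

dep = "../../../../scripts/script.py"  # four ../, as passed to `dvc run -d`

# Repository two levels below root (assumed docker layout): two ../ survive.
print(relpath_from("/test/dvc_test", dep))               # -> ../../scripts/script.py
# Repository three levels below root: three ../ survive.
print(relpath_from("/home/user/dvc_test", dep))          # -> ../../../scripts/script.py
# Repository five levels below root: all four ../ survive.
print(relpath_from("/home/user/git/dvc/dvc_test", dep))  # -> ../../../../scripts/script.py
```

The number of ../ that survive matches the depth of the working directory, which is consistent with the two dvcfiles shown earlier and with the workaround of moving the project a few levels deeper.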

ETA for a fixed release is tomorrow.

  • Ran the same code in a basic docker image on my laptop (miniconda3 python 3.7, ubuntu 18.04, no zsh), same error
  • Ran the same code in a basic docker image on my laptop (system python3 and ubuntu 18.04, no zsh), same error
  • Ran the same code in an ubuntu 18.04 azure VM (no zsh, system python), same error

Ran the following script on an ubuntu 18.04 laptop in /home/user/debug with bash debug.sh

#!/bin/bash                                                                  
                                                                             
set -x                                                                       
set -e                                                                       

rm -rf dvc_test
mkdir dvc_test
cd dvc_test
env > env_before.txt
git init
dvc init
dvc run -o env_after.txt "env > env_after.txt"

The only line which is different between env_before.txt and env_after.txt is

OLDPWD=/home/user/debug # before
OLDPWD=/home/user/debug/dvc_test # after
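
A comparison like this can be automated with a few lines of Python. The sketch below assumes each variable fits on a single KEY=VALUE line (multi-line values would need extra handling); the function names and the sample PATH value are illustrative only.

```python
def parse_env(text: str) -> dict:
    """Parse `env` output into a dict, assuming one KEY=VALUE pair per line."""
    pairs = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '='
            pairs[key] = value
    return pairs


def env_diff(before: str, after: str) -> dict:
    """Return {key: (before_value, after_value)} for variables that differ."""
    a, b = parse_env(before), parse_env(after)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


# Sample data; in practice read env_before.txt and env_after.txt instead.
before = "PATH=/usr/bin\nOLDPWD=/home/user/debug\n"
after = "PATH=/usr/bin\nOLDPWD=/home/user/debug/dvc_test\n"
print(env_diff(before, after))
# -> {'OLDPWD': ('/home/user/debug', '/home/user/debug/dvc_test')}
```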

@efiop Nope, still got the same error, on multiple machines