dvc: dvc run : possible bug with deeply nested dependencies
Please provide information about your setup
ubuntu 18.04, dvc==0.59.2, pip install with miniconda python 3.7
For deeply nested dependencies, it looks like dvc is not tracking them properly in the .dvc files
The following script reproduces the issue:
#!/bin/bash
set -x
set -e
rm -rf dvc_test
mkdir dvc_test && cd dvc_test
mkdir scripts
mkdir -p data/recommended/dataset1/dataset1_proc
echo bar > data/recommended/dataset1/v1.txt
git init
dvc init
echo -e "import sys\
\nwith open(sys.argv[1], 'w') as f: f.write(sys.argv[2])" > scripts/script.py
# Run works
dvc run -y \
    -w $PWD/data/recommended/dataset1/dataset1_proc\
    -f ./data/recommended/dataset1/dataset1_proc/v1.dvc\
    -d ../../../../scripts/script.py \
    -o v1 \
    "mkdir v1 && python ../../../../scripts/script.py v1/v1.txt proc_data"
# Inspecting v1.dvc shows that the script.py dependency is missing one ../
cat data/recommended/dataset1/dataset1_proc/v1.dvc
# Because of this, repro does not work
dvc repro data/recommended/dataset1/dataset1_proc/v1.dvc
Inspecting the v1.dvc file shows:
md5: 4ae72e168a0a6a2f1aaadfb5628640f7
cmd: mkdir v1 && python ../../../../scripts/script.py v1/v1.txt proc_data
deps:
- md5: 791b9c74b1d9308a3226b93a36689dad
  path: ../../../scripts/script.py
outs:
- md5: 188ed6cb603658d01ef7ba8fb7c434fe.dir
  path: v1
  cache: true
  metric: false
  persist: false
+ dvc repro data/recommended/dataset1/dataset1_proc/v1.dvc
Which indicates that the script.py dependency is indeed missing one ../
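To make the mismatch concrete, here is a minimal sketch that resolves both the recorded path and the expected path against the stage's working directory. The /repo root is a hypothetical placeholder for wherever dvc_test lives, and resolve is an illustrative helper, not a DVC function.

```python
import posixpath

# Hypothetical absolute repo root, used only to make the paths concrete.
REPO = "/repo"
WDIR = posixpath.join(REPO, "data/recommended/dataset1/dataset1_proc")
SCRIPT = posixpath.join(REPO, "scripts/script.py")


def resolve(wdir, dep):
    """Resolve a dependency path recorded relative to the stage's working directory."""
    return posixpath.normpath(posixpath.join(wdir, dep))


# The path one would expect the .dvc file to record for the dependency.
expected_dep = posixpath.relpath(SCRIPT, WDIR)
# The path that was actually written to v1.dvc in the report above.
recorded_dep = "../../../scripts/script.py"

print(expected_dep)                 # ../../../../scripts/script.py
print(resolve(WDIR, expected_dep))  # /repo/scripts/script.py
print(resolve(WDIR, recorded_dep))  # /repo/data/scripts/script.py
```

With only three ../ components the dependency resolves to a path under data/, where no script.py exists, which is consistent with dvc repro failing.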
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 29 (18 by maintainers)
@tdeboissiere Ok, I am able to reproduce this on docker with:
Investigating. Thank you for your patience 🙂
EDIT: Interesting detail is that the dvcfile has even fewer ../ now:
EDIT2: with --no-scm everything stays the same, so it is unlikely to be GitPython’s fault.
EDIT3: confirmed that this is a pretty old regression, investigating closer…
@tdeboissiere my understanding is that when you run a command with dvc run / dvc repro, it should keep your environment variables unchanged (an obvious example is $PATH, which is used to find python and other binaries). That’s not what I see in some cases (that link on Discord). I don’t have an explanation yet: whether it is DVC, some zsh settings, or some machine-specific settings, we don’t know. But my thinking was: could that be the case here as well, i.e. some changes to the environment when you run commands with DVC?
@efiop My pleasure, it’s always a treat to get my problems solved here!
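One way to test the environment hypothesis raised in this thread is to dump the environment (for example with env) once in a plain shell and once from inside a command run through DVC, then compare the two dumps. The sketch below assumes simple KEY=VALUE lines; parse_env and diff_env are illustrative helper names, and multi-line values are not handled.

```python
def parse_env(text):
    """Parse env-style output (one KEY=VALUE per line) into a dict."""
    env = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that contain no '='
            env[key] = value
    return env


def diff_env(before, after):
    """Return {name: (old, new)} for every variable that differs."""
    names = sorted(set(before) | set(after))
    return {
        name: (before.get(name), after.get(name))
        for name in names
        if before.get(name) != after.get(name)
    }


# Made-up sample dumps, only to show the shape of the result.
before = parse_env("PATH=/usr/bin\nHOME=/home/user\n")
after = parse_env("PATH=/opt/bin:/usr/bin\nHOME=/home/user\nFOO=1\n")
print(diff_env(before, after))
```

A variable missing from one side shows up with None in the corresponding position.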
@tdeboissiere Merged a fix for this into master, will release a new dvc version with it ASAP. In the meantime, you could try installing from master to check if that works for you too. I.e.
Thank you so much for reporting this issue and helping us investigate it! We really appreciate that 🙂
@tdeboissiere Ok, the patch is taking a bit longer, because the bug is quite deep and the proper solution breaks other parts of the code temporarily. Basically, the issue is os.path.relpath that we are using in PathInfo.__str__, which in turn gets used in PathInfo.as_posix() when we are dumping the dvc file after dvc run. So depending on where you are located, it might resolve a relative path differently. E.g. if you are in /home/user and run os.path.relpath("../path") you’ll get ../path, but if you are in / then you’ll get path. That is where your ../ went missing. The difference between my machine and yours is that you were running from /home/user/subdir and I was running from /home/user/git/dvc/subdir. So a workaround would be to simply move your root directory a few levels deeper.
ETA for a fixed release is tomorrow.
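The working-directory dependence described above can be reproduced without changing directories. This is a minimal sketch: relpath_from is an illustrative helper (not part of DVC) that mimics what os.path.relpath(path) does when the process is running in cwd, using posixpath so the result does not depend on the host OS.

```python
import posixpath


def relpath_from(cwd, path):
    """Mimic os.path.relpath(path) as if the process were running in `cwd`.

    Illustrative helper, not part of DVC: it makes `path` absolute against
    `cwd`, then expresses the result relative to `cwd` again.
    """
    absolute = posixpath.normpath(posixpath.join(cwd, path))
    return posixpath.relpath(absolute, cwd)


print(relpath_from("/home/user", "../path"))  # ../path
print(relpath_from("/", "../path"))           # path
print(relpath_from("/a", "../../../../x"))    # ../x
```

The last call shows the general effect: any ../ that would climb above the filesystem root is silently dropped during normalization, so the round trip through an absolute path can return fewer ../ components than the input had.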
Ran the following script on an Ubuntu 18.04 laptop in /home/user/debug with bash debug.sh
The only line which is different between env_before.txt and env_after.txt is
@efiop Nope, still got the same error, on multiple machines