# dvc: `dvc pull` fails on a cloned repository that contains dvc-imported files

**Bug Report**

## Description
We have a data registry repo (`datasets`) that contains different datasets in folders, e.g. `dataset1`, and a separate `model` repo that `dvc import`s one of those folders. If we clone the `model` repo and run `dvc pull`, it fails to pull the data from the remote. I tried to replicate this error in Docker containers based on the `python:3.10` and `ubuntu:22.04` images, with the latest DVC version `3.13.2` and Python `3.10.12`, but I was unable to: there it works. However, on my current machine running Ubuntu 22.04 LTS, it fails:
```
$ uname -a
Linux mzurad 5.19.0-50-generic #50-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 10 18:24:29 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
$ python3 -V
Python 3.10.12
$ git --version
git version 2.34.1
```
## Reproduce (with or without external cache)
```shell
# remove comments to test with external cache
rm -rf testing-dvc
mkdir testing-dvc
cd testing-dvc
mkdir storage
#mkdir cache

function init_dvc {
    git init .
    dvc init
    #dvc cache dir ../cache
    #dvc config cache.type symlink
    dvc remote add -d local ../storage
}

function create_datasets {
    mkdir datasets
    cd datasets
    init_dvc
    mkdir dataset1
    cd dataset1
    echo {} > gt.json
    mkdir images
    cd images
    echo 1 > img1.png
    echo 2 > img2.png
    cd ../..
    dvc add -vvv dataset1/*
    git add .
    git commit -m 'adding dataset1'
    dvc push -vvv
    cd ..
}

function create_model {
    mkdir model
    cd model
    init_dvc
    mkdir data
    dvc import -vvv -o data/raw --rev main ../datasets dataset1
    git add .
    git commit -m 'dvc import dataset1'
    dvc push -vvv
    dvc status -vvv
    dvc data status -vvv
    cd ..
}

create_datasets
create_model

git clone model model_cloned
cd model_cloned
#dvc cache dir --unset
dvc pull
dvc doctor
cd ..
tree
```
## Expected

The expected result is that `model_cloned` contains the same files as `model`, so the output of `tree testing-dvc` should be:
```
.
├── datasets
│   └── dataset1
│       ├── gt.json
│       ├── gt.json.dvc
│       ├── images
│       │   ├── img1.png
│       │   └── img2.png
│       └── images.dvc
├── model
│   └── data
│       ├── raw
│       │   ├── gt.json
│       │   └── images
│       │       ├── img1.png
│       │       └── img2.png
│       └── raw.dvc
├── model_cloned
│   └── data
│       ├── raw
│       │   ├── gt.json
│       │   └── images
│       │       ├── img1.png
│       │       └── img2.png
│       └── raw.dvc
└── storage/files/md5
    ├── 26
    │   └── ab0db90d72e28ad0ba1e22ee510510
    ├── 8a
    │   └── 80554c91d9fca8acb82f023de02f11
    ├── b0
    │   └── 26324c6904b2a9cb4b88d6d61c81d1
    └── fb
        └── 1baa69ab7cc94156ef79a5a674bd8b.dir
```
Instead, it errors out. Here is the detailed output of `dvc pull -vvv` inside `model_cloned`:
```
$ dvc pull -vvv
2023-08-10 02:54:21,186 DEBUG: v3.13.2 (pip), CPython 3.10.12 on Linux-5.19.0-50-generic-x86_64-with-glibc2.35
2023-08-10 02:54:21,186 DEBUG: command: /home/mzurad/code/testing/.venv/bin/dvc pull -vvv
2023-08-10 02:54:21,186 TRACE: Namespace(quiet=0, verbose=3, cprofile=False, cprofile_dump=None, yappi=False, yappi_separate_threads=False, viztracer=False, viztracer_depth=None, viztracer_async=False, pdb=False, instrument=False, instrument_open=False, show_stack=False, cd='.', cmd='pull', jobs=None, targets=[], remote=None, all_branches=False, all_tags=False, all_commits=False, force=False, with_deps=False, recursive=False, run_cache=False, glob=False, allow_missing=False, func=<class 'dvc.commands.data_sync.CmdDataPull'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'argparse.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2023-08-10 02:54:21,342 TRACE: 705.13 mks in collecting stages from /home/mzurad/code/testing/testing-dvc/model_cloned
2023-08-10 02:54:21,351 TRACE: 8.68 ms in collecting stages from /home/mzurad/code/testing/testing-dvc/model_cloned/data
2023-08-10 02:54:21,377 TRACE: 6.64 mks in collecting stages from /
2023-08-10 02:54:21,379 TRACE: 1.31 ms in collecting stages from /dataset1
2023-08-10 02:54:21,392 WARNING: No file hash info found for '/home/mzurad/code/testing/testing-dvc/model_cloned/data/raw/.gitignore'. It won't be created.
2023-08-10 02:54:21,392 DEBUG: failed to create '/home/mzurad/code/testing/testing-dvc/model_cloned/data/raw/.gitignore' from 'None'
Traceback (most recent call last):
  File "/home/mzurad/code/testing/.venv/lib/python3.10/site-packages/dvc_data/index/checkout.py", line 90, in _create_files
    src_fs, src_path = storage_obj.get(entry)
  File "/home/mzurad/code/testing/.venv/lib/python3.10/site-packages/dvc_data/index/index.py", line 173, in get
    raise ValueError
ValueError
2023-08-10 02:54:21,393 WARNING: No file hash info found for '/home/mzurad/code/testing/testing-dvc/model_cloned/data/raw/gt.json'. It won't be created.
2023-08-10 02:54:21,393 DEBUG: failed to create '/home/mzurad/code/testing/testing-dvc/model_cloned/data/raw/gt.json' from 'None'
Traceback (most recent call last):
  File "/home/mzurad/code/testing/.venv/lib/python3.10/site-packages/dvc_data/index/checkout.py", line 90, in _create_files
    src_fs, src_path = storage_obj.get(entry)
  File "/home/mzurad/code/testing/.venv/lib/python3.10/site-packages/dvc_data/index/index.py", line 173, in get
    raise ValueError
ValueError
2023-08-10 02:54:21,393 WARNING: No file hash info found for '/home/mzurad/code/testing/testing-dvc/model_cloned/data/raw/images/img2.png'. It won't be created.
2023-08-10 02:54:21,393 DEBUG: failed to create '/home/mzurad/code/testing/testing-dvc/model_cloned/data/raw/images/img2.png' from 'None'
Traceback (most recent call last):
  File "/home/mzurad/code/testing/.venv/lib/python3.10/site-packages/dvc_data/index/checkout.py", line 90, in _create_files
    src_fs, src_path = storage_obj.get(entry)
  File "/home/mzurad/code/testing/.venv/lib/python3.10/site-packages/dvc_data/index/index.py", line 173, in get
    raise ValueError
ValueError
2023-08-10 02:54:21,393 WARNING: No file hash info found for '/home/mzurad/code/testing/testing-dvc/model_cloned/data/raw/images/img1.png'. It won't be created.
2023-08-10 02:54:21,393 DEBUG: failed to create '/home/mzurad/code/testing/testing-dvc/model_cloned/data/raw/images/img1.png' from 'None'
Traceback (most recent call last):
  File "/home/mzurad/code/testing/.venv/lib/python3.10/site-packages/dvc_data/index/checkout.py", line 90, in _create_files
    src_fs, src_path = storage_obj.get(entry)
  File "/home/mzurad/code/testing/.venv/lib/python3.10/site-packages/dvc_data/index/index.py", line 173, in get
    raise ValueError
ValueError
2023-08-10 02:54:21,393 DEBUG: Removing '/home/mzurad/code/testing/testing-dvc/model_cloned/data/raw'
Everything is up to date.
2023-08-10 02:54:21,393 ERROR: failed to pull data from the cloud - Checkout failed for following targets:
data/raw
Is your cache up to date?
<https://error.dvc.org/missing-files>
Traceback (most recent call last):
  File "/home/mzurad/code/testing/.venv/lib/python3.10/site-packages/dvc/commands/data_sync.py", line 31, in run
    stats = self.repo.pull(
  File "/home/mzurad/code/testing/.venv/lib/python3.10/site-packages/dvc/repo/__init__.py", line 64, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/mzurad/code/testing/.venv/lib/python3.10/site-packages/dvc/repo/pull.py", line 43, in pull
    stats = self.checkout(
  File "/home/mzurad/code/testing/.venv/lib/python3.10/site-packages/dvc/repo/__init__.py", line 64, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/mzurad/code/testing/.venv/lib/python3.10/site-packages/dvc/repo/checkout.py", line 208, in checkout
    raise CheckoutError([relpath(out_path) for out_path in failed], stats)
dvc.exceptions.CheckoutError: Checkout failed for following targets:
data/raw
Is your cache up to date?
<https://error.dvc.org/missing-files>
2023-08-10 02:54:21,396 DEBUG: Analytics is enabled.
2023-08-10 02:54:21,415 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpr1kfn_ul']'
2023-08-10 02:54:21,416 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpr1kfn_ul']'
```
## Regression testing

I ran the same test for every version since `2.45.0`, using the following script (the repro steps above are assumed saved as `reproduce.sh`):
```shell
VERSIONS="2.45.0 2.45.1 2.46.0 2.47.0 2.47.1 2.47.2 2.48.0 2.49.0 2.50.0 2.51.0 2.52.0 2.53.0 2.54.0 2.55.0 2.56.0 2.57.0 2.57.1 2.57.2 2.57.3 2.58.0 2.58.1 2.58.2 3.0.0a0 3.0.0a1 3.0.0a2 3.0.0 3.1.0 3.2.0 3.2.1 3.2.2 3.2.3 3.2.4 3.3.0 3.3.1 3.4.0 3.5.0 3.5.1 3.6.0 3.7.0 3.8.0 3.8.1 3.9.0 3.9.1 3.10.0 3.10.1 3.11.0 3.11.1 3.12.0 3.13.0 3.13.1 3.13.2 3.13.3"
rm -rf logs
mkdir logs
TEST="test -e"
#test for symlinks when using external cache
#TEST="test -L"
for version in $VERSIONS; do
    rm -rf .venv
    python3 -m venv .venv > logs/$version.log 2>&1
    source .venv/bin/activate
    pip install dvc==$version >> logs/$version.log 2>&1   # append, so earlier log output is kept
    bash reproduce.sh >> logs/$version.log 2>&1
    cd testing-dvc/model_cloned/data
    ($TEST raw/gt.json && $TEST raw/images/img2.png && $TEST raw/images/img1.png) \
        && echo "${version} WORKS" \
        || echo "${version} FAILED"
    cd ../../..
done
```
This gives the following results per version:
| version | without external cache | with external cache |
|---|---|---|
| 2.45.0 | WORKS | WORKS |
| … | WORKS | WORKS |
| 2.52.0 | WORKS | WORKS |
| 2.53.0 | WORKS | FAILED |
| 2.54.0 | WORKS | FAILED |
| 2.55.0 | WORKS | FAILED |
| 2.56.0 | WORKS | FAILED |
| 2.57.0 | FAILED | WORKS |
| 2.57.1 | FAILED | FAILED |
| … | FAILED | FAILED |
| 2.58.2 | FAILED | FAILED |
| 3.0.0a0 | FAILED | WORKS |
| 3.0.0a1 | FAILED | FAILED |
| … | FAILED | FAILED |
| 3.13.2 | FAILED | FAILED |
| 3.13.3 | FAILED | FAILED |
So it seems to work only for some of the older `2.x` versions.
## Environment information

Output of `dvc doctor` on DVC version `3.13.2`:
```
$ dvc doctor
DVC version: 3.13.2 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-5.19.0-50-generic-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 2.12.2
        dvc_objects = 0.25.0
        dvc_render = 0.5.3
        dvc_task = 0.3.0
        scmrepo = 1.1.0
Supports:
        http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3)
Config:
        Global: /home/mzurad/.config/dvc
        System: /etc/xdg/xdg-ubuntu/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/ubuntu--vg-root
Caches: local
Remotes: local
Workspace directory: ext4 on /dev/mapper/ubuntu--vg-root
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/737e6c27473cb6374a1e6df03130e328
```
## About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 20 (14 by maintainers)
## Commits related to this issue
- sqlite: delete_node: don't assume the key is cached in self._ids Related https://github.com/iterative/dvc/issues/9826 — committed to iterative/sqltrie by efiop 9 months ago
- deps: bump dvc-data to >=2.18.1 To ensure people are getting a fix related to iterative/dvc#9826 — committed to iterative/dvc by efiop 9 months ago
Finally, I think I found a way to reproduce this (or something similar) internally. @iterative/dvc: you need to configure access to the AWS sandbox and then run this script:

Here's the output I see:
Works for both external and internal caches, also on the newly released `3.14`. Thanks a lot!

Can I remove all the directories inside `/var/tmp/dvc/repo`? What are they for? If they can introduce hard-to-debug bugs like this one, perhaps they should be nuked when a new version is installed, although I am not sure how that would help. At the very least, there should be docs somewhere suggesting removal of this directory.

On another note, I've been using DVC for a few months now, and my impression is that new releases happen perhaps too frequently and often break core functionality. Don't get me wrong, but shouldn't there be more focus on stability? I have already received feedback from colleagues who are quite hesitant about using DVC at all when they see these kinds of bugs.
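As a hedged workaround sketch (my own illustration, not official guidance): the directory in question is the parent of the `Repo.site_cache_dir` path that `dvc doctor` prints, and removing it forces DVC to rebuild its cached per-repo state on the next command. The demo below only exercises the pattern on a throwaway directory rather than touching `/var/tmp/dvc/repo` itself:

```shell
# Hedged sketch: clearing a stale per-repo site cache. On a real machine the
# target would be the Repo.site_cache_dir path printed by `dvc doctor`
# (e.g. /var/tmp/dvc/repo/<hash>); here we use a throwaway directory instead.
demo_cache=$(mktemp -d)
touch "$demo_cache/stale-index.db"   # stand-in for a cached data index
rm -rf "$demo_cache"                 # drop the cached per-repo state
[ ! -e "$demo_cache" ] && echo "site cache cleared"
```

After clearing the real directory, re-running `dvc pull` should rebuild the index from scratch.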
@dberenbaum As promised, I'll look into dropping the `fetch` cache in `drop_data_index` and will close it.

EDIT: doesn't seem like we need to for now. Closing.
Ok, so there were a few things here, but the main one is really https://github.com/iterative/dvc-data/pull/447/commits/99fd441e2f3fa69939a25d00eb5714e640898fe1, where we assumed that directory entries would be loaded later, but they were not (e.g. because of missing objects, missing creds, an auth error, or something else). That result was then cached in `site_cache_dir` and was not affected by `drop_data_index`, because it is not under the `tree` prefix (which it probably should drop; taking a look).

After https://github.com/iterative/dvc-data/pull/447, the repro scripts from above work just fine for me. Will take a final look in the morning.
@dberenbaum Thanks again for the repro script, that helps a lot! 🙏
Here is a simplified local version that shows the same problem and narrows it down:
So it is the index for the import that is not reloaded on the second pull after the first one failed. Looking into it…
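To make that failure mode concrete, here is a toy shell analogy (my own illustration, not DVC internals): the first pull fails while loading the import's directory entries, yet its empty view of the index is cached anyway, and the second pull trusts the stale cache instead of reloading:

```shell
# Toy analogy of stale-index reuse (illustration only, not DVC code):
# a failed operation leaves a wrong answer in a cache file, and the
# retry reads that cache instead of recomputing.
index_cache=$(mktemp)
echo "entries=MISSING" > "$index_cache"            # first pull: failure cached anyway
result=""
if grep -q "entries=MISSING" "$index_cache"; then  # second pull: trusts stale cache
  result="retry used stale index"
fi
echo "$result"
rm -f "$index_cache"
```

The fix, by analogy, is to invalidate that cache entry when the load fails rather than persisting it.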
Nice, thank you for the repro script! It looks like a similar problem to one that was recently raised here. The whole thread is long and not all related, but that comment looks similar to what you raise.
@efiop WDYT? Below I slightly adjust the script to do `dvc import --no-download` followed by `dvc pull`, which also unexpectedly fails.

I tried my hand at a minimal reproducible example with a Dockerfile. It does produce an error, but not the same one as in the original post. Let me know if I am missing anything or should move this to a new issue:
Running `docker build --progress=plain .` gives me the following output: