dvc: status: takes too long to get status
Bug Report
Description
I have DVC set up in the root of my project folder, which is at
C:\Users\raylu\Documents\Github\audit-engine
The stage file is located in
resources\WI_Ozaukee_20201103\dvc\precheck\dvc.yaml
I issue this command:
dvc status -R -v -v -v --show-json resources\WI_Ozaukee_20201103\dvc
And I expect that it will walk the subtree under
C:\Users\raylu\Documents\Github\audit-engine\resources\WI_Ozaukee_20201103\dvc
to look for dvc.yaml stage files. Instead, it appears to walk the full tree below
C:\Users\raylu\Documents\Github\audit-engine
and this takes 75 seconds (there are 112 GB of data), although that is only a hunch. As a test, we temporarily moved the .dvc folder into the folder
C:\Users\raylu\Documents\Github\audit-engine\resources\WI_Ozaukee_20201103\dvc
and the same command then took only 5.6 seconds (which is still pretty long). This should probably take only a second or two, because getting the ETags from the three S3 files is very fast and DVC needs to find only one stage file. Something seems to be wrong here.
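The expectation described above can be made concrete with a minimal sketch of a search that is restricted to the target directory. This is not DVC's collection code: the helper name find_stage_files is hypothetical, and the sketch assumes only that stage files are named dvc.yaml, as in this report.

```python
import os
from typing import Iterator


def find_stage_files(target: str, name: str = "dvc.yaml") -> Iterator[str]:
    """Yield paths of files called `name` found under `target`.

    Hypothetical helper: os.walk starts at `target`, so nothing above
    or beside that directory is ever visited.
    """
    for dirpath, _dirnames, filenames in os.walk(target):
        if name in filenames:
            yield os.path.join(dirpath, name)
```

Called on resources\WI_Ozaukee_20201103\dvc, a walk like this would touch only that subtree, so its cost would not depend on the 112 GB stored elsewhere in the project.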
Reproduce
To reproduce this, DVC must be configured with no SCM, no remote, and no cache, and dvc status must be run with -R so that it can find the dvc.yaml stage files. We have only one.
Expected
See above.
Environment information
Output of dvc doctor:
$ dvc doctor
DVC version: 2.6.4 (pip)
---------------------------------
Platform: Python 3.7.6 on Windows-10-10.0.19041-SP0
Supports:
http (requests = 2.24.0),
https (requests = 2.24.0),
s3 (s3fs = 2021.8.0, boto3 = 1.17.106)
Additional Information (if any): I will attach the profile dump and plot.
Profile Dump
https://cdn.discordapp.com/attachments/882823608949411850/884465153716920380/dump.prof
https://cdn.discordapp.com/attachments/882823608949411850/884467942111203348/image_output.png
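For anyone who wants to inspect the dump, a profile of this kind can usually be read with the standard-library pstats module, assuming it was written by cProfile. In the sketch below, the helper name top_functions is hypothetical. Profile dumps are not guaranteed to load across Python versions, so a version close to the 3.7.6 reported above may be needed.

```python
import io
import pstats


def top_functions(profile_path: str, limit: int = 20) -> str:
    """Return a text report of the `limit` costliest calls by cumulative time."""
    buffer = io.StringIO()
    stats = pstats.Stats(profile_path, stream=buffer)
    # strip_dirs() shortens file paths; "cumulative" sorts by time spent
    # in a function including everything it calls.
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()
```

With a local copy of the attachment saved as dump.prof (an assumed filename), print(top_functions("dump.prof")) would list the calls that account for most of the run time.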
About this issue
- State: closed
- Created 3 years ago
- Comments: 16 (8 by maintainers)
To clarify, the reason for the current (stage/pipeline collection) behavior is that for dvc status <target>, <target> could be either a directory containing a dvc.yaml file, or the output of some dvc.yaml file outside of <target>.
So if I had a repo with path/dvc.yaml defining a stage whose output is path/to/dir, then given the command dvc status path/to/dir, DVC still has to search the parent directories path/ and path/to/ for the correct dvc.yaml file with the output path/to/dir, instead of limiting the search to path/to/dir itself.
But I think the issue here is that when using -R/--recursive <target>, the user is explicitly telling DVC to look recursively for dvc.yaml and .dvc files inside the target path (meaning it implies that <target> is not a stage output). So we could potentially skip the parent directory search when using -R.
Would it help to decouple pipeline status (dvc stage status) from data status (dvc status), similar to how dvc add / dvc stage add were decoupled?
We have decided not to use DVC and have implemented our own similar functionality. Thanks for your time.
Yes, building the DAG takes place before any filtering, and when we build the DAG we collect all possible stages from the entire repo.
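As a rough illustration of the parent-directory search mentioned earlier in this thread, the sketch below lists the directories whose dvc.yaml would have to be checked for a non-recursive target. It is not DVC's behavior (per the comment above, DVC collects stages from the whole repo); the helper name candidate_dirs is hypothetical, and it assumes a stage only declares outputs at or below its own directory.

```python
from pathlib import PurePosixPath
from typing import List


def candidate_dirs(target: str) -> List[str]:
    """List the target and each of its parents, nearest first.

    Hypothetical helper: a dvc.yaml in any of these directories could
    either live in the target or declare the target as an output.
    """
    path = PurePosixPath(target)
    return [str(path)] + [str(parent) for parent in path.parents]
```

For the example used earlier, candidate_dirs("path/to/dir") returns ['path/to/dir', 'path/to', 'path', '.'], which is far smaller than the set of all directories in the repo.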
I think with -R we can just limit that collection to all stages inside the target dir, instead of collecting all stages in the full repo, although maybe this should be a separate flag? (cc @skshetry)
Alternatively, you can also just use .dvcignore to prevent DVC from traversing any directories that the user already knows will never contain pipeline/dvc files (to speed up the time it takes DVC to build the DAG for an entire repo).
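To show why excluding directories shortens the traversal, here is a minimal sketch of a pruned tree walk. It is not how DVC implements .dvcignore, which accepts gitignore-style patterns; this sketch matches plain directory names only, and collect_pipeline_files and its ignored_dirs parameter are hypothetical.

```python
import os
from typing import Iterable, List


def collect_pipeline_files(root: str, ignored_dirs: Iterable[str] = ()) -> List[str]:
    """Collect dvc.yaml and *.dvc files under `root`, skipping ignored directory names."""
    ignored = set(ignored_dirs)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into
        # the removed directories, so their contents are never listed.
        dirnames[:] = [d for d in dirnames if d not in ignored]
        for filename in filenames:
            if filename == "dvc.yaml" or filename.endswith(".dvc"):
                found.append(os.path.join(dirpath, filename))
    return sorted(found)
```

The saving comes from never listing the contents of the ignored directories at all, which matters most when those directories hold the bulk of the data.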