dvc: status: takes too long to get status

Bug Report

Description

I have DVC set up in the root of my project folder, which is at

C:\Users\raylu\Documents\Github\audit-engine

The stage file is located at

resources\WI_Ozaukee_20201103\dvc\precheck\dvc.yaml

I issue this command:

dvc status -R -v -v -v --show-json  resources\WI_Ozaukee_20201103\dvc

And I expect that it will walk the subtree under

C:\Users\raylu\Documents\Github\audit-engine\resources\WI_Ozaukee_20201103\dvc

to look for dvc.yaml stage files. Instead, it appears to walk the full tree below

C:\Users\raylu\Documents\Github\audit-engine

and this takes 75 seconds (the tree holds 112 GB of data). This is only a hunch, though. We temporarily moved the .dvc folder into

C:\Users\raylu\Documents\Github\audit-engine\resources\WI_Ozaukee_20201103\dvc

and the command then takes only 5.6 seconds, which is still rather long. It should probably take only a second or two, because fetching the ETags of the three S3 files is very fast and DVC needs to find only one stage file. Something seems wrong here.

Reproduce

To reproduce this, configure DVC with no SCM, no remote, and no cache, and pass -R to dvc status so that it can find the dvc.yaml stage files. We have only one.

Expected

See above.

Environment information

Output of dvc doctor:

$ dvc doctor

DVC version: 2.6.4 (pip)
---------------------------------
Platform: Python 3.7.6 on Windows-10-10.0.19041-SP0
Supports:
        http (requests = 2.24.0),
        https (requests = 2.24.0),
        s3 (s3fs = 2021.8.0, boto3 = 1.17.106)

Additional Information (if any): I will attach the profile dump and plot.

Profile Dump

https://cdn.discordapp.com/attachments/882823608949411850/884465153716920380/dump.prof

https://cdn.discordapp.com/attachments/882823608949411850/884467942111203348/image_output.png

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 16 (8 by maintainers)

Most upvoted comments

To clarify, the reason for the current stage/pipeline collection behavior is that in dvc status <target>, <target> can be either a directory containing a dvc.yaml file or an output of a stage defined in some dvc.yaml file outside of <target>.

So if I had a repo with path/dvc.yaml containing:

stages:
  foo:
    outs:
      - path/to/dir

Given the command dvc status path/to/dir, DVC still has to search the parent directories (path/ and path/to/) for the dvc.yaml file that declares the output path/to/dir; it cannot limit the search to path/to/dir itself.
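The parent-directory search described above can be sketched as follows. This is only an illustration of the idea, not DVC's actual implementation; the helper name candidate_stage_files and its parameters are hypothetical.

```python
from pathlib import Path
from typing import Iterator


def candidate_stage_files(target: Path, repo_root: Path) -> Iterator[Path]:
    """Yield dvc.yaml files found in the parents of `target`, nearest first.

    Hypothetical helper: the search walks upward from the target's parent
    and stops once the repository root has been checked.
    """
    target = target.resolve()
    repo_root = repo_root.resolve()
    current = target.parent
    while True:
        candidate = current / "dvc.yaml"
        if candidate.is_file():
            yield candidate
        # Stop at the repo root, or at the filesystem root as a safety net
        # in case the target is not inside the repo at all.
        if current == repo_root or current == current.parent:
            break
        current = current.parent
```

For the example repo, calling candidate_stage_files(Path("path/to/dir"), Path(".")) would yield path/dvc.yaml, which is the file that has to be parsed to learn that path/to/dir is an output of stage foo.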

But I think the issue here is that with -R/--recursive <target>, the user is explicitly telling DVC to look recursively for dvc.yaml and .dvc files inside the target path, which implies that <target> is not a stage output. So we could potentially skip the parent directory search when -R is used.

Would it help to decouple pipeline status (dvc stage status) from data status (dvc status), similar to how dvc add / dvc stage add were decoupled?

We have decided not to use DVC and have implemented our own similar functionality. Thanks for your time.

Yes, building the DAG takes place before any filtering, and when we build the DAG we collect all possible stages across the entire repo.

I think with -R we can just limit that collection to all stages inside the target dir, instead of collecting all stages in the full repo, although maybe this should be a separate flag? (cc @skshetry)
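As a rough sketch of what limiting collection to the target directory could look like, the function below walks only the subtree under the target and gathers dvc.yaml and .dvc files. It is an assumption-laden illustration, not DVC code; collect_stage_files is a hypothetical name.

```python
import os
from pathlib import Path
from typing import List


def collect_stage_files(target: Path) -> List[Path]:
    """Return dvc.yaml and *.dvc files found under `target` only.

    Hypothetical helper: nothing outside `target` is ever visited, so the
    cost depends on the size of the subtree, not on the whole repo.
    """
    found: List[Path] = []
    for dirpath, dirnames, filenames in os.walk(target):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            if name == "dvc.yaml" or name.endswith(".dvc"):
                found.append(Path(dirpath) / name)
    return found
```

In the reporter's layout, a call with the resources\WI_Ozaukee_20201103\dvc directory as the target would touch only that subtree and find the single precheck\dvc.yaml file, without walking the rest of the 112 GB tree.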


Alternatively, you can use .dvcignore to keep DVC from traversing directories that you already know will never contain pipeline (dvc.yaml) or .dvc files. This reduces the time DVC needs to build the DAG for the entire repo.
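To see why ignoring directories helps, here is a minimal sketch of a traversal that prunes ignored directories before descending into them. It assumes simple glob matching on directory names via fnmatch, which is much cruder than the gitignore-style patterns that .dvcignore uses, and walk_with_ignores is a hypothetical name rather than anything from DVC.

```python
import os
from fnmatch import fnmatch
from pathlib import Path
from typing import Iterable, List


def walk_with_ignores(root: Path, ignore_patterns: Iterable[str]) -> List[Path]:
    """Return the directories visited under `root`, skipping ignored ones.

    Simplification (assumption): patterns are matched against bare
    directory names only, not against full gitignore-style paths.
    """
    patterns = list(ignore_patterns)
    visited: List[Path] = []
    for dirpath, dirnames, _filenames in os.walk(root):
        visited.append(Path(dirpath))
        # Editing dirnames in place tells os.walk not to descend into the
        # removed entries, so an ignored subtree costs nothing to skip.
        dirnames[:] = sorted(
            d for d in dirnames if not any(fnmatch(d, p) for p in patterns)
        )
    return visited
```

The point of the sketch is that an ignored directory is dropped before its contents are listed, so a large data directory adds no traversal cost regardless of how many files it holds.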