dvc: Running `exp queue status` is very slow

Running exp queue status is very slow.

I noticed that after submitting 200 run jobs to the queue, dvc exp queue status became very slow, on the order of 5 minutes to return a result.

dvc doctor:

DVC version: 2.36.0 (pip)
---------------------------------
Platform: Python 3.10.8 on Linux-6.0.11-arch1-1-x86_64-with-glibc2.36
Subprojects:
	dvc_data = 0.28.3
	dvc_objects = 0.14.0
	dvc_render = 0.0.14
	dvc_task = 0.1.6
	dvclive = 1.1.0
	scmrepo = 0.1.4
Supports:
	azure (adlfs = 2022.10.0, knack = 0.10.0, azure-identity = 1.11.0),
	http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sdb1
Caches: local
Remotes: azure, local
Workspace directory: ext4 on /dev/sdb1
Repo: dvc, git

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 20 (9 by maintainers)

Most upvoted comments

I think it is important to provide instructions (or a command line comment) to help clean up the situation that @gregstarr described. I experienced the same issue and it made my DVC project directory unusable. I was able to recover, but I can’t remember exactly what I did. Perhaps the solution is in one of the comments above.

The main performance issue here is having too many celery message files, since we have to iterate over them for things like queue status. Garbage collection that cleans up messages which are either expired or irrelevant is now implemented in dvc-task (see linked PR).

On the DVC end, we can add something like exp clean so users can force us to clean up things we know we don’t need, but for the celery messages in particular we can also just do it automatically in the background when queue workers exit.

@gregstarr removing .dvc/tmp/exps will not remove any experiments that have already been finished (but it will remove logs for those experiments).
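
To check whether accumulated files are what is slowing things down, a rough file count under .dvc/tmp/exps can help. The following is a minimal Python sketch, not part of DVC: the helper name count_files_by_subdir is made up for illustration, and it assumes nothing about the layout inside .dvc/tmp/exps beyond the path mentioned above.

```python
import os
from collections import Counter
from pathlib import Path


def count_files_by_subdir(root):
    """Count files under root, grouped by first-level subdirectory.

    Hypothetical helper for illustration; it is not a DVC API.
    """
    root = Path(root)
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = Path(dirpath).relative_to(root)
        # Files that sit directly in root are grouped under "."
        key = rel.parts[0] if rel.parts else "."
        counts[key] += len(filenames)
    return counts


if __name__ == "__main__":
    # Path taken from the comment above; run from the repository root.
    exps_dir = Path(".dvc") / "tmp" / "exps"
    if exps_dir.is_dir():
        for name, n in count_files_by_subdir(exps_dir).most_common():
            print(f"{n:8d}  {name}")
    else:
        print(f"{exps_dir} not found; run this from the repository root")
```

The script only reads the directory tree and changes nothing. A subdirectory holding a very large number of small files would be consistent with the slowdown described above.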

I am seeing this problem as well.

I have 20-30 experiments on the commit, some queued, some failed, some finished successfully and some running. dvc exp show, dvc queue status and dvc queue logs <task> all take a long time, over 5 minutes.

Running from the command line on a Linux server (RHEL 7.4).

Not 100% sure what the storage configuration is, e.g. NAS, HDD, SSD, etc.

EDIT: just ran dvc queue status; it turns out I had 68 experiments, and it took about 10 minutes to finish.

Here is the dump: dump prof

I had to rename it to .png to get it to upload, but it was generated as you requested above. Just change the extension back to .prof.
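
For anyone who wants to look at a dump like this, here is a minimal sketch using Python's standard pstats module. It assumes the file is a cProfile-format dump; the file name dump.prof and the helper name top_cumulative are placeholders for illustration.

```python
import pstats


def top_cumulative(prof_path, limit=20):
    """Print the functions with the highest cumulative time in a profile dump.

    Assumes prof_path is a cProfile-format dump; returns the loaded Stats object.
    """
    stats = pstats.Stats(prof_path)
    # strip_dirs() shortens file paths so the table is easier to read
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return stats


if __name__ == "__main__":
    import sys

    # "dump.prof" is a placeholder name; pass the real path as an argument.
    top_cumulative(sys.argv[1] if len(sys.argv) > 1 else "dump.prof")
```

Sorting by cumulative time shows which call chains dominate the run, which is usually the quickest way to tell whether the time goes into iterating over files or somewhere else.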

Is there a way to “wipe” certain folders or files on disk and remove everything left over from queued tasks, so I can use my DVC repo again?

@behrica You could force a “wipe” by removing .dvc/tmp/exps (https://dvc.org/doc/user-guide/project-structure/internal-files#internal-directories-and-files) if you are certain that there is no important information there
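
If you would rather not delete the directory outright, one cautious variant is to move it aside first and delete the copy once the repo works again. This is a minimal sketch, not a DVC command: the helper name set_aside_exps_tmp and the exps.bak-<timestamp> backup name are made up for illustration, and it assumes that no queue workers are running while the directory is moved and that DVC ignores the leftover backup directory.

```python
import shutil
import time
from pathlib import Path


def set_aside_exps_tmp(repo_root=".", dry_run=True):
    """Move .dvc/tmp/exps to a timestamped sibling directory.

    Hypothetical helper, not a DVC API. Returns the backup path, or None if
    .dvc/tmp/exps does not exist. With dry_run=True nothing is moved.
    """
    exps_dir = Path(repo_root) / ".dvc" / "tmp" / "exps"
    if not exps_dir.is_dir():
        return None
    # The backup name is an arbitrary choice for this sketch.
    backup = exps_dir.with_name(f"exps.bak-{time.strftime('%Y%m%d-%H%M%S')}")
    if not dry_run:
        shutil.move(str(exps_dir), str(backup))
    return backup


if __name__ == "__main__":
    # Dry run by default: only report what would be moved.
    target = set_aside_exps_tmp(".", dry_run=True)
    print("nothing to move" if target is None else f"would move to {target}")
```

Call it with dry_run=False to perform the move. As noted earlier in the thread, experiments that have already finished are not stored in this directory, but their logs are, so keep the backup until you are sure you do not need them.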