dvc: `queue`: checkpoint experiments hanging

Bug Report

Description

There seem to be some lock-based issues when running checkpoint experiments from the queue with the VS Code extension. This happens even when a single worker is processing the queue. As the worker is finishing an experiment (after the last checkpoint has been written) or starting a new experiment (before a checkpoint has been written) dvc exp show -n 3 --json takes the .dvc/tmp/exps/rwlock.lock. The exp show process hangs for an indefinite amount of time and the queue worker does not return for a very long time. This situation only seems to be resolved by killing the exp show process.

Reproduce

  1. Open vscode-dvc-demo inside VS Code with the extension installed.
  2. Queue several experiments (5 seems to be enough)
  3. Start the experiments queue
  4. Queue will hang (likely after the first experiment has completed)

Expected

Queue should happily process tasks without hanging.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.38.1 (pip)
---------------------------------
Platform: Python 3.10.6 on macOS-13.1-arm64-arm-64bit
Subprojects:
        dvc_data = 0.28.4
        dvc_objects = 0.14.0
        dvc_render = 0.0.15
        dvc_task = 0.1.8
        dvclive = 1.3.2
        scmrepo = 0.1.4
Supports:
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2022.11.0, boto3 = 1.24.59)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git

Additional Information (if any):

Logs from the worker
[2023-01-04 21:00:37,179: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'
 
 -------------- dvc-exp-bf589e-1@localhost v5.2.7 (dawn-chorus)
--- ***** ----- 
-- ******* ---- macOS-13.1-arm64-arm-64bit 2023-01-04 21:00:37
- *** --- * --- 
- ** ---------- [config]
- ** ---------- .> app:         dvc-exp-local:0x10721eda0
- ** ---------- .> transport:   filesystem://localhost//
- ** ---------- .> results:     file:///Users/mattseddon/projects/vscode-dvc/demo/.dvc/tmp/exps/celery/result
- *** --- * --- .> concurrency: 1 (thread)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** ----- 
 -------------- [queues]
                .> celery           exchange=celery(direct) key=celery
                

[tasks]
  . dvc.repo.experiments.queue.tasks.cleanup_exp
  . dvc.repo.experiments.queue.tasks.collect_exp
  . dvc.repo.experiments.queue.tasks.run_exp
  . dvc.repo.experiments.queue.tasks.setup_exp
  . dvc_task.proc.tasks.run

[2023-01-04 21:00:37,195: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'
[2023-01-04 21:00:37,195: INFO/MainProcess] Connected to filesystem://localhost//
[2023-01-04 21:00:37,197: INFO/MainProcess] dvc-exp-bf589e-1@localhost ready.
[2023-01-04 21:00:37,197: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[67e8b5cd-0aa2-41cb-ac4f-fd6abf0498ab] received
[2023-01-04 21:00:38,318: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[67e8b5cd-0aa2-41cb-ac4f-fd6abf0498ab] succeeded in 1.1205608329999848s: None
[2023-01-04 21:00:39,208: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[d2412fb4-f676-4da7-bb8f-30eaa86e8821] received
[2023-01-04 21:00:39,836: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[d2412fb4-f676-4da7-bb8f-30eaa86e8821] succeeded in 0.6271973750000086s: None
[2023-01-04 21:00:40,078: INFO/MainProcess] monitor: watching celery worker 'dvc-exp-bf589e-1@localhost'
[2023-01-04 21:00:54,120: INFO/MainProcess] monitor: shutting down due to empty queue.
[2023-01-04 21:00:54,122: INFO/MainProcess] monitor: done
[2023-01-04 21:00:54,287: WARNING/MainProcess] Got shutdown from remote
[2023-01-04 21:01:15,471: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'
 
 -------------- dvc-exp-bf589e-1@localhost v5.2.7 (dawn-chorus)
--- ***** ----- 
-- ******* ---- macOS-13.1-arm64-arm-64bit 2023-01-04 21:01:15
- *** --- * --- 
- ** ---------- [config]
- ** ---------- .> app:         dvc-exp-local:0x10342a470
- ** ---------- .> transport:   filesystem://localhost//
- ** ---------- .> results:     file:///Users/mattseddon/projects/vscode-dvc/demo/.dvc/tmp/exps/celery/result
- *** --- * --- .> concurrency: 1 (thread)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** ----- 
 -------------- [queues]
                .> celery           exchange=celery(direct) key=celery
                

[tasks]
  . dvc.repo.experiments.queue.tasks.cleanup_exp
  . dvc.repo.experiments.queue.tasks.collect_exp
  . dvc.repo.experiments.queue.tasks.run_exp
  . dvc.repo.experiments.queue.tasks.setup_exp
  . dvc_task.proc.tasks.run

[2023-01-04 21:01:15,488: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'
[2023-01-04 21:01:15,488: INFO/MainProcess] Connected to filesystem://localhost//
[2023-01-04 21:01:15,490: INFO/MainProcess] dvc-exp-bf589e-1@localhost ready.
[2023-01-04 21:01:15,491: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[0291724f-f8ea-49d7-9eb8-0ea014fb28db] received
[2023-01-04 21:01:16,363: INFO/MainProcess] monitor: watching celery worker 'dvc-exp-bf589e-1@localhost'
[2023-01-04 21:01:34,616: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[0291724f-f8ea-49d7-9eb8-0ea014fb28db] succeeded in 19.124719583s: None
[2023-01-04 21:01:36,592: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[bdcb86ff-40c2-40ec-95e1-9fe950503a4a] received
ERROR: Failed to collect experiment - 'refs/exps/exec/EXEC_CHECKPOINT' is busy, it is being blocked by:
  (PID 4390): /Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/__main__.py exp show -n 3 --json
  (PID 4390): /Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/__main__.py exp show -n 3 --json

If there are no processes with such PIDs, you can manually remove '/Users/mattseddon/projects/vscode-dvc/demo/.dvc/tmp/exps/rwlock' and try again.
[2023-01-04 21:04:54,423: ERROR/MainProcess] Failed to collect experiment
Traceback (most recent call last):
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/repo/experiments/queue/tasks.py", line 70, in collect_exp
    BaseStashQueue.collect_executor(
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/repo/experiments/queue/base.py", line 657, in collect_executor
    results = cls.collect_git(exp, executor, exec_result)
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/funcy/decorators.py", line 45, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/funcy/flow.py", line 127, in retry
    return call()
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/funcy/decorators.py", line 66, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/repo/experiments/queue/base.py", line 635, in collect_git
    with get_exp_rwlock(exp.repo, writes=refs):
  File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/rwlock.py", line 209, in rwlock
    _check_blockers(
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/rwlock.py", line 123, in _check_blockers
    raise LockError(
dvc.lock.LockError: 'refs/exps/exec/EXEC_CHECKPOINT' is busy, it is being blocked by:
  (PID 4390): /Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/__main__.py exp show -n 3 --json
  (PID 4390): /Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/__main__.py exp show -n 3 --json

If there are no processes with such PIDs, you can manually remove '/Users/mattseddon/projects/vscode-dvc/demo/.dvc/tmp/exps/rwlock' and try again.
[2023-01-04 21:04:54,438: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[bdcb86ff-40c2-40ec-95e1-9fe950503a4a] succeeded in 197.84534658300004s: None
[2023-01-04 21:05:00,774: INFO/MainProcess] monitor: shutting down due to empty queue.
[2023-01-04 21:05:00,774: INFO/MainProcess] monitor: done
[2023-01-04 21:05:01,494: WARNING/MainProcess] Got shutdown from remote
[2023-01-05 08:01:23,854: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'
 
 -------------- dvc-exp-bf589e-1@localhost v5.2.7 (dawn-chorus)
--- ***** ----- 
-- ******* ---- macOS-13.1-arm64-arm-64bit 2023-01-05 08:01:23
- *** --- * --- 
- ** ---------- [config]
- ** ---------- .> app:         dvc-exp-local:0x10543e440
- ** ---------- .> transport:   filesystem://localhost//
- ** ---------- .> results:     file:///Users/mattseddon/projects/vscode-dvc/demo/.dvc/tmp/exps/celery/result
- *** --- * --- .> concurrency: 1 (thread)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** ----- 
 -------------- [queues]
                .> celery           exchange=celery(direct) key=celery
                

[tasks]
  . dvc.repo.experiments.queue.tasks.cleanup_exp
  . dvc.repo.experiments.queue.tasks.collect_exp
  . dvc.repo.experiments.queue.tasks.run_exp
  . dvc.repo.experiments.queue.tasks.setup_exp
  . dvc_task.proc.tasks.run

[2023-01-05 08:01:23,872: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'
[2023-01-05 08:01:23,872: INFO/MainProcess] Connected to filesystem://localhost//
[2023-01-05 08:01:23,874: INFO/MainProcess] dvc-exp-bf589e-1@localhost ready.
[2023-01-05 08:01:23,874: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[b01ffc27-ec1d-45a8-b937-f18095bcce0c] received
[2023-01-05 08:01:24,761: INFO/MainProcess] monitor: watching celery worker 'dvc-exp-bf589e-1@localhost'
ERROR: Failed to collect experiment - 'refs/exps/da/d0c6a05fa0a7aeea1f8686fc4f6c2c1b316b97/swish-prob' is busy, it is being blocked by:
  (PID 10233): /Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/__main__.py exp show -n 3 --json
  (PID 10233): /Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/__main__.py exp show -n 3 --json

If there are no processes with such PIDs, you can manually remove '/Users/mattseddon/projects/vscode-dvc/demo/.dvc/tmp/exps/rwlock' and try again.
[2023-01-05 08:04:42,606: ERROR/MainProcess] Failed to collect experiment
Traceback (most recent call last):
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/repo/experiments/queue/tasks.py", line 70, in collect_exp
    BaseStashQueue.collect_executor(
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/repo/experiments/queue/base.py", line 657, in collect_executor
    results = cls.collect_git(exp, executor, exec_result)
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/funcy/decorators.py", line 45, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/funcy/flow.py", line 127, in retry
    return call()
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/funcy/decorators.py", line 66, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/repo/experiments/queue/base.py", line 635, in collect_git
    with get_exp_rwlock(exp.repo, writes=refs):
  File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/rwlock.py", line 209, in rwlock
    _check_blockers(
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/rwlock.py", line 123, in _check_blockers
    raise LockError(
dvc.lock.LockError: 'refs/exps/da/d0c6a05fa0a7aeea1f8686fc4f6c2c1b316b97/swish-prob' is busy, it is being blocked by:
  (PID 10233): /Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/__main__.py exp show -n 3 --json
  (PID 10233): /Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/__main__.py exp show -n 3 --json

If there are no processes with such PIDs, you can manually remove '/Users/mattseddon/projects/vscode-dvc/demo/.dvc/tmp/exps/rwlock' and try again.
[2023-01-05 08:04:42,624: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[b01ffc27-ec1d-45a8-b937-f18095bcce0c] succeeded in 198.75149029099975s: None
[2023-01-05 08:04:43,772: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[534d0af0-126e-4666-872e-318cfe0518c9] received
WARNING: Process '/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/__main__.py exp show -n 3 --json' with (Pid 10233), in RWLock-file 'refs/exps/exec/EXEC_CHECKPOINT' had been killed. Auto remove it from the lock file.
[2023-01-05 08:06:01,764: WARNING/MainProcess] Process '/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/__main__.py exp show -n 3 --json' with (Pid 10233), in RWLock-file 'refs/exps/exec/EXEC_CHECKPOINT' had been killed. Auto remove it from the lock file.
[2023-01-05 08:06:01,820: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[534d0af0-126e-4666-872e-318cfe0518c9] succeeded in 78.04864704200008s: None
[2023-01-05 08:06:02,117: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[d471d22e-a3dd-4d38-99ce-b0ca28d65a95] received
WARNING: Process '/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/__main__.py exp show -n 3 --json' with (Pid 10666), in RWLock-file 'refs/exps/exec/EXEC_CHECKPOINT' had been killed. Auto remove it from the lock file.
[2023-01-05 08:06:44,026: WARNING/MainProcess] Process '/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/__main__.py exp show -n 3 --json' with (Pid 10666), in RWLock-file 'refs/exps/exec/EXEC_CHECKPOINT' had been killed. Auto remove it from the lock file.
[2023-01-05 08:07:13,558: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[d471d22e-a3dd-4d38-99ce-b0ca28d65a95] succeeded in 71.40953562499999s: None
[2023-01-05 08:07:29,496: INFO/MainProcess] monitor: shutting down due to empty queue.
[2023-01-05 08:07:29,498: INFO/MainProcess] monitor: done
[2023-01-05 08:07:29,573: WARNING/MainProcess] Got shutdown from remote
[2023-01-05 13:47:22,411: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'
 
 -------------- dvc-exp-bf589e-1@localhost v5.2.7 (dawn-chorus)
--- ***** ----- 
-- ******* ---- macOS-13.1-arm64-arm-64bit 2023-01-05 13:47:22
- *** --- * --- 
- ** ---------- [config]
- ** ---------- .> app:         dvc-exp-local:0x1114c2ef0
- ** ---------- .> transport:   filesystem://localhost//
- ** ---------- .> results:     file:///Users/mattseddon/projects/vscode-dvc/demo/.dvc/tmp/exps/celery/result
- *** --- * --- .> concurrency: 1 (thread)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** ----- 
 -------------- [queues]
                .> celery           exchange=celery(direct) key=celery
                

[tasks]
  . dvc.repo.experiments.queue.tasks.cleanup_exp
  . dvc.repo.experiments.queue.tasks.collect_exp
  . dvc.repo.experiments.queue.tasks.run_exp
  . dvc.repo.experiments.queue.tasks.setup_exp
  . dvc_task.proc.tasks.run

[2023-01-05 13:47:22,427: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'
[2023-01-05 13:47:22,427: INFO/MainProcess] Connected to filesystem://localhost//
[2023-01-05 13:47:22,429: INFO/MainProcess] dvc-exp-bf589e-1@localhost ready.
[2023-01-05 13:47:22,430: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[65c7fff6-73f1-47bf-a327-eaa6c1e452cd] received
[2023-01-05 13:47:23,289: INFO/MainProcess] monitor: watching celery worker 'dvc-exp-bf589e-1@localhost'
[2023-01-05 13:47:23,578: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[65c7fff6-73f1-47bf-a327-eaa6c1e452cd] succeeded in 1.147935499997402s: None
[2023-01-05 13:47:25,446: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[c9c5e68b-dc7f-429e-b988-6c22398cf6b3] received
[2023-01-05 13:47:26,149: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[c9c5e68b-dc7f-429e-b988-6c22398cf6b3] succeeded in 0.7021251249971101s: None
[2023-01-05 13:47:27,455: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[baeb43c9-59de-4d6d-a4a5-cbeb2732e7da] received
[2023-01-05 13:47:28,116: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[baeb43c9-59de-4d6d-a4a5-cbeb2732e7da] succeeded in 0.660743292002735s: None
[2023-01-05 13:47:29,466: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[71440378-22f8-4738-b71f-352190f5c26e] received
[2023-01-05 13:47:30,089: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[71440378-22f8-4738-b71f-352190f5c26e] succeeded in 0.622354916002223s: None
[2023-01-05 13:47:31,471: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[f2ccfa0d-bf9b-4196-9e82-ef97a71151e7] received
[2023-01-05 13:47:32,135: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[f2ccfa0d-bf9b-4196-9e82-ef97a71151e7] succeeded in 0.6635162919992581s: None
[2023-01-05 13:47:37,323: INFO/MainProcess] monitor: shutting down due to empty queue.
[2023-01-05 13:47:37,324: INFO/MainProcess] monitor: done
[2023-01-05 13:47:37,503: WARNING/MainProcess] Got shutdown from remote
[2023-01-05 13:50:15,880: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'
 
 -------------- dvc-exp-bf589e-1@localhost v5.2.7 (dawn-chorus)
--- ***** ----- 
-- ******* ---- macOS-13.1-arm64-arm-64bit 2023-01-05 13:50:15
- *** --- * --- 
- ** ---------- [config]
- ** ---------- .> app:         dvc-exp-local:0x1079ca470
- ** ---------- .> transport:   filesystem://localhost//
- ** ---------- .> results:     file:///Users/mattseddon/projects/vscode-dvc/demo/.dvc/tmp/exps/celery/result
- *** --- * --- .> concurrency: 1 (thread)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** ----- 
 -------------- [queues]
                .> celery           exchange=celery(direct) key=celery
                

[tasks]
  . dvc.repo.experiments.queue.tasks.cleanup_exp
  . dvc.repo.experiments.queue.tasks.collect_exp
  . dvc.repo.experiments.queue.tasks.run_exp
  . dvc.repo.experiments.queue.tasks.setup_exp
  . dvc_task.proc.tasks.run

[2023-01-05 13:50:15,897: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'
[2023-01-05 13:50:15,897: INFO/MainProcess] Connected to filesystem://localhost//
[2023-01-05 13:50:15,899: INFO/MainProcess] dvc-exp-bf589e-1@localhost ready.
[2023-01-05 13:50:15,900: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[aae4dbad-b620-4d2d-ada8-2c777289127a] received
[2023-01-05 13:50:16,781: INFO/MainProcess] monitor: watching celery worker 'dvc-exp-bf589e-1@localhost'
[2023-01-05 13:50:35,334: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[aae4dbad-b620-4d2d-ada8-2c777289127a] succeeded in 19.43459495800198s: None
[2023-01-05 13:50:36,992: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[d167368c-42e1-46ae-adcf-74575c316774] received

worker: Warm shutdown (MainProcess)
[2023-01-05 13:52:52,056: INFO/MainProcess] monitor: done
ERROR: Failed to collect experiment - 'refs/exps/da/d0c6a05fa0a7aeea1f8686fc4f6c2c1b316b97/mealy-syce' is busy, it is being blocked by:
  (PID 43657): /Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/__main__.py exp show -n 3 --json
  (PID 43657): /Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/__main__.py exp show -n 3 --json

If there are no processes with such PIDs, you can manually remove '/Users/mattseddon/projects/vscode-dvc/demo/.dvc/tmp/exps/rwlock' and try again.
[2023-01-05 13:53:54,773: ERROR/MainProcess] Failed to collect experiment
Traceback (most recent call last):
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/repo/experiments/queue/tasks.py", line 70, in collect_exp
    BaseStashQueue.collect_executor(
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/repo/experiments/queue/base.py", line 657, in collect_executor
    results = cls.collect_git(exp, executor, exec_result)
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/funcy/decorators.py", line 45, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/funcy/flow.py", line 127, in retry
    return call()
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/funcy/decorators.py", line 66, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/repo/experiments/queue/base.py", line 635, in collect_git
    with get_exp_rwlock(exp.repo, writes=refs):
  File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/rwlock.py", line 209, in rwlock
    _check_blockers(
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/rwlock.py", line 123, in _check_blockers
    raise LockError(
dvc.lock.LockError: 'refs/exps/da/d0c6a05fa0a7aeea1f8686fc4f6c2c1b316b97/mealy-syce' is busy, it is being blocked by:
  (PID 43657): /Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/__main__.py exp show -n 3 --json
  (PID 43657): /Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.10/site-packages/dvc/__main__.py exp show -n 3 --json

If there are no processes with such PIDs, you can manually remove '/Users/mattseddon/projects/vscode-dvc/demo/.dvc/tmp/exps/rwlock' and try again.
[2023-01-05 13:53:54,788: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[d167368c-42e1-46ae-adcf-74575c316774] succeeded in 197.79483308299677s: None

We were looking to release some new queue related features this week. We will still ship the code but delay making any public announcements until after we’ve worked out these bugs.

Please LMK if you need any more information. 🙏🏻.

Note: dvc queue start -j X also seems to take a very long amount of time to return through a non-tty.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 2
  • Comments: 15 (15 by maintainers)

Commits related to this issue

Most upvoted comments

The problem

Repository: those with a complex history, and a large number of objects in the Git repo. When: using vscode extension and queue worker running. Phenomenon:

  1. dvc exp show takes dozens of seconds to return.
  2. queue worker may fail because of the repository’s exp rwlock being locked.

Analysis

For now, in the DVC, the exp rwlock only added when:

  1. We add a new experiment to the queue.
  2. During the initialization and the collection of running experiments.
  3. Exp show when we fetch the running experiments to our local workspace

For phenomenon 1 from https://github.com/iterative/dvc/issues/8762#issuecomment-1371754861 above generated by Matt, the problem is that the fetching progress spend 30 secs. This time cost is because of the performance of dulwich but even this problem also happens in fetch progress but unlike in #8477, where the time cost mostly happens when we were sending packages, here the time cost mostly happens when we check_diverge. And this might be the reason that why git gc can solve #8477 but couldn’t solve the problem this time. Here I guess it’s because there might some loops in the history.

Phenomenon 2 is caused by phenomenon 1. Because the vscode extension is continuously querying dvc exp show and it locks the repository for a long time, it will block the initialization and the collection of running experiments. And finally causes a lock acquiring timeout error.

Solutions.

The shortest and simplest answer is that we can directly remove the check_diverge, this solution can be used as a temporary solution for the vscode team to prevent this problem.

The second solution is that we follow #8477 and replace Dulwich here with pygit, The good part is that one merge solves three problems (this #8763, #8477, #8762), but pygit may also bring some new problems.

The third choice is that in dvc exp show we do not fetch results locally but instead directly gather experiments remotely. The good part is that we can also make https://github.com/iterative/dvc/issues/8478#issuecomment-1362395268 possible and solve #8478. The bad part is that #8477 still exists and this remote collection is hard to do if we run experiments in some remote server.

Without researching it, do you expect the cause is the same for https://github.com/iterative/dvc/issues/8763? What is the timeout, and what would be the impact of making it longer (are there downsides)?

Need more investigation, I didn’t know the details of Dulwich, from the error in #8763, it’s a concurrency problem that happens when we create multi workers and do simultaneous operations on the same repository.

Will dulwich hit this error on its own even if we don’t check for it? Can we deal with it then instead of checking beforehand?

First, this code is actually copied from Dulwich itself. I’m not sure what will happen if we remove this.

https://github.com/jelmer/dulwich/blob/a804738d890f0e63618119bc92f257cd3644a2f5/dulwich/porcelain.py#L1303

Because the use case is really simple during the tmp/queue executor. ( We just clone the project, create some new commits, and fetch it back.) So I think it’s safety to remove it.

Because the use case is really simple during the tmp/queue executor. ( We just clone the project, create some new commits, and fetch it back.) So I think it’s safety to remove it.

Great @karajan1001! In that case, let’s start here and follow up with the other suggestions in #8787.

Here’s a reproduction video of what I see:

https://user-images.githubusercontent.com/2308172/211074915-be97b24d-e6fb-4396-9851-7e959ca1dc25.mov

A few takeaways:

  1. dvc exp show hanging always happens at the start of the 2nd experiment, but it hangs for less than 1 minute.
  2. The 2nd experiment starts even while dvc exp show is hung and the logs show that it runs to completion, but the queue stays stuck thinking this 2nd experiment is still running, and dvc exp show does not show any of the checkpoints.
  3. Everything works fine when I run the same workflow without VS Code, so this seems to be a problem generated by the calls being made by VS Code. Edit: If I use watch dvc exp show, I can reproduce the problem, but not when running a typical CLI workflow.
  4. I have tried adding sleep calls to the training code see if it’s just too fast, but that doesn’t seem to help.