vscode-dvc: exp --queue and --run-all can hang and cause future exp show runs to hang
The original Issue has since been solved, but there’s still a related problem which this Issue now tracks. See this comment for info on the still open issue.
Original issue:
I had mentioned this when working on plots, but it seems this is more of a general problem that occurs on master as well.
After running one experiment, live updating of experiments ~stops working~ is delayed for at least 5 and up to 15 checkpoints before dumping all of them at once, sometimes crashing VSCode in the process. You can see the output change as well, showing nothing until the experiment finishes and then pushing all the checkpoints at once. It almost looks like it could be a DVC issue, but I don’t know how it would be, considering it’s related to the VSCode session.
https://user-images.githubusercontent.com/9111807/134056404-8ef0d288-4184-446d-8d8c-c21f7678245b.mp4
This issue also happens when running `dvc exp show` on the CLI outside of VSCode, with slightly different symptoms on the table itself, where updates don’t happen until a bit after the command is finished
https://user-images.githubusercontent.com/9111807/134057377-608bd480-3600-4cde-9ed9-7d88dbe8f12d.mp4
I’ve also checked combinations of running from both CLI and VSCode, in both CLI -> VSCode and VSCode -> CLI. Table symptoms reflect the command run second.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 25 (22 by maintainers)
I’m not sure what makes it hang yet, but it wasn’t really designed to be spammed in this way (especially if there’s an infinite loop of more `exp show` calls being made).
So `exp show` will touch git refs in this scenario, because it works by `git fetch`ing refs from each of the temporary workspaces (where your 4 experiments are running) into the main git workspace.
What (directories/files) does the file watcher currently trigger `exp show` on? Is it possible to exclude paths from being watched? It seems like we may need to have a discussion on what exactly needs to be monitored (for the repo to be considered “changed”).
@rogermparent see https://github.com/iterative/vscode-dvc/issues/828#issuecomment-996437072 I could only recreate the issue using the queue. We might want to update the title of the issue but we should leave it open for now.
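The earlier question about excluding paths from the file watcher can be sketched as a small predicate. This is only an illustration: `shouldTriggerExpShow`, `toPosixPath` and the `IGNORED_PREFIXES` list are hypothetical names, not part of vscode-dvc, and which paths are actually safe to ignore is exactly the open question in this thread. The single prefix listed is taken from the `EXEC_CHECKPOINT` ref mentioned in this issue.

```typescript
// Hypothetical list of repo-relative prefixes whose changes should not
// trigger a refresh. The single entry is an assumption for illustration.
const IGNORED_PREFIXES: readonly string[] = [".git/refs/exps/exec/"];

// Convert Windows separators so prefix checks behave the same everywhere.
function toPosixPath(path: string): string {
  return path.replace(/\\/g, "/");
}

// Returns true when a changed (repo-relative) path should cause
// `exp show` to be re-run, false when the path is on the ignore list.
function shouldTriggerExpShow(
  changedPath: string,
  ignoredPrefixes: readonly string[] = IGNORED_PREFIXES
): boolean {
  const path = toPosixPath(changedPath);
  return !ignoredPrefixes.some((prefix) => path.indexOf(prefix) === 0);
}
```

A watcher callback could call `shouldTriggerExpShow(path)` before scheduling a refresh. Whether ignoring these refs would also hide legitimate checkpoint updates is not settled here.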
I am now seeing a similar issue where `exp show` hung for a long time, here are the steps that I went through
`exp show` began running in a loop
`exp show` is now “touching” `vscode-dvc/.git/refs/exps/93/3131de80b564742afb0795f683af6b955b327b/exp-a593b` & `vscode-dvc/.git/refs/exps/exec/EXEC_CHECKPOINT` every time it runs (even after running `dvc gc -T`); this is causing an infinite loop of `exp show` commands
still investigating
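One possible way to break a loop like this is to decide whether the repo “changed” by comparing what the refs point at, rather than reacting to every write. The sketch below assumes that “touching” means a ref file is rewritten without pointing at a different commit, which this thread does not confirm. `RefSnapshot` and `refsChanged` are hypothetical names, and reading the refs from disk is left out.

```typescript
// Maps a ref name (e.g. "refs/exps/exec/EXEC_CHECKPOINT") to the commit
// hash it points at. How the snapshot is collected is out of scope here.
type RefSnapshot = { [refName: string]: string };

// Returns true only if a ref was added, removed, or now points elsewhere.
function refsChanged(previous: RefSnapshot, current: RefSnapshot): boolean {
  const previousNames = Object.keys(previous);
  const currentNames = Object.keys(current);
  // A different number of refs means something was added or removed.
  if (previousNames.length !== currentNames.length) {
    return true;
  }
  // Same count: every current ref must exist before with the same hash.
  for (const name of currentNames) {
    if (previous[name] !== current[name]) {
      return true;
    }
  }
  return false;
}
```

With a check like this, a watcher event that leaves every ref pointing at the same commit would not schedule another `exp show`.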
I ran a `git clean -fxd` and reinstalled all dependencies before attempting that second recording, and just verified with those commands that my `.env` is up-to-date. It’s worth noting that extension dev host + dvc reaches 13/16gb of my RAM, but doesn’t seem to cross the threshold.
Great catch! It looks like `exp show` is starting alongside `exp run`, and idling until some point in the middle of `exp run` without ever failing like it does on your examples. Maybe I spoke too soon, and this could be a Linux-specific DVC issue related to commands running into each other, specifically `run` and `show`?
I ran GC quite a few times in the process, running it after that example (and a couple experiments later) only cleaned 4 experiments.
@rogermparent I cannot reproduce:
https://user-images.githubusercontent.com/37993418/134101351-be39868d-13b0-42bb-b543-5cf4f4d217b0.mov
https://user-images.githubusercontent.com/37993418/134101373-803e32ca-b78c-47c6-9e48-93460f5dd3ef.mov
When was the last time that you upgraded your `.env` (`source .env/bin/activate && pip install -r requirements.txt -U`)?
Let me try to replicate with higher epochs and non-continuation, I remember the same thing happening when I did so when trying plots, but it’s always good to confirm.
To me it looks like an issue with restarting from existing checkpoints, what happens if you increase the number of epochs to 15 or some number like that? I think that the 3 epochs happen so quickly that it seems like “nothing is sent until the experiment finishes”. What happens if you change the params so that you are forcing another experiment to start?
Here is a demo from another repo:
https://user-images.githubusercontent.com/37993418/134087833-ed91d7fc-18e2-4dc3-b4e4-cb6f601fc0da.mov
We still have issues where commands are running into each other so things can be slow to update but live updates are not broken.
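One common way to keep refresh commands from piling up is to coalesce requests: at most one `exp show` runs at a time, and any requests that arrive in the meantime collapse into a single follow-up run. The sketch below is a minimal, synchronous state machine for that idea. `RefreshGate` and its methods are hypothetical and not taken from vscode-dvc or DVC, and it only limits how many `exp show` processes overlap; it does nothing about contention with `exp run`.

```typescript
// Tracks whether a refresh is in flight and whether another was requested
// while it ran. The caller is responsible for actually spawning the command.
class RefreshGate {
  private running = false;
  private pending = false;

  // Call when a change is detected. Returns true if the caller should
  // start a refresh now, false if one is already running.
  request(): boolean {
    if (this.running) {
      this.pending = true; // remember that something changed meanwhile
      return false;
    }
    this.running = true;
    return true;
  }

  // Call when the running refresh exits. Returns true if the caller should
  // immediately start one more refresh to cover the changes it missed.
  finish(): boolean {
    if (this.pending) {
      this.pending = false;
      return true; // gate stays in the running state for the follow-up
    }
    this.running = false;
    return false;
  }
}
```

On its own this does not stop a loop in which each run triggers the next; it only keeps runs from overlapping, so it would likely need to be combined with some way of ignoring self-induced changes.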