vscode-dvc: exp --queue and --run-all can hang and cause future exp show runs to hang

The original Issue has since been solved, but there’s still a related problem which this Issue now tracks. See this comment for info on the still open issue.


Original issue:

I had mentioned this when working on plots, but it seems this is more of a general problem that occurs on master as well.

After running one experiment, live updating of experiments is delayed for at least 5 and up to 15 checkpoints before dumping all of them at once, sometimes crashing VSCode in the process. You can see the output change as well, showing nothing until the experiment finishes and then pushing all the checkpoints at once. It almost looks like it could be a dvc issue, but I don’t know how it would be, considering it’s related to the vscode session.

https://user-images.githubusercontent.com/9111807/134056404-8ef0d288-4184-446d-8d8c-c21f7678245b.mp4

This issue also happens when running dvc exp show on the CLI outside of VSCode, with slightly different symptoms on the table itself, where updates don’t happen until a bit after the command is finished.

https://user-images.githubusercontent.com/9111807/134057377-608bd480-3600-4cde-9ed9-7d88dbe8f12d.mp4

I’ve also checked combinations of running from both the CLI and VSCode, in both CLI -> VSCode and VSCode -> CLI order. The table symptoms reflect whichever command was run second.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 25 (22 by maintainers)

Most upvoted comments

@pmrowla we can (should) definitely discuss that. Any ideas on why exp show is hanging?

I’m not sure what makes it hang yet, but it wasn’t really designed to be spammed in this way (especially if there’s an infinite loop of more exp show calls being made).
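To illustrate what "not being spammed" could look like on the caller's side, here is a minimal single-threaded sketch of coalescing refresh requests. It is not how the extension or DVC actually schedules exp show; CoalescingRunner and its task callback are hypothetical names. Requests that arrive while a run is in flight are collapsed into at most one follow-up run.

```python
from typing import Callable


class CoalescingRunner:
    """Run `task` one at a time; requests made mid-run collapse into one rerun.

    Hypothetical helper for illustration only. `task` stands in for
    whatever invokes `dvc exp show`.
    """

    def __init__(self, task: Callable[[], None]) -> None:
        self._task = task
        self._running = False
        self._pending = False

    def request(self) -> None:
        # A run is already in flight: remember that a refresh was asked for
        # and return immediately instead of starting another run.
        if self._running:
            self._pending = True
            return

        self._running = True
        try:
            while True:
                self._pending = False
                self._task()
                # Rerun once only if something asked for a refresh mid-run.
                if not self._pending:
                    break
        finally:
            self._running = False
```

A burst of N requests during one run costs one extra run rather than N. Note that this alone would not stop a loop in which every run retriggers itself; it only bounds how many runs can pile up.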

I am now seeing a similar issue where exp show hung for a long time, here are the steps that I went through

  1. my repo is now in this weird state where exp show is “touching” vscode-dvc/.git/refs/exps/93/3131de80b564742afb0795f683af6b955b327b/exp-a593b & vscode-dvc/.git/refs/exps/exec/EXEC_CHECKPOINT every time it runs (even after running dvc gc -T). This is causing an infinite loop of exp show commands.

Note: At some point in the early stages of fighting against this bug I got the repo into a state where dvc exp show --show-json was touching .git/refs/exps/exec/EXEC_CHECKPOINT and another ref, which triggered the file watcher that causes dvc exp show --show-json to run, and my computer nearly melted.

So exp show will touch git refs in this scenario, because it works by git fetching refs from each of the temporary workspaces (where your 4 experiments are running) into the main git workspace.

What (directories/files) does the file watcher currently trigger exp show on? Is it possible to exclude paths from being watched? It seems like we may need to have a discussion on what exactly needs to be monitored (for the repo to be considered “changed”).
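As a starting point for that discussion, here is a rough sketch of what excluding paths could look like. Which paths should actually be excluded is exactly the open question, so the prefix list below is an assumption, and should_refresh is a hypothetical helper, not the extension's real watcher logic.

```python
from pathlib import PurePosixPath
from typing import Sequence

# Assumption for illustration: treat changes under the executor refs as
# side effects of exp show itself rather than as a "changed" repo.
IGNORED_PREFIXES = (".git/refs/exps/exec",)


def should_refresh(
    changed_path: str, ignored: Sequence[str] = IGNORED_PREFIXES
) -> bool:
    """Return False if `changed_path` lies under any ignored prefix.

    Paths are repo-relative and use forward slashes. The comparison is
    done per path component, so `.git/refs/exps/executor` is not treated
    as being inside `.git/refs/exps/exec`.
    """
    parts = PurePosixPath(changed_path).parts
    for prefix in ignored:
        prefix_parts = PurePosixPath(prefix).parts
        if parts[: len(prefix_parts)] == prefix_parts:
            return False
    return True
```

With this list, a change to .git/refs/exps/exec/EXEC_CHECKPOINT would not trigger a refresh, while changes to other experiment refs still would. Whether those other refs also need handling depends on what exp show touches in practice.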

@rogermparent see https://github.com/iterative/vscode-dvc/issues/828#issuecomment-996437072. I could only recreate the issue using the queue. We might want to update the title of the issue, but we should leave it open for now.

I am now seeing a similar issue where exp show hung for a long time, here are the steps that I went through

  1. queue 4 experiments
  2. run all experiments
  3. after first experiment completed exp show began to hang
  4. after killing the session exp show began running in a loop
  5. my repo is now in this weird state where exp show is “touching” vscode-dvc/.git/refs/exps/93/3131de80b564742afb0795f683af6b955b327b/exp-a593b & vscode-dvc/.git/refs/exps/exec/EXEC_CHECKPOINT every time it runs (even after running dvc gc -T). This is causing an infinite loop of exp show commands.

still investigating

When was the last time that you upgraded your .env (source .env/bin/activate && pip install -r requirements.txt -U)?

I ran a git clean -fxd and reinstalled all dependencies before attempting that second recording, and just verified with those commands that my .env is up-to-date. It’s worth noting that extension dev host + dvc reaches 13 of my 16 GB of RAM, but doesn’t seem to cross the threshold.

wait a minute, HOW is exp show taking 121s to run?

When was the last time you garbage collected experiments… even then WAT! How long does that action take in the CLI?
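For anyone wanting to answer the timing question outside the extension, a small sketch like the following could be used to measure a command's wall-clock time. time_command is a hypothetical helper; it assumes the command is on PATH and discards the command's output.

```python
import subprocess
import time
from typing import Sequence


def time_command(cmd: Sequence[str]) -> float:
    """Run `cmd` to completion and return the elapsed wall-clock seconds.

    Output is discarded and a non-zero exit code is not treated as an
    error, since only the duration is of interest here.
    """
    start = time.perf_counter()
    subprocess.run(
        cmd,
        check=False,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start


# Example (assumes dvc is installed and this runs inside a DVC repo):
# print(f"{time_command(['dvc', 'exp', 'show', '--show-json']):.1f}s")
```

Running it once while an experiment is in progress and once while idle would show whether the long duration comes from exp show itself or from it waiting on exp run.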

Great catch! It looks like exp show is starting alongside exp run, and idling until some point in the middle of exp run without ever failing like it does in your examples. Maybe I spoke too soon, and this could be a Linux-specific DVC issue related to commands running into each other, specifically run and show?

I ran GC quite a few times in the process, running it after that example (and a couple experiments later) only cleaned 4 experiments.

Let me try to replicate with higher epochs and non-continuation, I remember the same thing happening when I did so when trying plots, but it’s always good to confirm.

To me it looks like an issue with restarting from existing checkpoints, what happens if you increase the number of epochs to 15 or some number like that? I think that the 3 epochs happen so quickly that it seems like “nothing is sent until the experiment finishes”. What happens if you change the params so that you are forcing another experiment to start?

Here is a demo from another repo:

https://user-images.githubusercontent.com/37993418/134087833-ed91d7fc-18e2-4dc3-b4e4-cb6f601fc0da.mov

We still have issues where commands are running into each other so things can be slow to update but live updates are not broken.