# dvc: `dvc queue`: unexpected behaviour

## Bug Report

### Description
Whilst checking out the new `dvc queue` command I have run into some unexpected behaviour. I won't duplicate the steps to reproduce here, but after queueing and running experiments I have run into two different issues:

1. VS Code demo project: `dvc queue status` returns `ERROR: Invalid experiment '{entry.stash_rev[:7]}'.` (produced when running with the extension).
2. `example-get-started`: `dvc queue status` returns

   ```
   Task     Name    Created   Status
   f3d69ee          02:17 PM  Success
   08ccb05          02:17 PM  Success
   ERROR: unexpected error - Extra data: line 1 column 56 (char 55)
   ```

   (produced without having the extension involved).
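For context on the second error: `Extra data` is the message Python's `json` module raises when asked to parse a string containing more than one JSON document back to back. One plausible (unconfirmed) cause would be two queue workers writing to the same status file concurrently; the snippet below only demonstrates how the message arises, not where DVC hits it:

```python
import json

# Two well-formed JSON objects concatenated into one string, as could
# happen if two writers append to the same file without coordination.
blob = '{"rev": "f3d69ee", "status": "Success"}{"rev": "08ccb05"}'

try:
    json.loads(blob)
except json.JSONDecodeError as exc:
    # The parser stops after the first document and reports the offset
    # of the leftover bytes.
    print(f"{exc.msg}: line {exc.lineno} column {exc.colno} (char {exc.pos})")
    # → Extra data: line 1 column 40 (char 39)
```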
In both instances this resulted in the HEAD baseline entry being dropped from the `dvc exp show` data:
`example-get-started` example:

❯ dvc exp show --show-json
{
"workspace": {
"baseline": {
"data": {
"timestamp": null,
"params": {
"params.yaml": {
"data": {
"prepare": {
"split": 0.21,
"seed": 20170428
},
"featurize": {
"max_features": 200,
"ngrams": 2
},
"train": {
"seed": 20170428,
"n_est": 50,
"min_split": 0.01
}
}
}
},
"deps": {
"data/data.xml": {
"hash": "22a1a2931c8370d3aeedd7183606fd7f",
"size": 14445097,
"nfiles": null
},
"src/prepare.py": {
"hash": "f09ea0c15980b43010257ccb9f0055e2",
"size": 1576,
"nfiles": null
},
"data/prepared": {
"hash": "153aad06d376b6595932470e459ef42a.dir",
"size": 8437363,
"nfiles": 2
},
"src/featurization.py": {
"hash": "e0265fc22f056a4b86d85c3056bc2894",
"size": 2490,
"nfiles": null
},
"data/features": {
"hash": "f35d4cc2c552ac959ae602162b8543f3.dir",
"size": 2232588,
"nfiles": 2
},
"src/train.py": {
"hash": "c3961d777cfbd7727f9fde4851896006",
"size": 967,
"nfiles": null
},
"model.pkl": {
"hash": "46865edbf3d62fc5c039dd9d2b0567a4",
"size": 1763725,
"nfiles": null
},
"src/evaluate.py": {
"hash": "44e714021a65edf881b1716e791d7f59",
"size": 2346,
"nfiles": null
}
},
"outs": {
"data/prepared": {
"hash": "153aad06d376b6595932470e459ef42a.dir",
"size": 8437363,
"nfiles": 2,
"use_cache": true,
"is_data_source": false
},
"data/features": {
"hash": "f35d4cc2c552ac959ae602162b8543f3.dir",
"size": 2232588,
"nfiles": 2,
"use_cache": true,
"is_data_source": false
},
"model.pkl": {
"hash": "46865edbf3d62fc5c039dd9d2b0567a4",
"size": 1763725,
"nfiles": null,
"use_cache": true,
"is_data_source": false
},
"data/data.xml": {
"hash": "22a1a2931c8370d3aeedd7183606fd7f",
"size": 14445097,
"nfiles": null,
"use_cache": true,
"is_data_source": true
}
},
"queued": false,
"running": false,
"executor": null,
"metrics": {
"evaluation.json": {
"data": {
"avg_prec": 0.9249974999612706,
"roc_auc": 0.9460213440787918
}
}
}
}
}
},
"f3d69eedda6b1c051b115523cf5c6c210490d0ea": {
"baseline": {
"data": {
"timestamp": "2022-07-13T14:17:20",
"params": {
"params.yaml": {
"data": {
"prepare": {
"split": 0.21,
"seed": 20170428
},
"featurize": {
"max_features": 200,
"ngrams": 2
},
"train": {
"seed": 20170428,
"n_est": 50,
"min_split": 0.01
}
}
}
},
"deps": {
"data/data.xml": {
"hash": "22a1a2931c8370d3aeedd7183606fd7f",
"size": 14445097,
"nfiles": null
},
"src/prepare.py": {
"hash": "f09ea0c15980b43010257ccb9f0055e2",
"size": 1576,
"nfiles": null
},
"data/prepared": {
"hash": "153aad06d376b6595932470e459ef42a.dir",
"size": 8437363,
"nfiles": 2
},
"src/featurization.py": {
"hash": "e0265fc22f056a4b86d85c3056bc2894",
"size": 2490,
"nfiles": null
},
"data/features": {
"hash": "f35d4cc2c552ac959ae602162b8543f3.dir",
"size": 2232588,
"nfiles": 2
},
"src/train.py": {
"hash": "c3961d777cfbd7727f9fde4851896006",
"size": 967,
"nfiles": null
},
"model.pkl": {
"hash": "46865edbf3d62fc5c039dd9d2b0567a4",
"size": 1763725,
"nfiles": null
},
"src/evaluate.py": {
"hash": "44e714021a65edf881b1716e791d7f59",
"size": 2346,
"nfiles": null
}
},
"outs": {
"data/prepared": {
"hash": "153aad06d376b6595932470e459ef42a.dir",
"size": 8437363,
"nfiles": 2,
"use_cache": true,
"is_data_source": false
},
"data/features": {
"hash": "f35d4cc2c552ac959ae602162b8543f3.dir",
"size": 2232588,
"nfiles": 2,
"use_cache": true,
"is_data_source": false
},
"model.pkl": {
"hash": "46865edbf3d62fc5c039dd9d2b0567a4",
"size": 1763725,
"nfiles": null,
"use_cache": true,
"is_data_source": false
},
"data/data.xml": {
"hash": "22a1a2931c8370d3aeedd7183606fd7f",
"size": 14445097,
"nfiles": null,
"use_cache": true,
"is_data_source": true
}
},
"queued": false,
"running": false,
"executor": null,
"metrics": {
"evaluation.json": {
"data": {
"avg_prec": 0.9249974999612706,
"roc_auc": 0.9460213440787918
}
}
}
}
}
}
}
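For anyone consuming this payload programmatically (as the VS Code extension does), each top-level key is either `workspace` or a commit SHA, and the bug here is that the HEAD baseline key goes missing. A minimal sketch of walking a hand-trimmed copy of the output above (the trimming is mine):

```python
import json

# Trimmed version of the `dvc exp show --show-json` payload above.
payload = json.loads("""
{
  "workspace": {
    "baseline": {
      "data": {
        "timestamp": null,
        "metrics": {
          "evaluation.json": {
            "data": {"avg_prec": 0.9249974999612706, "roc_auc": 0.9460213440787918}
          }
        }
      }
    }
  },
  "f3d69eedda6b1c051b115523cf5c6c210490d0ea": {
    "baseline": {
      "data": {"timestamp": "2022-07-13T14:17:20", "metrics": {}}
    }
  }
}
""")

# Every top-level key is a revision (or "workspace"); the issue reported
# here is that the HEAD baseline revision drops out of this set.
for rev, body in payload.items():
    metrics = body["baseline"]["data"].get("metrics", {})
    print(rev[:7], metrics.get("evaluation.json", {}).get("data"))
```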
### Reproduce

1. Clone `example-get-started`.
2. Add `git+https://github.com/iterative/dvc` to `src/requirements.txt`.
3. Create a venv, source the activate script, and install the requirements.
4. `dvc pull`
5. Change `params.yaml` and queue twice with `dvc exp run --queue`
6. `dvc queue start -j 2`
7. `dvc exp show`
8. `dvc queue status`
9. `dvc exp show`
When recreating this I can see that both experiments were successful in `dvc queue status`, but the second one has not made it into the table. Final results:

```
❯ dvc queue status
Task     Name    Created   Status
9d22751          02:50 PM  Success
962c834          02:50 PM  Success

Worker status: 0 active, 0 idle
```

First column of `dvc exp show`:

```
workspace
bigrams-experiment
└── 65584bd [exp-c88e8]
```

and the SHAs don't match?
### Expected

It should be possible to run `dvc exp show` and `dvc queue status` in parallel with the execution of tasks from the queue.
### Environment information

Output of `dvc doctor`:

```console
$ dvc doctor
DVC version: 2.13.1.dev87+gc2668110
---------------------------------
Platform: Python 3.8.9 on macOS-12.2.1-arm64-arm-64bit
Supports:
        webhdfs (fsspec = 2022.5.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.5.1),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.5.1)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
```
### Additional Information (if any)
Please let me know if you need anything else from me. Thank you.
## About this issue

- Original URL
- State: closed
- Created 2 years ago
- Comments: 18 (18 by maintainers)
## Commits related to this issue
- queue status: ERROR: Invalid experiment fix: #8014 > ERROR: Invalid experiment '{entry.stash_rev[:7]}'. This happens when the queue task failed from a scm error. {"exc_type": "GitMergeError", "exc_... — committed to karajan1001/dvc by karajan1001 2 years ago
- queue status: ERROR: Invalid experiment fix: #8014 > ERROR: Invalid experiment '{entry.stash_rev[:7]}'. This happens when the queue task failed from a scm error. {"exc_type": "GitMergeError", "exc_... — committed to iterative/dvc by karajan1001 2 years ago
---

Sounds like it is related to https://github.com/iterative/dvc-task/issues/73. I tried several times but didn't hit this. I guess it is not related to experiments from old versions, and is instead related to 1) concurrency and 2) checkpoints. I can repair the error message `'{entry.stash_rev[:7]}'` first to see what `stash_rev` value it is.

---

Tl;dr - I can recreate the issue by using `dvc queue start -j 2`. As `j > 1` is currently experimental we can probably close this.

---

I can definitely recreate it. I just ran into it again:
When trying to clean up experiments after getting that warning:

This will be an issue in the extension because errors generate a popup that the user sees.

Deleting `.dvc/tmp/exps` gets rid of the error altogether.

Repro steps:

- 2.11.0
- 2.15.0
- `dvc queue start -j 2`
- `dvc exp show --show-json` almost immediately after starting the queue (as per the extension)
- `dvc queue status` returns `ERROR: Invalid experiment '{entry.stash_rev[:7]}'.`

Even these repro steps are a bit hit or miss. From 3 attempts I hit the error and a missing experiment 2 times.

I can also recreate it just by using steps 4-8 (no upgrade needed). The error is probably caused by steps 5 and 6. As `j > 1` is a known issue we can probably close this.
---

Sorry, @karajan1001, I've been 100% occupied with integrating `data status`. I will get back to you soon.

---

Yes, I have previously recreated this by having completed experiments in the workspace and then upgrading the CLI, queueing a new experiment, and running the queue. The project that I've been testing with has checkpoints as well.

It should also be noted here that after upgrading the CLI there is no way back without removing `.dvc/tmp/exps`. That is definitely less of a concern than being able to seamlessly upgrade. Hope this helps 👍🏻.
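---

An aside on the `Invalid experiment '{entry.stash_rev[:7]}'` text: braces surviving un-interpolated into the output is the classic symptom of a Python format string missing its `f` prefix. This is a guess from the message alone, not from reading the DVC source; `stash_rev` here is a made-up local standing in for whatever the real code references:

```python
stash_rev = "9d22751abcdef0123456789"

# Without the f prefix, the placeholder is emitted literally.
msg_broken = "Invalid experiment '{entry.stash_rev[:7]}'."

# With the f prefix the expression is evaluated (a plain local variable
# stands in for the real `entry.stash_rev` attribute here).
msg_fixed = f"Invalid experiment '{stash_rev[:7]}'."

print(msg_broken)  # Invalid experiment '{entry.stash_rev[:7]}'.
print(msg_fixed)   # Invalid experiment '9d22751'.
```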