dvc: `dvc queue`: unexpected behaviour

Bug Report

Description

Whilst checking out the new dvc queue command I ran into some unexpected behaviour. I won't duplicate the steps to reproduce here, but after queueing and running experiments I hit two different issues:

  1. VS Code demo project: dvc queue status returned ERROR: Invalid experiment '{entry.stash_rev[:7]}'. (produced when running with the extension).
  2. example-get-started: dvc queue status returned the following (produced without the extension involved):

Task     Name    Created    Status
f3d69ee          02:17 PM   Success
08ccb05          02:17 PM   Success

ERROR: unexpected error - Extra data: line 1 column 56 (char 55)

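As an aside, the literal {entry.stash_rev[:7]} in the first error looks like an f-string missing its f prefix somewhere in the error path. A minimal sketch of that failure mode, purely as an assumption (QueueEntry here is a hypothetical stand-in, not the actual DVC class):

from dataclasses import dataclass

@dataclass
class QueueEntry:
    stash_rev: str  # hypothetical stand-in for the real queue entry type

entry = QueueEntry(stash_rev="f3d69eedda6b1c051b115523cf5c6c210490d0ea")

# Without the f prefix the placeholder is never interpolated:
print("Invalid experiment '{entry.stash_rev[:7]}'.")
# -> Invalid experiment '{entry.stash_rev[:7]}'.

# Presumably intended behaviour:
print(f"Invalid experiment '{entry.stash_rev[:7]}'.")
# -> Invalid experiment 'f3d69ee'.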

In both instances this resulted in the HEAD baseline entry being dropped from the exp show data:

example-get-started example
❯ dvc exp show --show-json
{
  "workspace": {
    "baseline": {
      "data": {
        "timestamp": null,
        "params": {
          "params.yaml": {
            "data": {
              "prepare": {
                "split": 0.21,
                "seed": 20170428
              },
              "featurize": {
                "max_features": 200,
                "ngrams": 2
              },
              "train": {
                "seed": 20170428,
                "n_est": 50,
                "min_split": 0.01
              }
            }
          }
        },
        "deps": {
          "data/data.xml": {
            "hash": "22a1a2931c8370d3aeedd7183606fd7f",
            "size": 14445097,
            "nfiles": null
          },
          "src/prepare.py": {
            "hash": "f09ea0c15980b43010257ccb9f0055e2",
            "size": 1576,
            "nfiles": null
          },
          "data/prepared": {
            "hash": "153aad06d376b6595932470e459ef42a.dir",
            "size": 8437363,
            "nfiles": 2
          },
          "src/featurization.py": {
            "hash": "e0265fc22f056a4b86d85c3056bc2894",
            "size": 2490,
            "nfiles": null
          },
          "data/features": {
            "hash": "f35d4cc2c552ac959ae602162b8543f3.dir",
            "size": 2232588,
            "nfiles": 2
          },
          "src/train.py": {
            "hash": "c3961d777cfbd7727f9fde4851896006",
            "size": 967,
            "nfiles": null
          },
          "model.pkl": {
            "hash": "46865edbf3d62fc5c039dd9d2b0567a4",
            "size": 1763725,
            "nfiles": null
          },
          "src/evaluate.py": {
            "hash": "44e714021a65edf881b1716e791d7f59",
            "size": 2346,
            "nfiles": null
          }
        },
        "outs": {
          "data/prepared": {
            "hash": "153aad06d376b6595932470e459ef42a.dir",
            "size": 8437363,
            "nfiles": 2,
            "use_cache": true,
            "is_data_source": false
          },
          "data/features": {
            "hash": "f35d4cc2c552ac959ae602162b8543f3.dir",
            "size": 2232588,
            "nfiles": 2,
            "use_cache": true,
            "is_data_source": false
          },
          "model.pkl": {
            "hash": "46865edbf3d62fc5c039dd9d2b0567a4",
            "size": 1763725,
            "nfiles": null,
            "use_cache": true,
            "is_data_source": false
          },
          "data/data.xml": {
            "hash": "22a1a2931c8370d3aeedd7183606fd7f",
            "size": 14445097,
            "nfiles": null,
            "use_cache": true,
            "is_data_source": true
          }
        },
        "queued": false,
        "running": false,
        "executor": null,
        "metrics": {
          "evaluation.json": {
            "data": {
              "avg_prec": 0.9249974999612706,
              "roc_auc": 0.9460213440787918
            }
          }
        }
      }
    }
  },
  "f3d69eedda6b1c051b115523cf5c6c210490d0ea": {
    "baseline": {
      "data": {
        "timestamp": "2022-07-13T14:17:20",
        "params": {
          "params.yaml": {
            "data": {
              "prepare": {
                "split": 0.21,
                "seed": 20170428
              },
              "featurize": {
                "max_features": 200,
                "ngrams": 2
              },
              "train": {
                "seed": 20170428,
                "n_est": 50,
                "min_split": 0.01
              }
            }
          }
        },
        "deps": {
          "data/data.xml": {
            "hash": "22a1a2931c8370d3aeedd7183606fd7f",
            "size": 14445097,
            "nfiles": null
          },
          "src/prepare.py": {
            "hash": "f09ea0c15980b43010257ccb9f0055e2",
            "size": 1576,
            "nfiles": null
          },
          "data/prepared": {
            "hash": "153aad06d376b6595932470e459ef42a.dir",
            "size": 8437363,
            "nfiles": 2
          },
          "src/featurization.py": {
            "hash": "e0265fc22f056a4b86d85c3056bc2894",
            "size": 2490,
            "nfiles": null
          },
          "data/features": {
            "hash": "f35d4cc2c552ac959ae602162b8543f3.dir",
            "size": 2232588,
            "nfiles": 2
          },
          "src/train.py": {
            "hash": "c3961d777cfbd7727f9fde4851896006",
            "size": 967,
            "nfiles": null
          },
          "model.pkl": {
            "hash": "46865edbf3d62fc5c039dd9d2b0567a4",
            "size": 1763725,
            "nfiles": null
          },
          "src/evaluate.py": {
            "hash": "44e714021a65edf881b1716e791d7f59",
            "size": 2346,
            "nfiles": null
          }
        },
        "outs": {
          "data/prepared": {
            "hash": "153aad06d376b6595932470e459ef42a.dir",
            "size": 8437363,
            "nfiles": 2,
            "use_cache": true,
            "is_data_source": false
          },
          "data/features": {
            "hash": "f35d4cc2c552ac959ae602162b8543f3.dir",
            "size": 2232588,
            "nfiles": 2,
            "use_cache": true,
            "is_data_source": false
          },
          "model.pkl": {
            "hash": "46865edbf3d62fc5c039dd9d2b0567a4",
            "size": 1763725,
            "nfiles": null,
            "use_cache": true,
            "is_data_source": false
          },
          "data/data.xml": {
            "hash": "22a1a2931c8370d3aeedd7183606fd7f",
            "size": 14445097,
            "nfiles": null,
            "use_cache": true,
            "is_data_source": true
          }
        },
        "queued": false,
        "running": false,
        "executor": null,
        "metrics": {
          "evaluation.json": {
            "data": {
              "avg_prec": 0.9249974999612706,
              "roc_auc": 0.9460213440787918
            }
          }
        }
      }
    }
  }
}

Reproduce

  1. clone example-get-started (a scripted version of these steps is sketched below)
  2. add git+https://github.com/iterative/dvc to src/requirements.txt
  3. create a venv, source the activate script, and install the requirements
  4. dvc pull
  5. change params.yaml and queue two experiments with dvc exp run --queue
  6. dvc queue start -j 2
  7. dvc exp show
  8. dvc queue status
  9. dvc exp show
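For convenience, here is a rough Python sketch of steps 4-9 (it assumes the venv is already active, and uses --set-param in place of hand-editing params.yaml; the parameter values are illustrative):

import subprocess

def run(*args: str) -> None:
    # Echo and run a command in the working copy, failing fast on errors.
    print("$", " ".join(args))
    subprocess.run(args, check=True)

run("dvc", "pull")

# Queue two experiments with different params (values are illustrative).
run("dvc", "exp", "run", "--queue", "--set-param", "prepare.split=0.25")
run("dvc", "exp", "run", "--queue", "--set-param", "prepare.split=0.30")

# Start two workers, then interleave show/status while the tasks execute.
run("dvc", "queue", "start", "-j", "2")
run("dvc", "exp", "show")
run("dvc", "queue", "status")
run("dvc", "exp", "show")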

When recreating this I can see that both experiments show as successful in dvc queue status, but the second one has not made it into the exp show table. Final results:

❯ dvc queue status 
Task     Name    Created    Status
9d22751          02:50 PM   Success
962c834          02:50 PM   Success

Worker status: 0 active, 0 idle

First column of exp show:

  workspace
  bigrams-experiment
  └── 65584bd [exp-c88e8]

and the SHAs don't match those reported by dvc queue status?

Expected

It should be possible to run exp show and queue status in parallel with the execution of tasks from the queue.
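For context on the second error above: Extra data: line 1 column N (char N-1) is what Python's json module raises when bytes follow a complete JSON document, which would be consistent with a reader racing a writer on some shared state file. A minimal sketch of the decode failure (the payload below is fabricated; the actual file and offsets will differ):

import json

# A complete JSON document with stray trailing bytes, as a reader might
# observe if it wins a race against a writer mid-update.
blob = '{"9d22751": {"status": "Success"}}{"962c834": {"sta'

try:
    json.loads(blob)
except json.JSONDecodeError as exc:
    print(exc)  # -> Extra data: line 1 column 35 (char 34)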

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.13.1.dev87+gc2668110 
---------------------------------
Platform: Python 3.8.9 on macOS-12.2.1-arm64-arm-64bit
Supports:
        webhdfs (fsspec = 2022.5.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.5.1),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.5.1)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git

Additional Information (if any):

Please let me know if you need anything else from me. Thank you.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 18 (18 by maintainers)

Most upvoted comments

Sounds like it's related to https://github.com/iterative/dvc-task/issues/73. I tried several times but couldn't hit this. My guess is that it is not related to experiments from old versions, but rather to 1. concurrency and 2. checkpoints. I can fix the error message '{entry.stash_rev[:7]}' first so we can see what the stash_rev value actually is.

TL;DR: I can recreate the issue by using dvc queue start -j 2. As j > 1 is currently experimental, we can probably close this.

I was unable to reproduce this one, and it’s unclear whether it should be a priority.

I can definitely recreate it. I just ran into it again:

[screenshot of the error]

When trying to clean up experiments after getting that warning:

❯ dvc exp gc -f --all-tags 
WARNING: This will remove all experiments except those derived from the workspace and all git tags of the current repo. Run queued experiments will be removed.
ERROR: Invalid experiment '{entry.stash_rev[:7]}'.

This will be an issue in the extension because errors generate a popup that the user sees.

Deleting .dvc/tmp/exps gets rid of the error altogether.
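A minimal sketch of that workaround, assuming it is acceptable to throw away whatever legacy executor state lives under that directory:

import shutil
from pathlib import Path

legacy_exps = Path(".dvc") / "tmp" / "exps"
if legacy_exps.exists():
    # Discards the stale state left behind by the pre-upgrade CLI; any
    # experiments still tracked only there will be lost.
    shutil.rmtree(legacy_exps)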

Repro steps:

  1. Using checkpoint-based experiments with 2.11.0
  2. Run an experiment in the workspace.
  3. Upgrade to 2.15.0.
  4. Queue two experiments with different params.
  5. dvc queue start -j 2
  6. run dvc exp show --show-json almost immediately after starting the queue (as the extension does).
  7. One experiment will run, the other will disappear.
  8. dvc queue status returns ERROR: Invalid experiment '{entry.stash_rev[:7]}'.

Even these repro steps are a bit hit or miss: out of 3 attempts, I hit the error, along with a missing experiment, 2 times.

I can also recreate it just by using steps 4-8 (no upgrade needed).

The error is probably caused by the combination of steps 5 and 6. As j > 1 is a known issue, we can probably close this.

Sorry, @karajan1001 I’ve been 100% occupied with integrating data status. I will get back to you soon.

Regarding ERROR: Invalid experiment '{entry.stash_rev[:7]}'. — I can't reproduce it. Are there any reproduction steps, or is it a compatibility issue that only occurs when dealing with previously generated experiments?

Yes, I have previously recreated this by having completed experiments in the workspace, then upgrading the CLI, queueing a new experiment, and running the queue. The project that I've been testing with uses checkpoints as well.

It should also be noted here that after upgrading the CLI there is no way back without removing .dvc/tmp/exps. That is definitely less of a concern than being able to seamlessly upgrade.

Hope this helps 👍🏻.