dvc: exp: Checkpoints created during a `dvc exp run --temp` run are lost after failure (e.g., `kill -9`)

Bug Report

Description

I have a long-running training stage in my dvc.yaml which uses DVCLive to track metrics and experiment checkpoints, by specifying checkpoint: true for the PyTorch model .ckpt file created by PyTorch Lightning's ModelCheckpoint callback. When the training is executed with dvc exp run --temp, it runs inside a temp folder created in .dvc/tmp/exps/standalone/, and all checkpoint Git objects are stored under .dvc/tmp/exps/standalone/tmpXXX/.git/objects/. When the training process is interrupted (e.g., OOM, a shared-memory issue, or a failure to create new threads due to OS limits), DVC reports ERROR: failed to reproduce 'train': failed to run: ... and exits. While doing so, it deletes the temp directory in .dvc/tmp/exps/standalone/ and, along with it, all previously created checkpoints. I cannot find the corresponding checkpoint objects in the .git/objects folder of the workspace and am unable to recover those checkpoints.
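
For reference, a minimal dvc.yaml of this shape might look like the following (a sketch based on DVC 2.x checkpoint outputs; the script, dependency, and file names are illustrative placeholders, not the actual project files):

stages:
  train:
    cmd: python train.py
    deps:
      - train.py
    outs:
      - model.ckpt:
          # checkpoint: true marks the output as a checkpoint, so DVC
          # records an experiment commit for each checkpoint iteration.
          checkpoint: true
    metrics:
      - dvclive.json:
          # metrics summary written by DVCLive (exact path depends on
          # the DVCLive configuration).
          cache: false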

Reproduce

  1. Create a dvc.yaml with a train stage that runs a training script using DVCLive and checkpoints.
  2. Execute the stage with dvc exp run --temp train.
  3. Wait a few epochs until several checkpoints have been stored.
  4. Kill the training process with kill -9.
  5. Observe that the .dvc/tmp/exps/standalone/tmpXXX folder is gone and that no checkpoints show up in the workspace (e.g., in dvc exp show).

Expected

Checkpoints should be preserved so that it is possible to recover from failures such as the ones mentioned in the description.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.31.0 (rpm)
---------------------------------
Platform: Python 3.8.3 on Linux-3.10.0-1160.15.2.el7.x86_64-x86_64-with-glibc2.14
Subprojects:

Supports:
        azure (adlfs = None, knack = 0.10.0, azure-identity = 1.11.0),
        gdrive (pydrive2 = 1.14.0),
        gs (gcsfs = None),
        hdfs (fsspec = None, pyarrow = 9.0.0),
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        oss (ossfs = 2021.8.0),
        s3 (s3fs = None, boto3 = 1.24.59),
        ssh (sshfs = 2022.6.0),
        webdav (webdav4 = 0.9.7),
        webdavs (webdav4 = 0.9.7),
        webhdfs (fsspec = None)
Cache types: hardlink, symlink
Cache directory: xfs on /dev/md124
Caches: local, s3
Remotes: s3, s3
Workspace directory: xfs on /dev/md124
Repo: dvc (subdir), git

Additional Information (if any):

When interrupting the experiment with CTRL+C, the training script is set up to still return a zero exit code, so that DVC considers the experiment successfully executed. In this case, I expect the checkpoints to be preserved before the temp directory is deleted (but I haven’t tested this yet).
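
For illustration, a minimal sketch of how a training script can map CTRL+C to a zero exit code, assuming DVC only inspects the exit code of the stage command (the handler below is a hypothetical example, not the actual script):

import signal
import sys


def exit_cleanly(signum, frame):
    # Treat CTRL+C (SIGINT) as a deliberate stop: exiting with code 0
    # makes the stage command look successful to DVC, so the experiment
    # should be collected instead of discarded.
    sys.exit(0)


signal.signal(signal.SIGINT, exit_cleanly)

# ... the training loop / Trainer.fit() would run below ...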

Most upvoted comments

I can reproduce with --temp but not with --queue. Can we make --temp behave like --queue?

Failed queued experiments are now shown as failed in the table and through the exp queue commands, but we are not saving any Git commits for those failed exps; you just get a row showing which run failed (and you can now use queue logs to see the error logs explaining why it failed).

But this only applies to --queue’d experiments.

Ah, what I should have said was “fetched into Git”. “Workspace” in this case really refers to the “main dvc/git repo” vs. the “tempdir dvc/git repo” (which is used to run --queue and --temp exps outside of the main repo). I’m not talking about the actual local workspace.

Basically, the issue is that we are losing the successful checkpoint iterations’ Git commits and exp refs generated by exp run --temp, since we do not fetch them on failure (even though we do fetch them on failure for exp run --queue). This is not related to whether or not the changes actually get applied in the user’s workspace directory afterwards.
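
To make “fetch them on failure” concrete, here is a rough sketch of what that step amounts to (a hypothetical helper, not DVC’s API; it assumes experiment refs live under refs/exps/ in the executor’s Git repo, as in DVC 2.x):

import subprocess


def fetch_exp_refs(main_repo: str, executor_repo: str) -> None:
    # Fetch any experiment refs produced in the (temp) executor repo into
    # the main repo, so that successful checkpoint iterations survive even
    # if a later iteration fails and the temp directory is removed.
    subprocess.run(
        [
            "git", "-C", main_repo,
            "fetch", executor_repo,
            "refs/exps/*:refs/exps/*",
        ],
        check=True,
    )


# e.g. fetch_exp_refs(".", ".dvc/tmp/exps/standalone/tmpXXX")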

I think there is some confusion over what the desired behavior is right now. When using --queue, the current behavior is that any successful checkpoints will be preserved (and fetched into the main dvc repo/workspace from the temp execution workspace). If a later checkpoint iteration fails, we do not save that failed state, but the previous successful checkpoint iterations will still be available in the exp show table.

When using --temp we do not preserve the successful checkpoint iterations at all, due to the bug that @karajan1001 described: https://github.com/iterative/dvc/issues/8612#issuecomment-1333615852

My understanding is that the desired behavior here is to make --temp behave the same way as --queue for checkpoints. So successfully run iterations will still be available, but we do not actually need to save the failed/intermediate repo state of the final (unsuccessful/cancelled) iteration.

I think @karajan1001’s latest question was regarding how to handle actually saving that failed/intermediate final state. In my opinion, this is not something we should be addressing right now; it would be better to handle it in the future if/when we are able to revisit checkpoint behavior in general.

But for now, I think limiting the scope of this issue to “make --temp behave consistently with --queue” is what we should be focusing on.

The problem is the difference between --queue and --temp. In --queue

https://github.com/iterative/dvc/blob/aa2e830511bb5f93b98d74f3283a27da564d7e67/dvc/repo/experiments/queue/tasks.py#L110-L115

collect_exp and cleanup_exp are still run even if the run_signature task fails,

while in --temp

https://github.com/iterative/dvc/blob/aa2e830511bb5f93b98d74f3283a27da564d7e67/dvc/repo/experiments/queue/workspace.py#L117-L134

collect_executor is only run if the training succeeds.
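
A minimal sketch of the control-flow difference described above (the function names are illustrative stand-ins, not the actual DVC code):

def temp_flow(run, collect, cleanup):
    # --temp today: results are collected only when the run succeeds, so
    # earlier successful checkpoints are lost once cleanup removes the
    # temp directory after a failure.
    try:
        run()
        collect()
    finally:
        cleanup()


def queue_flow(run, collect, cleanup):
    # --queue today: collection and cleanup run regardless of whether the
    # run itself failed, so successful checkpoint iterations are fetched
    # into the main repo before the temp directory is removed.
    try:
        run()
    finally:
        collect()
        cleanup()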