dvc: add: support rollback/recovery from partial/failed dvc add

If the user runs a long-running dvc add directory that dies somewhere in the middle, there is no obvious (to the user) way to recover from the resulting state, where:

  1. DVC has not generated a .dvc file
  2. half of the user’s files are still in the workspace (and have not been moved into cache)
  3. half of the user’s files have been moved into cache, but have not been checked out/linked (and now appear to be lost)

Currently it’s possible to recover from this state manually, as long as we have the .dir file for the original complete directory, but the process is not straightforward.

discord context https://discord.com/channels/485586884165107732/485596304961962003/872423982463340555

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 17 (16 by maintainers)

Most upvoted comments

I’m bumping the priority here. We already have a guideline against breaking the user’s workspace, so we need to implement this correctly.

Another user nearly lost their ~140GB of data. We should take data loss seriously. Even if we don’t add support for transactions (rollback/commit), we should not break the user’s workspace by moving files to the cache.

Suggestion: at least prioritize implementing a progress bar that shows the ETA. The lack of a progress bar was the reason I interrupted the add process. Thanks again for the support in getting the data back.

Another user on discord running into this: https://discord.com/channels/485586884165107732/485596304961962003/900792694023028766

Also, note that we’ve stopped poisoning our local cache with .dir, but our staging db is still memory-only, so you are no longer able to recover the metadata 🙁

Unfortunately, we are not ready to make staging persistent right now, and we clearly shouldn’t bring back poisoning. Instead, we should take a closer look at trying reflink/hardlink/copy (in this order, and without symlink) instead of moving files during odb.add in dvc/fs/utils.py::transfer(). These days our object transfer (the dvc/objects/transfer.py one) and checkout are separate (kudos to @pmrowla for unifying object save() and object transfer() 🙏) and in more-or-less good enough shape to handle this approach instead of move. This would make us failure-proof during transfer (until we’ve saved the .dir into .dvc/cache, the user’s workspace stays intact), and for later failures (e.g. during checkout) we would already have a .dir that could be used for recovery (by us in the mid term, but as a hack in the short term). So this might be a good partial solution.
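
As a rough illustration of the "link or copy instead of move" idea, here is a minimal Python sketch. It is not DVC's actual transfer() code: link_or_copy is a hypothetical helper name, and the reflink step is left out because the standard library has no portable call for it. The only point it shows is that the source file is never removed, so a failure midway leaves the workspace intact.

```python
import os
import shutil


def link_or_copy(src, dst):
    """Place src's content at dst WITHOUT removing src.

    Hypothetical helper for illustration only; not part of DVC.
    Tries a hardlink first and falls back to a full copy.
    Returns the method that was used.
    """
    os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
    if os.path.exists(dst):
        # In a content-addressed cache an existing object is assumed
        # to already hold the same content, so there is nothing to do.
        return "exists"
    try:
        os.link(src, dst)  # cheap, but only works on the same filesystem
        return "hardlink"
    except OSError:
        shutil.copy2(src, dst)  # slower, works across filesystems
        return "copy"
```

With this shape, an interruption leaves every workspace file in place; the worst case is a partially populated cache, which a later run can complete.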

Here’s the suggestion on how to recover from a failed dvc add from @pmrowla, added for better discovery:

for the record, in this situation where dvc add directory fails or is aborted midway through the copy/move files to cache operation, but before we generate a .dvc file, no data loss has actually occurred. Files which appear to be missing are in .dvc/cache but have not been re-linked back into the workspace.

recovering from this state and completing the partial add can be done manually:

the .dir file for the directory is generated and saved in .dvc/cache prior to moving any files, and the relevant file can be found (eventually) by inspecting any dir files which are present in the cache:

ls .dvc/cache/**/*.dir
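
To make it easier to tell the candidate .dir files apart, a small Python sketch like the following could summarize each one. It assumes that a .dir object is a JSON list of entries with md5 and relpath keys, and that the directory hash is the two-character parent directory name followed by the file name without its .dir suffix; list_dir_objects is a hypothetical helper, not a DVC command.

```python
import json
from pathlib import Path


def list_dir_objects(cache_dir):
    """Summarize every .dir object found under cache_dir.

    Returns a list of (dir_hash, entry_count, first_relpaths) tuples.
    Assumes each .dir file is a JSON list of {"md5", "relpath"} dicts.
    """
    results = []
    for path in sorted(Path(cache_dir).rglob("*.dir")):
        try:
            entries = json.loads(path.read_text())
        except (OSError, ValueError):
            continue  # unreadable or not JSON: skip it
        if not isinstance(entries, list):
            continue
        relpaths = [e.get("relpath") for e in entries if isinstance(e, dict)]
        # Assumed layout: <cache>/<first 2 hex chars>/<rest>.dir
        dir_hash = path.parent.name + path.stem
        results.append((dir_hash, len(relpaths), relpaths[:3]))
    return results
```

Comparing the entry counts and sample paths against what the directory originally contained should narrow down which hash belongs to the interrupted add.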

at this point, the user’s workspace will contain some portion of the files, which have not yet been moved into the DVC cache, while the DVC cache contains the remaining portion (which appear to be missing or lost in the workspace)
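
The split described above can be checked before changing anything. The sketch below is an illustration under assumptions, not a DVC feature: classify_entries is a hypothetical helper, and it assumes the same JSON .dir format plus a cache layout of <cache>/<first two hash characters>/<remaining characters>. It only tests for existence and does not verify file contents.

```python
import json
from pathlib import Path


def classify_entries(dir_file, workspace_dir, cache_dir):
    """Sort the entries of a .dir file by where each file currently lives.

    Returns a dict with three lists of relpaths:
      "workspace"  - still present in the workspace directory
      "cache_only" - gone from the workspace but present in the cache
      "missing"    - found in neither place
    """
    entries = json.loads(Path(dir_file).read_text())
    state = {"workspace": [], "cache_only": [], "missing": []}
    for entry in entries:
        rel, md5 = entry["relpath"], entry["md5"]
        if (Path(workspace_dir) / rel).exists():
            state["workspace"].append(rel)
        elif (Path(cache_dir) / md5[:2] / md5[2:]).exists():
            state["cache_only"].append(rel)
        else:
            state["missing"].append(rel)
    return state
```

If the "missing" list comes back empty for a given .dir file, every file it references is accounted for, which is a reasonable sign that it is the right one to use in the steps that follow.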

to resolve the issue and “complete” the original add, you can just do

dvc add directory

which will add the remaining workspace (“non-missing”) files to the DVC cache, and then generate a directory.dvc file which contains a hash for that partial workspace state

at this point, you can just manually edit directory.dvc so that it only contains

outs:
- md5: <original_dir_hash>.dir
  path: directory

where original_dir_hash is the directory hash which was identified at the beginning of the recovery process

from here

dvc checkout

will checkout the entire/original dataset back into the workspace

@karajan1001 Good point! I meant that we could use hardlink temporarily, just so that we have an easier time rolling back if something goes wrong.

Related: https://github.com/iterative/dvc/discussions/6252.

When committing to the cache, DVC moves the files into the cache and, only after that is complete, relinks them back into the workspace. This can look like data loss if the process fails in between.