dvc: add: support rollback/recovery from partial/failed dvc add
If the user runs a long-running `dvc add` on a directory and it dies somewhere in the middle, there is no obvious (to the user) way to recover from the state where:
- DVC has not generated a .dvc file
- half of the user’s files are still in the workspace (and have not been moved into cache)
- half of the user’s files have been moved into cache, but have not been checked out/linked (and now appear to be lost)
Currently it’s possible to recover from this state manually as long as we have the .dir file for the original complete directory, but not in a straightforward way.
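To make the manual route concrete, here is a minimal sketch of what such a recovery could look like. It is not an existing DVC command or the maintainers' procedure. It assumes the .dir file is a JSON list of entries with `md5` and `relpath` keys and that cached files live under `<cache>/<first two hash characters>/<rest of hash>`; the function name `restore_from_dir_file` is hypothetical.

```python
import json
import os
import shutil


def restore_from_dir_file(dir_file, cache_dir, workspace_dir):
    """Copy cached files back into the workspace for entries missing there.

    Assumed (not confirmed by this thread): the .dir file is a JSON list of
    {"md5": ..., "relpath": ...} entries, and the cache path of an object is
    <cache_dir>/<md5[:2]>/<md5[2:]>.
    """
    with open(dir_file, encoding="utf-8") as fobj:
        entries = json.load(fobj)

    restored, missing = [], []
    for entry in entries:
        md5, relpath = entry["md5"], entry["relpath"]
        target = os.path.join(workspace_dir, *relpath.split("/"))
        if os.path.exists(target):
            continue  # file was never moved out of the workspace
        cached = os.path.join(cache_dir, md5[:2], md5[2:])
        if not os.path.exists(cached):
            missing.append(relpath)  # neither in workspace nor in cache
            continue
        os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
        # Copy rather than move, so the cache stays intact if this is interrupted.
        shutil.copyfile(cached, target)
        restored.append(relpath)
    return restored, missing
```

The function returns the relative paths it restored and the ones it could not find in either place, so the second list can be checked by hand before anything else is run.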
Discord context: https://discord.com/channels/485586884165107732/485596304961962003/872423982463340555
About this issue
- State: closed
- Created 3 years ago
- Comments: 17 (16 by maintainers)
Commits related to this issue
- fs: transfer: try reflink If supported, reflink will provide the fastest transfer possible. That being said, right now we use `move` in most of the places, which still takes precedence. Pre-requisi... (committed to efiop/dvc by efiop 3 years ago)
- fs: transfer: try reflink (#6853) If supported, reflink will provide the fastest transfer possible. That being said, right now we use `move` in most of the places, which still takes precedence. ... (committed to iterative/dvc by efiop 3 years ago)
I’m bumping the priority here. We already have a guideline against breaking the user’s workspace, so we need to implement this correctly.
Another user nearly lost their ~140GB of data. We should take data loss seriously. Even if we don’t add support for transactions (rollback/commit), we should not break the user’s workspace by moving files to the cache.
Suggestion: at least prioritize implementing a progress bar that shows the ETA. The lack of a progress bar was the reason I interrupted the add process. Thanks again for the support in getting the data back.
Another user on Discord ran into this: https://discord.com/channels/485586884165107732/485596304961962003/900792694023028766
Also, note that we’ve stopped poisoning our local cache with .dir, but our staging db is still memory-only, so you are no longer able to recover the metadata 🙁
Unfortunately, we are not ready to make staging persistent right now, and we clearly shouldn’t bring back poisoning, so instead we should take a closer look at the approach of trying to reflink/hardlink/copy (in this order, and without symlink) instead of moving during `odb.add` in `dvc/fs/utils.py::transfer()`. These days our object transferring (talking about the dvc/objects/transfer.py one) and checking out are separate (kudos to @pmrowla for unifying object `save()` and object `transfer()` 🙏) and in good enough shape that they should be able to handle the aforementioned approach instead of `move`. This will allow us to be failure-proof during `transfer` (before we’ve saved .dir into .dvc/cache, the user’s workspace will be intact), and during later failures (e.g. checkout) we will already have a `.dir` that could be used for recovery (by us in the midterm, but as a hack in the short term). So this might be a good partial solution.
Here’s the suggestion on how to recover from a failed `dvc add` from @pmrowla, added for better discovery:
@karajan1001 Good point! I meant that we could use hardlink temporarily, just so that we have an easier time rolling back if something goes wrong.
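For illustration, the reflink/hardlink/copy ordering discussed in this thread could look roughly like the sketch below. This is not the code in `dvc/fs/utils.py`; the function names are hypothetical, and the reflink step assumes Linux, where the `FICLONE` ioctl is available through the standard `fcntl` module. The point it shows is that the source file is never removed, so a failure at any step leaves the workspace untouched.

```python
import errno
import os
import shutil

try:
    import fcntl  # Unix only
except ImportError:
    fcntl = None

FICLONE = 0x40049409  # Linux ioctl request number for a copy-on-write clone


def reflink(src, dst):
    """Clone src into dst on filesystems that support it (assumes Linux)."""
    if fcntl is None:
        raise OSError(errno.ENOTSUP, "reflink is not available on this platform")
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())


def hardlink(src, dst):
    os.link(src, dst)


def copy(src, dst):
    shutil.copyfile(src, dst)


def transfer(src, dst, methods=(reflink, hardlink, copy)):
    """Place src at dst without removing src; return the name of the method used."""
    if os.path.exists(dst):
        return "exists"  # a content-addressed cache already holds this object
    os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
    last_error = None
    for method in methods:
        try:
            method(src, dst)
            return method.__name__
        except OSError as exc:
            last_error = exc
            # Remove a partial destination left behind by the failed attempt.
            if os.path.lexists(dst):
                os.unlink(dst)
    raise OSError(errno.ENOTSUP, f"no transfer method worked for {src}") from last_error
```

Symlinks are left out of the default `methods` on purpose, matching the "without symlink" note above. A hardlinked cache entry shares its data with the workspace file, so rolling back only requires deleting the cache side.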
Related: https://github.com/iterative/dvc/discussions/6252.