dvc: push: Unnotified error when pushing data into HTTP remote
Bug Report
Issue name
push: Unnotified error when pushing data into HTTP remote
Description
This issue happens when pushing a large number of files to an HTTP DVC remote. dvc push reports that everything is correct; however, when downloading the files later, some of them were never uploaded correctly and therefore do not exist on the remote.
Reproduce
- download a dataset
- add it to dvc
- push the data (-> no error)
- remove the cache and tmp folders inside .dvc to ensure the data is downloaded from the remote
- pull the data again (-> some files missing)
- push the data again (-> everything reported as up to date!)
In more detail:
- export DATASET_FOLDER=cars_train
- export REMOTE_NAME=my_http_remote
// Download random dataset
- wget http://ai.stanford.edu/~jkrause/car196/cars_train.tgz
- tar -xf cars_train.tgz
- rm cars_train.tgz
- export REMOTE_NAME_FILE="${REMOTE_NAME/"-"/"_"}"
// Try with HTTP remote:
// Add and push the data
- dvc remote default ${REMOTE_NAME}
- dvc add $DATASET_FOLDER
- dvc push -v $DATASET_FOLDER
// Download the data and check
- dvc remote default ${REMOTE_NAME}
- rm -rf $DATASET_FOLDER
- rm -rf .dvc/cache
- rm -rf .dvc/tmp
- dvc pull -v $DATASET_FOLDER
// Try to push the data again
$ dvc push -v $DATASET_FOLDER
Everything is up to date.
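To see which objects actually made it to the remote, one can issue a HEAD request per cached object right after dvc push (before clearing .dvc/cache). This is only a sketch under assumptions: REMOTE_URL is a placeholder for your remote's base URL, the remote must answer HEAD requests, the requests package must be installed, and it assumes the flat DVC 2.x remote layout (two-character md5 prefix directory plus the remaining hash characters).

```python
import os
import requests

REMOTE_URL = "https://example.com/dvc-remote"  # assumption: base URL of your HTTP remote
CACHE_DIR = ".dvc/cache"

missing = []
for prefix in sorted(os.listdir(CACHE_DIR)):
    subdir = os.path.join(CACHE_DIR, prefix)
    # DVC 2.x stores objects as <two-char md5 prefix>/<remaining hash chars>
    if len(prefix) != 2 or not os.path.isdir(subdir):
        continue
    for name in sorted(os.listdir(subdir)):
        url = f"{REMOTE_URL}/{prefix}/{name}"
        # HEAD is enough: we only care whether the object exists on the remote
        resp = requests.head(url, allow_redirects=True)
        if resp.status_code == 404:
            missing.append(prefix + name)

print(f"{len(missing)} cached object(s) not found on the remote")
for md5 in missing:
    print(md5)
```

If any hashes are reported missing even though dvc push printed no errors, that matches the behaviour described above.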
Expected
All the data in the remote, of course 😉
Environment information
Output of dvc doctor:
$ dvc doctor
DVC version: 2.10.1 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.4.0-104-generic-x86_64-with-glibc2.29
Supports:
azure (adlfs = 2021.10.0, knack = 0.9.0, azure-identity = 1.7.1),
webhdfs (fsspec = 2022.1.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
ssh (sshfs = 2021.11.2)
Additional Information (if any): We previously had the "Session is closed" problem: pull: Using jobs>1 fails with RuntimeError: Session is closed in http remote #7421
Solved with: fs.http: prevent hangs under some network conditions #7460
We have also proposed the following: dvc push doesn't recognise that files are missing in remote storage #4164, Force push option #7268, and push: add --force option to force push without .dir optimization #7532.
However, the problem here is more serious, because you do not actually know that the push failed (we would have to ask users to push at least twice to ensure the data has been uploaded correctly…).
Additionally, when you try to push the files again, the .dir optimization prevents the files from being uploaded again, and DVC thinks everything is already uploaded. If the dataset has subfolders the problem is even worse, as re-adding the files does not correct the issue, again because of the .dir optimization.
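For context, here is a minimal sketch (not DVC's actual code; all names and hashes are made up) of why the .dir shortcut can hide the failure: once the .dir object is on the remote, the optimized status check assumes every entry listed in it is there too, so a partially failed upload looks complete.

```python
# The remote is simulated as a set of object hashes that were successfully
# uploaded: here the upload of "filehash2" failed, but the .dir object made it.
remote = {"dirhash.dir", "filehash1", "filehash3"}

def remote_has(md5):
    return md5 in remote

def dir_complete_shortcut(dir_md5):
    # Optimized check: if the .dir object is on the remote, assume every
    # file listed inside it is there as well.
    return remote_has(dir_md5)

def dir_complete_exhaustive(dir_md5, entries):
    # Exhaustive check: the .dir object and every single entry must exist.
    return remote_has(dir_md5) and all(remote_has(e) for e in entries)

entries = ["filehash1", "filehash2", "filehash3"]
print(dir_complete_shortcut("dirhash.dir"))             # True  -> "Everything is up to date."
print(dir_complete_exhaustive("dirhash.dir", entries))  # False -> filehash2 still needs pushing
```

An exhaustive per-entry check would catch the missing object at the cost of one remote query per file, which is presumably why the .dir shortcut exists in the first place.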
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 16 (8 by maintainers)
Hello! We have finally obtained the necessary permissions and have published a version of the HTTP remote here: https://github.com/atekoa/dvc-http-remote. We have added two examples to help with its use; the complex example makes it easier to reproduce the error. Greetings, and sorry for the delay.
It works!
I have tried the fix on a multi-folder dataset against our HTTP remote and we no longer get the error. We will keep checking this over the next week and will provide additional info.
Hey @atekoa, would you mind trying the fix suggested in the above issue to see if it solves your issue?
pip install dvc-data==0.1.15
Possibly related: #8100
I could not reproduce the issue. Closing this as it’s likely related to the custom remote being used.
Hi @atekoa, I’ve been a bit busy in the past few weeks, so I haven’t had time to work on this. I should be able to have another go sometime next week though 🙂
Thanks! I will look into it asap
cc @dtrifiro could you take a look?