dvc: update: when using --to-remote, only update missing cache files
Currently `update --to-remote` does exactly the same thing as `import-url --to-remote`, except that `update` uses the stage's URL instead of taking it as an argument. This means that for each source file, even though it might already exist on our remote storage, we download it, calculate its hash, and upload it to the remote storage.
For `import --to-remote`, downloading, uploading, and hash calculation happen at the same time, which means we only do 1 download + 1 upload. However, this is not as efficient as it could be for `update --to-remote` with a directory consisting of small files: for each file, we repeat the same actions without ever checking whether it is already present. Since we cannot determine its presence beforehand (before actually downloading the file and calculating its hash in chunks), we can optimize `update` in a way that pays off whenever the number of updated files is smaller than the total number of files in a directory of small files.
So instead of doing 1 download + 1 upload in every case, we can do 1 download in the best case (when the file already exists) and 2 downloads + 1 upload in the worst case (when the file doesn't exist in the cache). This might be a bad trade-off when the changed part of the source is always bigger than the remaining, unchanged part, though I think that is an unusual and rare case.
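The best/worst-case flow above can be sketched roughly as follows. This is a hedged illustration only: `stream_hash`, the in-memory `sources` dict, and the `remote` dict are simplified stand-ins for DVC internals, not real DVC APIs.

```python
import hashlib

def stream_hash(data: bytes, chunk_size: int = 4) -> str:
    """First 'download': stream the content in chunks only to compute its hash."""
    h = hashlib.md5()
    for i in range(0, len(data), chunk_size):
        h.update(data[i:i + chunk_size])
    return h.hexdigest()

def update_to_remote(sources: dict, remote: dict) -> dict:
    """Transfer only content whose hash is missing from the remote."""
    stats = {"downloads": 0, "uploads": 0}
    for name, data in sources.items():
        md5 = stream_hash(data)      # hash pass counts as 1 download
        stats["downloads"] += 1
        if md5 in remote:            # best case: 1 download, 0 uploads
            continue
        stats["downloads"] += 1      # worst case: 2nd (real) download
        remote[md5] = data           # + 1 upload
        stats["uploads"] += 1
    return stats
```

For example, with a remote that already holds one of two files, only the missing one costs the extra download and the upload; the existing one stops after the hash pass.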
CC: @efiop @dberenbaum
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 18 (18 by maintainers)
Ah, I actually had a patch (in the old `transfer()` implementation) for that (with `known_hash`, maybe you remember). I might check later whether I can port it to the current version or not. It wasn't directly using the index, but rather doing a couple of things manually, like getting the contents with `objects.load()` and then removing existing ones from the search for later stages. I'll also check `remote/index.py`. (`update` syncs 50% more data, and with that I think we are in very good shape.)
With #5773, the performance of `update` is substantially improved. A few future steps:
- `verify` option (#5860)
- `out.transfer()` method (#5861)

Closing this issue since its initial goals (and some extras) are done, and porting the `dvc/remote` logic to the new odb arch will definitely have its own separate issue.

@isidentical Though please feel free to send a WIP if you feel like it, it will help the discussion, maybe we are on the same page already 😉
@isidentical Oh, that’s great! It will come in handy in the next steps for odb, let’s discuss them during our next meeting. We’ll need to port old dvc/remote logic to the new odb arch anyway, and all of those optimizations should automatically kick in for both push/pull/status/etc and --to-remote/to-cache.
For the record: there are a few more tricks up our sleeves, e.g. not only using
verify
but also leveraging indexes that we have. So it will be cut even more. 😉For finding the only missing files, we need to calculate the hashes and find out which one of those hashes are not present in the remote. Unfortunately, we can not calculate the hashes without downloading all the files beforehand. So instead of wasting an upload on a file that might exist, we initially download everything (actually we don’t download anything to anywhere, just calculate the hashes, but we waste the same time) and then find out the missing hashes and apply the normal procedure (like this time download normally as in chunks and upload that chunks to the remote).