dvc: update: when using --to-remote, only update missing cache files

Currently, update --to-remote does exactly the same thing as import-url --to-remote, except that update uses the stage’s url instead of taking it as an argument. This means that for each source file, even if it already exists on our remote storage, we download it, calculate its hash, and upload it to the remote storage.

For import-url --to-remote, downloading, uploading, and hash calculation happen at the same time, which means we only do 1 download + 1 upload. However, this is not as efficient as it could be for update --to-remote on a directory that consists of small files: for each file we repeat the same actions without ever checking whether it is already present on the remote. Since we cannot determine its presence beforehand (before actually downloading the file and calculating its hash in chunks), we can optimize update in a way that pays off whenever the number of changed files is smaller than the total number of files in such a directory.

So instead of doing 1 download + 1 upload in every case, we can do 1 download in the best case (when the file already exists on the remote) and 2 downloads + 1 upload in the worst case (when the file doesn’t exist in the cache). This optimization may backfire when the changed part of the source is consistently bigger than the remaining, unchanged part, but I think that is a fairly unusual and rare case.
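Roughly, the per-file logic for a directory of small files could look like the sketch below; hash_url_stream, remote_contains, and transfer_url_to_remote are made-up placeholder helpers for illustration, not actual DVC internals.

```python
# Hypothetical per-file flow for `update --to-remote` on a directory of small files.
def update_to_remote(file_urls, remote):
    """`file_urls` is an iterable of source file URLs inside the imported directory."""
    for url in file_urls:
        # Download #1: stream the file only to compute its md5 (nothing is stored).
        md5 = hash_url_stream(url)

        if remote_contains(remote, md5):
            # Best case: 1 download, 0 uploads; the object is already on the remote.
            continue

        # Worst case: 2 downloads + 1 upload; stream the file a second time,
        # this time piping the chunks straight to the remote storage.
        transfer_url_to_remote(url, remote, md5)
```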

CC: @efiop @dberenbaum

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 18 (18 by maintainers)

Most upvoted comments

I mean dvc/remote/index.py. E.g. for old 12345.dir hash, if that object exists on s3, we can assume that every file in that directory already exists on the remote too. That’s what we use for pull/push/status right now.
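As a rough illustration of that shortcut (remote_contains is a placeholder existence check, not the actual dvc/remote/index.py code):

```python
# If the directory object "<hash>.dir" is already on the remote, assume every
# file listed inside that directory is on the remote too and skip per-file checks.
def all_files_already_pushed(remote, dir_hash):
    # e.g. dir_hash == "12345" -> check for the "12345.dir" object on s3
    return remote_contains(remote, f"{dir_hash}.dir")
```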

Ah, I actually had a patch for that in the old transfer() implementation (with known_hash, maybe you remember). I might check later whether I can port it to the current version.

It wasn’t directly using the index but rather doing a couple of things manually, like getting the contents with objects.load() and then removing the existing ones from the search for later stages. I’ll also check remote/index.py.
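Something along those lines, sketched with placeholder helpers (load_dir_entries and remote_contains_batch stand in for the objects.load()-based code, they are not real DVC functions):

```python
# Load the directory's entries and drop the hashes that are already on the
# remote, keeping only the missing ones for the later transfer stage.
def missing_entries(remote, dir_hash):
    entries = load_dir_entries(dir_hash)  # -> [(relpath, md5), ...]
    present = remote_contains_batch(remote, {md5 for _, md5 in entries})
    return [(path, md5) for path, md5 in entries if md5 not in present]
```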

| command | before #5860 | after #5860 |
| --- | --- | --- |
| dvc import-url --to-remote | unchanged | unchanged |
| dvc update --to-remote | 2m49,857s | 2m24,000s |

(update syncs 50% more data, and with that I think we are in very good shape)

With #5773, the performance of update is substantially improved:

| command | before #5773 | after #5773 |
| --- | --- | --- |
| dvc import-url --to-remote | 3m48,789s | 2m10,561s |
| dvc update --to-remote | 9m16,278s | 2m49,857s |

A few future steps:

  • Benefit from the verify option (#5860)
  • Reduce the code duplication between the different code paths (add --to-remote / import-url --to-remote / update --to-remote) by implementing a common out.transfer() method (#5861); a rough sketch of such an interface follows below
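For illustration only, a shared entry point along these lines could look like the sketch below; it is not the actual out.transfer() from #5861, and hash_url_stream, odb_contains, and upload_url_to_odb are placeholders:

```python
def transfer(src_url, odb, jobs=None):
    """Move the data at `src_url` into `odb` (a cache or a remote) and return
    the resulting hash, so that add/import-url/update --to-remote can all
    share this single code path instead of duplicating it."""
    obj_hash = hash_url_stream(src_url)
    if not odb_contains(odb, obj_hash):
        upload_url_to_odb(src_url, odb, obj_hash, jobs=jobs)
    return obj_hash
```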

Closing this issue since its initial goals (and some extras) are done, and porting the dvc/remote logic to the new odb arch will definitely have its own separate issue.

@isidentical Though please feel free to send a WIP if you feel like it; it will help the discussion, and maybe we are on the same page already 😉

@isidentical Oh, that’s great! It will come in handy in the next steps for odb; let’s discuss them during our next meeting. We’ll need to port the old dvc/remote logic to the new odb arch anyway, and all of those optimizations should automatically kick in both for push/pull/status/etc. and for --to-remote/to-cache.

(update syncs 50% more data, and with that I think we are in very good shape)

For the record: there are a few more tricks up our sleeves, e.g. not only using verify but also leveraging indexes that we have. So it will be cut even more. 😉

I’m not following why an extra download is required if the file doesn’t exist.

To find only the missing files, we need to calculate the hashes and figure out which of those hashes are not present on the remote. Unfortunately, we cannot calculate the hashes without downloading all the files first. So instead of wasting an upload on a file that might already exist, we initially download everything (we don’t actually save anything anywhere, we just stream the data to calculate the hashes, but it costs the same time), find the missing hashes, and then apply the normal procedure (this time downloading in chunks as usual and uploading those chunks to the remote).

import-url --to-remote => download [from url] -> md5 -> upload [to remote storage]
update --to-remote => download [from url] -> md5 -> download [from url, only the files missing from the remote storage] -> upload [to remote storage]
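For reference, the first “download” in these flows only streams the data to compute the hash, without writing anything to disk. A minimal standalone example of that idea (using requests and hashlib; DVC’s own hashing code differs):

```python
import hashlib
import requests

def hash_url_stream(url, chunk_size=1024 * 1024):
    """Stream the file at `url` and return its md5 without writing it to disk."""
    md5 = hashlib.md5()
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=chunk_size):
            md5.update(chunk)
    return md5.hexdigest()
```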