dvc: add --external: fails using Azure remote

Bug Report

Description

I am trying to track existing data from a storage account in Azure, following the current documentation.

Reproduce

  1. dvc init
  2. dvc remote add azcore azure://core-container
  3. dvc remote add azdata azure://data-container
  4. dvc add --external remote://azdata/existing-data
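For reference, a minimal sketch of the .dvc/config that steps 2-3 produce (container URLs taken from the commands above; account name and credentials are omitted here but would normally be configured as well):

['remote "azcore"']
    url = azure://core-container
['remote "azdata"']
    url = azure://data-container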

Expected

I’m not sure exactly what is expected, but the actual output is:

ERROR: unexpected error - : 'azure'

Environment information

Output of dvc doctor:

DVC version: 2.38.1 (pip)
---------------------------------
Platform: Python 3.9.6 on macOS-13.1-x86_64-i386-64bit
Subprojects:
	dvc_data = 0.28.4
	dvc_objects = 0.14.0
	dvc_render = 0.0.15
	dvc_task = 0.1.8
	dvclive = 1.3.1
	scmrepo = 0.1.4
Supports:
	azure (adlfs = 2022.11.2, knack = 0.10.1, azure-identity = 1.12.0),
	http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5s1
Caches: local
Remotes: azure, azure
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc, git

Additional Information:

2023-01-04 18:58:46,616 ERROR: unexpected error - : 'azure'
------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/odbmgr.py", line 65, in __getattr__
    return self._odb[name]
KeyError: 'azure'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/cli/__init__.py", line 185, in main
    ret = cmd.do_run()
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/cli/command.py", line 22, in do_run
    return self.run()
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/commands/add.py", line 53, in run
    self.repo.add(
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/utils/collections.py", line 164, in inner
    result = func(*ba.args, **ba.kwargs)
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/repo/__init__.py", line 48, in wrapper
    return f(repo, *args, **kwargs)
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/repo/scm_context.py", line 156, in run
    return method(repo, *args, **kw)
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/repo/add.py", line 190, in add
    stage.save(merge_versioned=True)
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/stage/__init__.py", line 469, in save
    self.save_outs(
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/stage/__init__.py", line 512, in save_outs
    out.save()
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/output.py", line 643, in save
    self.odb,
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/output.py", line 450, in odb
    odb = getattr(self.repo.odb, odb_name)
  File "/Users/rmllopes/dev/auto-document-validation-ai/.venv/lib/python3.9/site-packages/dvc/odbmgr.py", line 67, in __getattr__
    raise AttributeError from exc
AttributeError
------------------------------------------------------------
2023-01-04 18:58:46,711 DEBUG: Version info for developers:
DVC version: 2.38.1 (pip)
---------------------------------
Platform: Python 3.9.6 on macOS-13.1-x86_64-i386-64bit
Subprojects:
	dvc_data = 0.28.4
	dvc_objects = 0.14.0
	dvc_render = 0.0.15
	dvc_task = 0.1.8
	dvclive = 1.3.1
	scmrepo = 0.1.4
Supports:
	azure (adlfs = 2022.11.2, knack = 0.10.1, azure-identity = 1.12.0),
	http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: azure, azure
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc, git

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2023-01-04 18:58:46,714 DEBUG: Analytics is enabled.
2023-01-04 18:58:46,911 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/var/folders/st/05s6bkj55r9cw3hbrrdfvfqh0000gp/T/tmpoxhcmxev']'
2023-01-04 18:58:46,913 DEBUG: Spawned '['daemon', '-q', 'analytics', '/var/folders/st/05s6bkj55r9cw3hbrrdfvfqh0000gp/T/tmpoxhcmxev']'
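For context, the traceback boils down to a failed scheme lookup in dvc/odbmgr.py: the ODB (object database, i.e. cache) manager only has entries for registered schemes, and nothing was ever registered for "azure". A minimal sketch of that pattern (the class body is illustrative; only the __getattr__ shape is taken from the traceback):

class ODBManager:
    def __init__(self):
        # only registered schemes get an entry; there is no "azure" key
        self._odb = {"local": object()}  # placeholder for the local ODB

    def __getattr__(self, name):
        # called only when normal attribute lookup fails
        try:
            return self._odb[name]  # KeyError: 'azure'
        except KeyError as exc:
            raise AttributeError from exc

So getattr(self.repo.odb, "azure") in dvc/output.py surfaces as the bare AttributeError that the CLI reports as "unexpected error - : 'azure'".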

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 24 (13 by maintainers)

Most upvoted comments

Thanks for the input, guys. It helped me settle on the pipeline we will use and work around some of the limitations of using Azure. I think it makes sense to keep this issue open as a feature request, but of course I’ll leave that to your discretion.

No, @rmlopes. Thanks, glad to see that we’ve settled on something after all 😃

it looks like it always downloads all the files when we do the update (I would expect it to only download what is new). Because of this, even if there are no new files, stage 1 will always run after an update.

I think you can overcome this by introducing an extra stage that lists the files into a “list.txt”, and making the stage that downloads them locally depend on this list. If the storage is append-only and immutable, I would even prefer this approach over import-url, since it should be faster.
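As a sketch of that idea in dvc.yaml (assuming the az CLI is configured for the storage account, and download.py is a hypothetical script that fetches the listed blobs):

stages:
  list_remote:
    # re-listing is cheap, so force this stage to run on every repro
    cmd: az storage blob list --container-name data-container --query "[].name" -o tsv > list.txt
    outs:
      - list.txt:
          cache: false
    always_changed: true
  download:
    cmd: python download.py list.txt data/
    deps:
      - list.txt
    outs:
      - data/

Here always_changed makes the listing stage rerun on every dvc repro, but the download stage only reruns when list.txt actually changes, i.e. when new files appear.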

The bug of import-url downloading everything every time should still be fixed, though (cc @dberenbaum).

Would a functional integration with cloud versioning solve the problem in this case?

No, as far as I understand it won’t help. Let’s imagine multiple people want to train something simultaneously: since Azure expects a specific layout in the folder and it’s the same single folder, I can’t see how two different splits could be made simultaneously in the same location. A better way would be to make the output folder on Azure a parameter (you can use dvc.yaml templating to substitute it with a value when the pipeline runs) that each person specifies in params.yaml. You should then be prepared to end up with many folders on Azure storage with different splits (they can be removed after training is done, by the way).

Let me know if that makes sense, I can explain better or show an example if needed.
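For example, a minimal sketch of the templating approach (split_dir and split.py are hypothetical names):

# params.yaml
split_dir: azure://data-container/splits/alice-run-1

# dvc.yaml
stages:
  make_split:
    cmd: python split.py --out ${split_dir}
    params:
      - split_dir

Each person sets their own split_dir in params.yaml, so simultaneous runs write to different folders in the container.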

Hopefully I was able to clarify everything and not make it worse 😃

Yes, it’s clear now what’s going on. Thanks 🙏

@dberenbaum yes, enabling blob versioning is a possibility (I don’t see anything against it)

I looked into the docs and didn’t find Azure support for external data. It looks like we haven’t implemented it yet. So it’s a new feature request instead of a bug? @dberenbaum