delta-rs: Not able to access Azure Delta Lake
Discussed in https://github.com/delta-io/delta-rs/discussions/599
Originally posted by ganesh-gawande on May 9, 2022:

Hi,

I am following the documentation at https://github.com/delta-io/delta-rs/blob/main/docs/ADLSGen2-HOWTO.md. I have tried many variants of the path, but I am not able to access the Delta table.

Depending on the path, I receive one of the following errors: `Not a Delta table: No snapshot or version 0 found` or `Invalid object URI`.
Here are the paths I have tried in my code; none of them works.
delta = DeltaTable("adls2://{ContainerName}@{StorageAccountName}.dfs.core.windows.net")
delta = DeltaTable("adls2://{StorageAccountName}/{ContainerName}/{Folder1}/{Folder2}/{FileName}.parquet")
delta = DeltaTable("adls2://{StorageAccountName}/{ContainerName}/{DeltaTableNameFromDatabricks}")
delta = DeltaTable("adls2://{StorageAccountName}/{ContainerName}/")
delta = DeltaTable("adls2://{ContainerName}@{StorageAccountName}.dfs.core.windows.net/{ContainerName}/{DeltaTableNameFromDatabricks}")
delta = DeltaTable("abfss://{ContainerName}@{StorageAccountName}.dfs.core.windows.net/{ContainerName}/{DeltaTableNameFromDatabricks}")
delta = DeltaTable("abfss://{ContainerName}@{StorageAccountName}.dfs.core.windows.net/{ContainerName}/")
delta = DeltaTable("abfss://{ContainerName}@{StorageAccountName}.dfs.core.windows.net/")
About this issue
- State: closed
- Created: May 9, 2022
- Comments: 58
@roeap - I confirmed that the issue reported above is resolved in version 0.7.0. I am able to connect to the Azure Storage account after changing the path to `az://{containerName}/path` and passing the `storage_options` parameter.
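A minimal sketch of that working setup with the Python `deltalake` bindings; the path, account name, and key below are placeholders, and the exact `storage_options` keys depend on your authentication method:

```python
from deltalake import DeltaTable

# Path format that works from 0.7.0 onward: az://{container}/{path-to-table}.
# The storage_options keys below are one common variant; adjust them for
# your auth method (account key, SAS token, or service principal).
table = DeltaTable(
    "az://my-container/path/to/table",           # placeholder path
    storage_options={
        "account_name": "mystorageaccount",      # placeholder
        "account_key": "<storage-account-key>",  # placeholder
    },
)
print(table.version())    # latest version of the table
print(table.files()[:5])  # a few data files from the current snapshot
```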
I am using the release version, which I installed via pip. @roeap - alright, I will be awaiting this feature in the next release, I suppose.
@ganesh-gawande - so the path you should be using is `adls2://{StorageAccountName}/{ContainerName}/`. After #603 is merged, `adls2://{StorageAccountName}/{ContainerName}` should also work.

However, I also tried loading a delta log with the initial commit files removed, which only works if there is a `_last_checkpoint` file present. When that file is missing, we see the exact error message you encountered.

@wjones127 @houqp - I do remember the protocol explicitly mentioning a lexicographical sort for working with the log. Should we implement that logic, or first make sure that delta actually needs to support finding checkpoints without that file? Or are we already sure?

I guess the core logic for loading a specific version can already largely be reused. Likely we would also want to mirror the logic in our writers to create a checkpoint every ten commits.
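To make the two ideas above concrete, here is an illustrative sketch (not delta-rs internals): commit files are zero-padded so a lexicographic sort of `_delta_log` yields the latest version, and a writer could checkpoint on every tenth commit:

```python
import os
import re

# Commit files are named with the version zero-padded to 20 digits, so
# lexicographic order and numeric order agree; that is the property the
# protocol's "lexicographical sort" remark relies on.
COMMIT_RE = re.compile(r"^(\d{20})\.json$")

def latest_version(delta_log_dir: str) -> int:
    """Resolve the newest commit version by listing and sorting _delta_log."""
    commits = sorted(f for f in os.listdir(delta_log_dir) if COMMIT_RE.match(f))
    if not commits:
        raise ValueError(f"no commit files found in {delta_log_dir}")
    return int(COMMIT_RE.match(commits[-1]).group(1))

def should_checkpoint(version: int) -> bool:
    # The every-ten-commits cadence mentioned above.
    return version > 0 and version % 10 == 0
```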
@roeap Actually, the first delta entry is not guaranteed to exist. See my update in https://github.com/delta-io/delta/pull/913
Not sure if we are testing that in this repo though.
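A hypothetical shape for such a test; the fixture name and table layout are made up for illustration:

```python
import os
from deltalake import DeltaTable

def test_load_with_missing_early_commits(table_path: str) -> None:
    # table_path: a hypothetical fixture pointing at a local table with
    # more than ten commits, so a checkpoint and _last_checkpoint exist.
    log_dir = os.path.join(table_path, "_delta_log")
    os.remove(os.path.join(log_dir, "00000000000000000000.json"))

    # Loading should still succeed by starting from the checkpoint
    # rather than replaying the log from version 0.
    table = DeltaTable(table_path)
    assert table.version() >= 10
```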
Hmm, strange… this seems like a corruption in the delta log to me. When Databricks creates a checkpoint, it should also create a `_last_checkpoint` file. The Rust implementation relies on either identifying the latest checkpoint via that file or starting from the beginning.

One way to load the table could be to use the `load_version` function, i.e. `table.load_version(85996)`. Looking at the delta specification really quickly, it seems to me this scenario (i.e. parts of the log missing) is not something a reader needs to support, but one it would be resilient to if the last checkpoint file exists.

If you use the `load_version` command mentioned above, we search for the closest checkpoint with a lower or equal version, and that "should" work. So it should work with any version higher than that checkpoint's version. The reasoning for all that logic is tables exactly like yours, where listing a directory with tens of thousands of files becomes prohibitively expensive…

I'd be interested to know whether Databricks is able to load that table without specifying a specific version.
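A quick sketch of that workaround with the Python bindings; the table URI and credentials are placeholders, and `85996` is the version discussed in this thread:

```python
import json
from deltalake import DeltaTable

# Either pin the version at construction time, so the reader searches for
# the closest checkpoint with version <= 85996 instead of the latest...
table = DeltaTable(
    "az://my-container/path/to/table",  # placeholder URI
    version=85996,
    storage_options={"account_name": "...", "account_key": "..."},  # placeholders
)

# ...or jump there on an existing handle, as suggested above.
table.load_version(85996)

# For a local table, _last_checkpoint is a single-line JSON file that
# points the reader at the latest checkpoint, e.g. {"version": 85990}.
with open("_delta_log/_last_checkpoint") as f:
    print(json.load(f))
```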