datacube-core: Remove geo-registration fallback when loading rasters with missing spatial metadata
Introduction
When loading data from a file with missing geo-registration metadata, datacube does not fail; instead it silently substitutes the missing metadata with metadata recorded in the Dataset object.
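A rough sketch of that behaviour (illustrative only; the function and attribute names here are not the actual datacube-core API):

```python
import rasterio

def read_band(path, band_index, dataset):
    # Illustrative sketch of the fallback, NOT the real datacube-core code.
    with rasterio.open(path) as src:
        crs, transform = src.crs, src.transform
        if crs is None or transform.is_identity:
            # File carries no usable geo-registration: silently substitute
            # the CRS recorded on the Dataset document and a transform
            # reconstructed from its bounding box (hypothetical helper).
            crs = dataset.crs
            transform = transform_from_dataset_bounds(dataset, src.shape)
        return src.read(band_index), crs, transform
```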
Problems
- It’s a really niche feature with very little use, yet the costs are high
- Significantly increases unit test burden
- Not often used (i.e. tested) in production
- Complicates IO driver interface (need to supply fallback CRS/transform)
- Spatial metadata is recorded per Dataset, not per band within a dataset, assuming all bands share the same geo-registration, which is not the case for Landsat, for example.
- We don’t even store enough metadata per Dataset to support the fallback: the transform is not recorded, we compute it from the Dataset bounding box recorded in the DB and the shape of the image file being loaded (shape is not recorded in the DB, yeah I know!!); see the sketch after this list.
- Performance costs, particularly in a distributed environment
- Rather than just sending filename + band index we need to send the entire Dataset object to the IO worker, to support a case that almost never happens.
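The transform reconstruction mentioned above can only happen once the file is open, roughly like this (a simplified sketch using rasterio; the bounding box and shape values are made up):

```python
from rasterio.transform import from_bounds

# Bounding box recorded against the Dataset in the DB (hypothetical values,
# in the Dataset CRS): left, bottom, right, top.
left, bottom, right, top = 1_500_000.0, -4_000_000.0, 1_600_000.0, -3_900_000.0

# Shape of the image file being loaded; not recorded in the DB, so the file
# must be opened before the fallback transform can even be computed.
height, width = 4000, 4000

fallback_transform = from_bounds(left, bottom, right, top, width, height)
print(fallback_transform)
```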
Proposal
Get rid of this “feature” and instead fail when encountering files that do not have geo-spatial information available.
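A minimal sketch of the proposed fail-fast behaviour (names are illustrative, not taken from the codebase):

```python
import rasterio

def read_band(path, band_index):
    with rasterio.open(path) as src:
        if src.crs is None or src.transform.is_identity:
            # No geo-registration in the file itself: fail loudly instead of
            # guessing from the Dataset document.
            raise ValueError(
                f"{path} has no CRS/transform; fix the file or use an IO "
                "driver that can read its geo-registration"
            )
        return src.read(band_index), src.crs, src.transform
```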
About this issue
- Original URL
- State: open
- Created 5 years ago
- Comments: 19 (17 by maintainers)
Commits related to this issue
- Add deprecation warning when using override functionality (#673) — committed to opendatacube/datacube-core by Kirill888 4 years ago
Thanks for the write up. I think we can definitely remove the code for doing this.
It sounds like a huge maintenance burden for minimal or no benefit.
This “feature” was originally added to support some non-gdal-compliant datasets, but I can’t even remember which datasets, and who it was for.
We’ve taken the approach elsewhere that data added to ODC should be readable by GDAL and regularly gridded, and this decision would be in line with that.
@Kirill888 All good, we understand one another and ODC will be better for it.
FYI, I like the way you think about things, your code is great, and you work fast. If you see me jumping on your comments, it’s not jumping on you, just trying to get across it before you’ve completely implemented it! 😉
I agree with what you’re saying here, @Kirill888.
I’ve been inspired by what Vladimir Agafonkin says about Leaflet. Last year he was asked what future developments Leaflet needs, and he basically said less features! https://dev.to/tomchadwin/comment/4158
Keep it simple…
@woodcockr I have asked @simonaoliver about the historical context for this. It seems this was added to support some NetCDF files from BoM, and maybe also MODIS. My understanding is that these files have geospatial metadata in them, but were not, at the time, readable by GDAL; this may or may not have changed since. If you have examples of production data that rely on this feature, I’m all ears, hence this issue. Better still, I would love a PR with unit tests and example data.
Use is not measured, but the majority of products on NCI right now are ingested products that have a valid CRS and transform embedded in them. The same goes for data from USGS and Sentinel.
I do think that one is better off fixing the data to be readable by GDAL instead of relying on this “hack”; there are many options:
- an .aux file with geospatial data understood by GDAL
- in the case of NetCDF: write a driver that uses the NetCDF library to parse/load the data, and don’t rely on GDAL if it has poor support for your files
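For example, a GeoTIFF whose geo-registration was lost can have it stamped back in place with rasterio (a sketch; the path, CRS and bounds are made up, and some formats will instead persist this via a GDAL .aux.xml sidecar):

```python
import rasterio
from rasterio.crs import CRS
from rasterio.transform import from_bounds

path = "scene_band1.tif"  # hypothetical file missing geo-registration

with rasterio.open(path, "r+") as dst:
    dst.crs = CRS.from_epsg(32655)  # made-up UTM zone
    dst.transform = from_bounds(
        1_500_000, -4_000_000, 1_600_000, -3_900_000, dst.width, dst.height
    )
```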
And it is a “hack”, because the spatial data stored in the Dataset yaml is only an approximation of what is really needed to implement the fallback; footprint + CRS is not enough, since it assumes that the Dataset footprint equals every band footprint.

Regarding your concern with respect to “filename + band index”: the new IO driver design has provisions for propagating driver-specific information from the DB to the loading code; we are just being more precise about information flows. The current approach of “just pass around this giant Dataset object everywhere because it’s convenient” has serious limitations. We already face RAM issues with some of our product generation on NCI at the metadata processing stage; now scale that up some more for a whole-of-Africa cube.

This whole feature is about the behaviour of the default rasterio-based driver; other IO drivers can implement whatever geospatial fallback mechanisms they desire. As it stands, everybody is paying performance costs for a feature that only a few use, and there are better and more correct ways to address those use cases anyway.