datacube-core: Remove geo-registration fallback when loading rasters with missing spatial metadata

Introduction

When loading data from a file with missing geo-registration metadata, datacube does not fail; instead it silently substitutes the missing metadata with metadata recorded in the Dataset object.
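
A minimal sketch of the behaviour in question, assuming a rasterio-based reader; the function name and structure are illustrative, not datacube’s actual internals:

```python
import rasterio
from affine import Affine

def load_band(path, band, fallback_crs=None, fallback_transform=None):
    # Illustrative sketch of the fallback described above, NOT datacube's code.
    with rasterio.open(path) as src:
        data = src.read(band)
        crs = src.crs
        transform = src.transform

    # rasterio reports a missing geotransform as the identity Affine
    if crs is None:
        crs = fallback_crs              # silently substituted from the Dataset
    if transform == Affine.identity():
        transform = fallback_transform  # silently substituted from the Dataset
    return data, crs, transform
```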

Problems

  1. It’s a really niche feature with very little use, yet the costs are high:
    • Significantly increases the unit-test burden
    • Not often used (i.e. tested) in production
    • Complicates the IO driver interface (a fallback CRS/transform must be supplied)
  2. Spatial metadata is recorded per Dataset, not per band within a dataset, assuming all bands share the same geo-registration, which is not the case for Landsat, for example.
  3. We don’t even store enough metadata per Dataset to support the fallback: the transform is not recorded, so we compute it from the Dataset bounding box recorded in the DB and the shape of the image file being loaded (the shape is not recorded in the DB, yeah I know!!). A sketch of that computation follows this list.
  4. Performance costs, particularly in a distributed environment
    • Rather than just sending a filename + band index, we need to send the entire Dataset object to the IO worker to support a case that almost never happens.
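
To make problem 3 concrete, the fallback has to derive the transform roughly as below; the values are illustrative, and rasterio’s `from_bounds` stands in for the actual computation:

```python
from rasterio.transform import from_bounds

# Dataset bounding box as recorded in the DB (illustrative values)
west, south, east, north = 140.0, -36.0, 141.0, -35.0

# Raster shape: NOT recorded in the DB, only known once the file is opened
width, height = 4000, 4000

# Essentially the computation the fallback must perform; it is only
# correct if every band's footprint matches the dataset footprint and
# the grid is regular and north-up.
transform = from_bounds(west, south, east, north, width, height)
```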

Proposal

Get rid of this “feature” and instead fail when encountering files that do not have geospatial information available.
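
A minimal sketch of the proposed behaviour, again assuming the default rasterio-based reader; the function name is illustrative:

```python
import rasterio
from affine import Affine

def open_georegistered(path):
    # Sketch of the proposal: refuse to guess, fail loudly instead.
    src = rasterio.open(path)
    if src.crs is None or src.transform == Affine.identity():
        src.close()
        raise ValueError(f"{path}: file has no geo-registration metadata")
    return src
```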

About this issue

  • Original URL
  • State: open
  • Created 5 years ago
  • Comments: 19 (17 by maintainers)

Most upvoted comments

Thanks for the write-up. I think we can definitely remove the code for doing this.

It sounds like a huge maintenance burden for minimal or no benefit.

This “feature” was originally added to support some non-GDAL-compliant datasets, but I can’t even remember which datasets, or who it was for.

We’ve taken the approach elsewhere that data added to ODC should be readable by GDAL and regularly gridded, and this decision would be in line with that.

@Kirill888 All good, we understand one another and ODC will be better for it.

FYI, I like the way you think about things, and your code is great, and you work fast. If you see me jumping on your comments, it’s not jumping on you, just me trying to get across it before you’ve completely implemented it! 😉

I agree with what you’re saying here, @Kirill888.

I’ve been inspired by what Vladimir Agafonkin says about Leaflet. Last year he was asked what future developments Leaflet needs, and he basically said less features! https://dev.to/tomchadwin/comment/4158

Keep it simple…

@woodcockr I have asked @simonaoliver about the historical context for this. It seems this was added to support some NetCDF files from BoM, and maybe also MODIS. My understanding is that these files have geospatial metadata in them, but were not, at the time, readable by GDAL; this may or may not have changed since. If you have examples of production data that rely on this feature, I’m all ears, hence this issue. Better still, I would love a PR with unit tests and example data.

Use is not measured, but the majority of products on NCI right now are ingested products that have a valid CRS and transform embedded in them. The same goes for data from USGS and Sentinel.

I do think that one is better off fixing the data to be readable by GDAL instead of relying on this “hack”; there are many options:

  1. Re-encode to a better-supported format
  2. Add an .aux file with geospatial data understood by GDAL
  3. Create a VRT with fixed spatial data (sketched below)
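
As one illustration of option 3, GDAL’s Python bindings can produce such a VRT; the filenames and coordinates below are placeholders:

```python
from osgeo import gdal

# Wrap the problem file in a VRT that carries the correct CRS and bounds.
gdal.Translate(
    "fixed.vrt",
    "raw_input.nc",
    format="VRT",
    outputSRS="EPSG:4326",
    outputBounds=[140.0, -35.0, 141.0, -36.0],  # ulx, uly, lrx, lry
)
```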

In the case of NetCDF: write a driver that uses the NetCDF library to parse/load the data, and don’t rely on GDAL if it has poor support for your files (see the sketch below). And it is a “hack”, because the spatial data stored in the Dataset yaml is only an approximation of what is really needed to implement the fallback: footprint + CRS is not enough, since this assumes that the Dataset footprint equals every band’s footprint.
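
A sketch of that GDAL-free read path using the netCDF4 library; the variable and attribute names are illustrative, since real files vary:

```python
import netCDF4

# Read the pixels and the geospatial attributes straight from the file,
# without going through GDAL at all.
with netCDF4.Dataset("raw_input.nc") as nc:
    band = nc.variables["band_1"][:]               # pixel data
    crs_wkt = nc.variables["crs"].spatial_ref      # grid-mapping WKT
    x = nc.variables["x"][:]                       # coordinate axes
    y = nc.variables["y"][:]
```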

Regarding your concern with respect to “filename + band index”: the new IO driver design has provisions for propagating driver-specific information from the DB to the loading code; we are just being more precise about information flows. The current approach of “just pass around this giant Dataset object everywhere because it’s convenient” has serious limitations. We already face RAM issues with some of our product generation on NCI at the metadata-processing stage; now scale that up some more for a whole-of-Africa cube. A hypothetical sketch of a leaner payload follows.
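
As a hypothetical illustration of that leaner payload (the class name and fields are invented here, not part of the actual driver design):

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class BandLoadSpec:
    """Everything an IO worker needs to load one band, resolved from the
    DB up front, instead of shipping the whole Dataset object around."""
    uri: str                 # where the file lives
    band: int                # which band to read
    driver_data: Any = None  # opaque, driver-specific extras from the DB
```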

This whole feature is about the behaviour of the default rasterio-based driver; other IO drivers can implement whatever geospatial fallback mechanisms they desire. As it stands, everybody is paying performance costs for a feature that only a few use, and there are better and more correct ways to address those use cases anyway.