xarray: Let's list all the netCDF files that xarray can't open
At the Pangeo developers meetings, I am hearing lots of reports from folks like @dopplershift and @rsignell-usgs about netCDF datasets that xarray can’t open.
My expectation is that xarray doesn’t have strong requirements on the contents of datasets. (It doesn’t “enforce” CF compatibility, for example; that’s optional.) Anything that can be written to netCDF should be readable by xarray.
I would like to collect examples of places where xarray fails. So far, I am only aware of one:
- Self-referential multidimensional coordinates (#2233): datasets that contain variables like siglay(siglay, node). Only siglay(siglay) would work.
Are there other distinct cases?
Please provide links to, or sample code for, netCDF datasets that xarray can’t read. Even better would be short code snippets that create such datasets in Python using the netCDF4 interface.
About this issue
- State: closed
- Created 6 years ago
- Comments: 32 (22 by maintainers)
Currently, xarray requires that variables with a name matching a dimension are 1D variables along that dimension, e.g.,
I agree that this unnecessarily complicates our data model. There’s no particular advantage to this invariant, besides removing the need to check the dimensions of variables used for indexing lookups. I’m sure there are some cases internally where we currently rely on this assumption, but it should be relatively easy to relax.
@benbovy will the explicit indexes refactor fix this case?
@djhoese For anything to do with opening netCDF files with groups see #4118 and the linked issues from there.
If people have examples of other weird cases involving groups (like groups within themselves or anything like that) then I would be interested to have those files to test with!
@rabernat While I agree that they’re (somewhat) confusing files, I think you’re missing two things:
- netCDF doesn’t enforce naming on dimensions and variables. Full stop. The only naming netCDF will care about is any conflict with an internal reserved name (I’m not sure that those even exist for anything besides attributes.) IMO that’s a good thing, but more importantly it’s not the netCDF library’s job to enforce any of it.
- CF is an attribute convention. This also means that the conventions say absolutely nothing about naming of variables and dimensions.
IMO, xarray is being overly pedantic here. XArray states that it adopts the Common Data Model (CDM); netCDF-java and the CDM were the tools used to generate the failing examples above.
I ran into this problem too, a long time ago (see #457). Back then, the workaround we implemented was to exclude the offending variable (“siglay” or “isobaric” in the examples above) with the optional “drop_variables” argument. Of course, this is not great if you actually want to use the values in the variable you are dropping.
I personally don’t like the notion of a “two-dimensional coordinate”; I find it confusing. However, netCDF files of this kind are common, so fully supporting them in xarray would be nice. I don’t know how, though. Maybe the variable could be renamed instead of dropped, with something like a “rename_variables” argument? That is the only thing that comes to mind.
@TomNicholas yes with the explicit index refactor we should be able to relax the 1D coordinate / dimension matching name constraint in the Xarray data model.
I also initially thought it would be easy to relax, but I’m not so sure anymore. I don’t think it is a hard task, but it might still require a fair amount of work. I’ve already refactored a bunch of such internal cases in #5692, but there’s a good chance that some (not sure how many) cases will still need a fix.
Perhaps part of the confusion is simply that y has different meanings in different contexts. When used as a dimension (e.g. to “define the array shape of a Variable” in CDM terms), it is indeed 1D. When used as a variable (or “CoordinateAxis”), it is 2D. XArray doesn’t have a separate namespace for dimensions and variables.
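To make the namespace point concrete, here is a toy sketch in plain Python of a container that keeps dimensions and variables in separate namespaces. The names Container, Variable, and add_variable are hypothetical and are not the API of xarray, netCDF, or the CDM; the sketch only shows that the two meanings of y can coexist once the namespaces are separate.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    dims: tuple   # names of the dimensions that define the array shape
    data: list    # nested lists standing in for an N-D array


@dataclass
class Container:
    dimensions: dict = field(default_factory=dict)  # name -> length
    variables: dict = field(default_factory=dict)   # name -> Variable

    def add_variable(self, name, dims, data):
        # Only the dimension namespace is consulted here, so a variable may
        # reuse a dimension's name without being 1D along that dimension.
        for dim in dims:
            if dim not in self.dimensions:
                raise KeyError(f"unknown dimension {dim!r}")
        self.variables[name] = Variable(tuple(dims), data)


c = Container(dimensions={"y": 2, "x": 3})
c.add_variable("y", ("y", "x"), [[10, 11, 12], [20, 21, 22]])

print(c.dimensions["y"])      # y as a dimension: just a length
print(c.variables["y"].dims)  # y as a variable: defined over (y, x)
```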