xarray: Let's list all the netCDF files that xarray can't open

At the Pangeo developers meetings, I am hearing lots of reports from folks like @dopplershift and @rsignell-usgs about netCDF datasets that xarray can’t open.

My expectation is that xarray doesn’t have strong requirements on the contents of datasets. (It doesn’t “enforce” CF compliance, for example; that’s optional.) Anything that can be written to netCDF should be readable by xarray.

I would like to collect examples of places where xarray fails. So far, I am only aware of one:

  • Self-referential multidimensional coordinates (#2233). Datasets which contain variables like siglay(siglay, node). Only siglay(siglay) would work.

Are there other distinct cases?

Please provide links to, or sample code for, netCDF datasets that xarray can’t read. Even better would be short code snippets that create such datasets in Python using the netCDF4 interface.

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 32 (22 by maintainers)

Most upvoted comments

import xarray as xr
xr.open_dataset('http://thredds.ucar.edu/thredds/dodsC/grib/NCEP/GFS/Global_0p5deg/TwoD')
---------------------------------------------------------------------------
MissingDimensionsError                    Traceback (most recent call last)
<ipython-input-6-e2a87d803d99> in <module>()
----> 1 xr.open_dataset(gfs_cat.datasets[0].access_urls['OPENDAP'])

~/miniconda3/envs/py36/lib/python3.6/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, group, decode_cf, mask_and_scale, decode_times, autoclose, concat_characters, decode_coords, engine, chunks, lock, cache, drop_variables, backend_kwargs)
    344             lock = _default_lock(filename_or_obj, engine)
    345         with close_on_error(store):
--> 346             return maybe_decode_store(store, lock)
    347     else:
    348         if engine is not None and engine != 'scipy':

~/miniconda3/envs/py36/lib/python3.6/site-packages/xarray/backends/api.py in maybe_decode_store(store, lock)
    256             store, mask_and_scale=mask_and_scale, decode_times=decode_times,
    257             concat_characters=concat_characters, decode_coords=decode_coords,
--> 258             drop_variables=drop_variables)
    259 
    260         _protect_dataset_variables_inplace(ds, cache)

~/miniconda3/envs/py36/lib/python3.6/site-packages/xarray/conventions.py in decode_cf(obj, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables)
    428         vars, attrs, concat_characters, mask_and_scale, decode_times,
    429         decode_coords, drop_variables=drop_variables)
--> 430     ds = Dataset(vars, attrs=attrs)
    431     ds = ds.set_coords(coord_names.union(extra_coords).intersection(vars))
    432     ds._file_obj = file_obj

~/miniconda3/envs/py36/lib/python3.6/site-packages/xarray/core/dataset.py in __init__(self, data_vars, coords, attrs, compat)
    363             coords = {}
    364         if data_vars is not None or coords is not None:
--> 365             self._set_init_vars_and_dims(data_vars, coords, compat)
    366         if attrs is not None:
    367             self.attrs = attrs

~/miniconda3/envs/py36/lib/python3.6/site-packages/xarray/core/dataset.py in _set_init_vars_and_dims(self, data_vars, coords, compat)
    381 
    382         variables, coord_names, dims = merge_data_and_coords(
--> 383             data_vars, coords, compat=compat)
    384 
    385         self._variables = variables

~/miniconda3/envs/py36/lib/python3.6/site-packages/xarray/core/merge.py in merge_data_and_coords(data, coords, compat, join)
    363     indexes = dict(extract_indexes(coords))
    364     return merge_core(objs, compat, join, explicit_coords=explicit_coords,
--> 365                       indexes=indexes)
    366 
    367 

~/miniconda3/envs/py36/lib/python3.6/site-packages/xarray/core/merge.py in merge_core(objs, compat, join, priority_arg, explicit_coords, indexes)
    433     coerced = coerce_pandas_values(objs)
    434     aligned = deep_align(coerced, join=join, copy=False, indexes=indexes)
--> 435     expanded = expand_variable_dicts(aligned)
    436 
    437     coord_names, noncoord_names = determine_coords(coerced)

~/miniconda3/envs/py36/lib/python3.6/site-packages/xarray/core/merge.py in expand_variable_dicts(list_of_variable_dicts)
    209                     var_dicts.append(coords)
    210 
--> 211                 var = as_variable(var, name=name)
    212                 sanitized_vars[name] = var
    213 

~/miniconda3/envs/py36/lib/python3.6/site-packages/xarray/core/variable.py in as_variable(obj, name)
    112                 'dimensions %r. xarray disallows such variables because they '
    113                 'conflict with the coordinates used to label '
--> 114                 'dimensions.' % (name, obj.dims))
    115         obj = obj.to_index_variable()
    116 

MissingDimensionsError: 'time' has more than 1-dimension and the same name as one of its dimensions ('reftime', 'time'). xarray disallows such variables because they conflict with the coordinates used to label dimensions.

Currently, xarray requires that variables with a name matching a dimension are 1D variables along that dimension, e.g.,

for dim in dataset.dims:
    if dim in dataset.variables:
        assert dataset.variables[dim].dims == (dim,)

I agree that this unnecessarily complicates our data model. There’s no particular advantage to this invariant, besides removing the need to check the dimensions of variables used for indexing lookups. I’m sure there are some cases internally where we currently rely on this assumption, but it should be relatively easy to relax.
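To check a file's layout against this invariant without involving xarray at all, the same test can be written over plain dimension names. This is only a sketch: `find_conflicting_variables` is a hypothetical helper, not part of xarray, and the layout below is invented to mimic the siglay case.

```python
def find_conflicting_variables(dims, variable_dims):
    """Return variables that share a dimension's name but are not 1D along it."""
    return sorted(
        name
        for name in dims
        if name in variable_dims and tuple(variable_dims[name]) != (name,)
    )


# Made-up layout: "siglay" is 2D, "time" is an ordinary 1D coordinate.
dims = ["siglay", "node", "time"]
variable_dims = {
    "siglay": ("siglay", "node"),
    "time": ("time",),
    "temp": ("time", "siglay", "node"),
}

conflicts = find_conflicting_variables(dims, variable_dims)
print(conflicts)  # ['siglay']
```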

It seems like this relaxation is compatible with the refactoring of indexes.

@benbovy will the explicit indexes refactor fix this case?

This is mentioned elsewhere (can’t find the issue right now) and may be out of scope for this issue, but I’m going to say it anyway: opening a netCDF file with groups was not as easy as I wanted it to be when first starting out with xarray.

@djhoese For anything to do with opening netCDF files with groups see #4118 and the linked issues from there.

If people have examples of other weird cases involving groups (like groups within themselves or anything like that), then I would be interested to have those files to test with!

@rabernat While I agree that they’re (somewhat) confusing files, I think you’re missing two things:

  1. netCDF doesn’t enforce naming on dimensions and variables. Full stop. The only naming netCDF will care about is any conflict with an internal reserved name (I’m not sure that those even exist for anything besides attributes.) IMO that’s a good thing, but more importantly it’s not the netCDF library’s job to enforce any of it.

  2. CF is an attribute convention. This also means that the conventions say absolutely nothing about naming of variables and dimensions.

IMO, xarray is being overly pedantic here. xarray states that it adopts the Common Data Model (CDM); netCDF-java and the CDM were the tools used to generate the failing examples above.

I ran into this problem long ago too (see #457). Back then, the workaround we implemented was to exclude the offending variable (“siglay” or “isobaric” in the examples above) with the “drop_variables” optional argument. Of course, this is not great if you actually want to use the values in the variable you are dropping.

I personally don’t like the notion of a “two dimensional coordinate”; I find it confusing. However, netCDF files of this kind are common, so fully supporting them in xarray would be nice. But I don’t know how. Maybe just renaming the variable instead of dropping it, with a “rename_variables” argument? This is the only thing that comes to mind.

@TomNicholas yes with the explicit index refactor we should be able to relax the 1D coordinate / dimension matching name constraint in the Xarray data model.

“I’m sure there are some cases internally where we currently rely on this assumption, but it should be relatively easy to relax.”

I also initially thought it would be easy to relax, but I’m not so sure anymore. I don’t think it is a hard task, but it might still require a fair amount of work. I’ve already refactored a bunch of such internal cases in #5692, but there’s a good chance that some (not sure how many) cases will still need a fix.

Perhaps part of the confusion is simply that y has different meanings in different contexts. When used as a dimension (e.g. to “define the array shape of a Variable” in CDM terms), it is indeed 1D. When used as a variable (or “CoordinateAxis”), it is 2D. xarray doesn’t have a separate namespace for dimensions and variables.