xarray: Segfault writing large netcdf files to s3fs
What happened?
It seems that netCDF4 files currently cannot be written reliably through s3fs, the FUSE filesystem layer over S3-compatible storage, with either the default netcdf4 engine or with h5netcdf.
Here is an example:

import numpy as np
import xarray as xr
from datetime import datetime, timedelta

NTIMES = 48
start = datetime(2022, 10, 6, 0, 0)
time_vals = [start + timedelta(minutes=20 * t) for t in range(NTIMES)]
# Timestamps stored as fixed-width byte strings along the 'Time' dimension
times = xr.DataArray(data=[t.strftime('%Y%m%d%H%M%S').encode() for t in time_vals],
                     dims=['Time'])
v1 = xr.DataArray(data=np.zeros((len(times), 201, 201)), dims=['Time', 'x', 'y'])
ds = xr.Dataset(data_vars=dict(times=times, v1=v1))
ds.to_netcdf(path='/my_s3_fs/test_netcdf.nc', format='NETCDF4', mode='w')
On my system this code crashes with NTIMES=48, but completes without an error with NTIMES=24.
The output with NTIMES=48 is:
There are 1 HDF5 objects open!
Report: open objects on 72057594037927936
Segmentation fault (core dumped)
I have also tried xarray's other NETCDF4-capable engine (engine='h5netcdf') and got a segfault as well.
A quick workaround seems to be to write the NetCDF file to the local filesystem and then move the complete file to S3:

import shutil

ds.to_netcdf(path='/tmp/test_netcdf.nc', format='NETCDF4', mode='w')
shutil.move('/tmp/test_netcdf.nc', '/my_s3_fs/test_netcdf.nc')
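The workaround can be wrapped in a small helper so that every write goes through a local temporary file first. This is a minimal standard-library sketch; the helper name `write_then_move` and the write-callback pattern are my own, not from the report:

```python
import os
import shutil
import tempfile


def write_then_move(write_fn, dest_path):
    """Run write_fn against a local temp path, then move the finished
    file to dest_path (e.g. a path on an s3fs FUSE mount).

    Only the final sequential copy touches the remote filesystem, which
    avoids the random-access writes HDF5 performs while the file is open.
    """
    fd, tmp_path = tempfile.mkstemp(suffix='.nc')
    os.close(fd)
    try:
        write_fn(tmp_path)
        shutil.move(tmp_path, dest_path)
    finally:
        # Clean up the temp file if the write failed before the move.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)


# Usage with the dataset from the example above (paths as in the report):
# write_then_move(
#     lambda p: ds.to_netcdf(path=p, format='NETCDF4', mode='w'),
#     '/my_s3_fs/test_netcdf.nc')
```

Passing the write as a callable keeps the helper independent of xarray, so it can also be used for any other library that writes local files.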
There are several pieces of software involved here: xarray (0.16.1), netCDF4 (1.5.4), HDF5 (1.10.6), and s3fs (1.79). If the bug is not in my code but in the underlying libraries, it is most likely not an xarray bug; but since it fails with both NETCDF4 engines, I decided to report it here.
What did you expect to happen?
With NTIMES=24 I get a file /my_s3_fs/test_netcdf.nc of about 7.8 MBytes. With NTIMES=36 I get an empty file. I would expect this code to run without a segfault and produce a nonempty file.
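The ~7.8 MB figure is consistent with the raw payload of v1 alone: NTIMES steps of a 201 × 201 float64 grid. A quick check of the arithmetic (the helper function is mine, for illustration only):

```python
# Uncompressed payload of v1: NTIMES * 201 * 201 cells of 8-byte float64.
def raw_size_mb(ntimes, nx=201, ny=201, itemsize=8):
    return ntimes * nx * ny * itemsize / 1e6


print(raw_size_mb(24))  # 7.756992 -> matches the ~7.8 MB file that succeeds
print(raw_size_mb(48))  # 15.513984 -> the failing case is roughly twice that
```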
Minimal Complete Verifiable Example
import numpy as np
import xarray as xr
from datetime import datetime, timedelta

NTIMES = 48
start = datetime(2022, 10, 6, 0, 0)
time_vals = [start + timedelta(minutes=20 * t) for t in range(NTIMES)]
times = xr.DataArray(data=[t.strftime('%Y%m%d%H%M%S').encode() for t in time_vals],
                     dims=['Time'])
v1 = xr.DataArray(data=np.zeros((len(times), 201, 201)), dims=['Time', 'x', 'y'])
ds = xr.Dataset(data_vars=dict(times=times, v1=v1))
ds.to_netcdf(path='/my_s3_fs/test_netcdf.nc', format='NETCDF4', mode='w')
MVCE confirmation
- Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- Complete example — the example is self-contained, including all data and the text of any traceback.
- Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- New issue — a search of GitHub Issues suggests this is not a duplicate.
Relevant log output
There are 1 HDF5 objects open!
Report: open objects on 72057594037927936
Segmentation fault (core dumped)
Anything else we need to know?
No response
Environment
INSTALLED VERSIONS
commit: None
python: 3.8.3 | packaged by conda-forge | (default, Jun 1 2020, 17:43:00) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-26-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.16.1
pandas: 1.1.3
numpy: 1.19.1
scipy: 1.5.2
netCDF4: 1.5.4
pydap: None
h5netcdf: 1.0.2
h5py: 3.1.0
Nio: None
zarr: None
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.30.0
distributed: None
matplotlib: 3.3.1
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 50.3.0.post20201006
pip: 20.2.3
conda: 22.9.0
pytest: 6.1.1
IPython: 7.18.1
sphinx: None
About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 16
Since we have eliminated xarray with this, you should be able to submit an issue to the h5py issue tracker, mentioning that this is probably a bug in libhdf5, since netcdf4 also fails with the same error (and you can also link this issue for more information).

I had to change ints and floats to doubles to reproduce the issue.
Can confirm the issue with xarray 2022.6.0 and dask 2022.9.2, the latest versions available on conda-forge. The issue might be related to the netcdf4 and hdf5 libraries; I will try to update those as well.