xarray: open_mfdataset() significantly slower on 0.9.1 vs. 0.8.2

I noticed a large speed discrepancy between xarray 0.8.2 and 0.9.1 when calling open_mfdataset() on a dataset of about 1.2 GB, consisting of 3 files, with netcdf4 as the engine. 0.8.2 was run first, so this is probably not a disk caching issue: a cold cache would have slowed the 0.8.2 run, not the 0.9.1 run.

Test

import xarray as xr
import time

start_time = time.time()
ds0 = xr.open_mfdataset('./*.nc')
print("--- %s seconds ---" % (time.time() - start_time))

Result

xarray==0.8.2, dask==0.11.1, netcdf4==1.2.4

--- 0.736030101776 seconds ---

xarray==0.9.1, dask==0.13.0, netcdf4==1.2.4

--- 52.2800869942 seconds ---
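A single wall-clock measurement can be skewed by disk caching or other one-off effects. As a minimal sketch (the helper name `time_call` and the `repeats` parameter are illustrative assumptions, not part of the original report), the timing can be repeated so that cold and warm runs can be compared:

```python
import time


def time_call(func, *args, repeats=3, **kwargs):
    """Call func `repeats` times; return the wall-clock seconds of each call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
        func(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return timings
```

With xarray installed, `time_call(xr.open_mfdataset, './*.nc')` would return one timing per run. If the first entry is much larger than the rest, caching plays a role; if all entries are similarly slow, the cost lies in the library code.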

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 17 (11 by maintainers)

Most upvoted comments

Looks like this has been resolved! I tested with the latest pre-release, v0.10.0rc2, on the dataset linked by najascutellatus above: https://marine.rutgers.edu/~michaesm/netcdf/data/

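# Use dask's synchronous scheduler so the profiler sees all work in one thread
da.set_options(get=da.async.get_sync)
# IPython magic: profile the call and show the top 10 entries
%prun -l 10 ds = xr.open_mfdataset('./*.nc')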

xarray==0.10.0rc2-1-g8267fdb dask==0.15.4

         194381 function calls (188429 primitive calls) in 0.869 seconds

   Ordered by: internal time
   List reduced from 469 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       50    0.393    0.008    0.393    0.008 {numpy.core.multiarray.arange}
       50    0.164    0.003    0.557    0.011 indexing.py:266(_index_indexer_1d)
        5    0.083    0.017    0.085    0.017 netCDF4_.py:185(_open_netcdf4_group)
      190    0.024    0.000    0.066    0.000 netCDF4_.py:256(open_store_variable)
      190    0.022    0.000    0.022    0.000 netCDF4_.py:29(__init__)
       50    0.018    0.000    0.021    0.000 {operator.getitem}
5145/3605    0.012    0.000    0.019    0.000 indexing.py:493(shape)
2317/1291    0.009    0.000    0.094    0.000 _abcoll.py:548(update)
    26137    0.006    0.000    0.013    0.000 {isinstance}
      720    0.005    0.000    0.006    0.000 {method 'getncattr' of 'netCDF4._netCDF4.Variable' objects}

xarray==0.9.1 dask==0.13.0


         241253 function calls (229881 primitive calls) in 98.123 seconds

   Ordered by: internal time
   List reduced from 659 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       30   87.527    2.918   87.527    2.918 {pandas._libs.tslib.array_to_timedelta64}
       65    7.055    0.109    7.059    0.109 {operator.getitem}
       80    0.799    0.010    0.799    0.010 {numpy.core.multiarray.arange}
7895/4420    0.502    0.000    0.524    0.000 utils.py:412(shape)
       68    0.442    0.007    0.442    0.007 {pandas._libs.algos.ensure_object}
       80    0.350    0.004    1.150    0.014 indexing.py:318(_index_indexer_1d)
    60/30    0.296    0.005   88.407    2.947 timedeltas.py:158(_convert_listlike)
       30    0.284    0.009    0.298    0.010 algorithms.py:719(checked_add_with_arr)
      123    0.140    0.001    0.140    0.001 {method 'astype' of 'numpy.ndarray' objects}
 1049/719    0.096    0.000   96.513    0.134 {numpy.core.multiarray.array}
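In the 0.9.1 profile, pandas' array_to_timedelta64 accounts for about 87.5 of the 98.1 seconds, which points at timedelta decoding as the bottleneck. The following is a rough sketch of how the same kind of profile could be produced outside IPython with the standard library's cProfile. The helper names `profile_top` and `profile_open` are hypothetical, and the suggestion to pass `decode_times=False` assumes that this option also governs timedelta decoding in the installed xarray version:

```python
import cProfile
import io
import pstats


def profile_top(func, *args, limit=10, **kwargs):
    """Profile one call; return (result, report).

    The report lists the top `limit` entries sorted by internal time,
    similar to what `%prun -l 10` prints in IPython.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("tottime").print_stats(limit)
    return result, stream.getvalue()


def profile_open(pattern, **open_kwargs):
    """Hypothetical helper: profile xr.open_mfdataset on files matching `pattern`."""
    import xarray as xr  # imported lazily so profile_top works without xarray

    _, report = profile_top(xr.open_mfdataset, pattern, **open_kwargs)
    return report
```

Comparing `print(profile_open('./*.nc'))` with `print(profile_open('./*.nc', decode_times=False))` would show whether the timedelta conversion disappears from the report when decoding is skipped. This is a diagnostic, not a fix: with decoding disabled, time variables stay as raw numbers.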