xclim: Heat wave frequency is taking too long?

  • xclim version: 8438b0e5001c4014d0a318b974863e2b4dc4cc41
  • Python version: 3.6 and 3.7
  • Operating System: ubuntu

Description

Trying to compute heat_wave_frequency on a small dataset (350 KB) takes 2-3 minutes. Is that normal? You can see the list of files I used attached to this message. Each pair of files takes 2-3 minutes, so processing all of them takes a couple of hours…

What I Did

Here is a rough script I’m using to process this folder:

from pathlib import Path

import xarray as xr
from xclim.atmos import heat_wave_frequency
from time import perf_counter

path = "/d/xclim_data/heat_wave_test"
files = list(sorted(Path(path).glob("*.nc")))

# sorted() puts the tasmax files in the first half and the tasmin files in the second
mid = len(files) // 2
pairs = list(zip(files[:mid], files[mid:]))

t = perf_counter()

for n, pair in enumerate(pairs):
    tasmax = xr.open_dataset(str(pair[0])).data_vars["tasmax"]
    tasmin = xr.open_dataset(str(pair[1])).data_vars["tasmin"]
    out = heat_wave_frequency(tasmax=tasmax, tasmin=tasmin)
    out.to_netcdf(Path(path) / "test.nc")
    print(n)

print(f"{perf_counter() - t:.2f} seconds")

heat_wave_test - sample.zip

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 48 (19 by maintainers)

Most upvoted comments

Yes, it’s working. The heat wave frequency process went from 3 h 15 to 50 minutes (40 minutes for subsetting, 10 minutes to compute). The results are exactly the same as in the files I provided.
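
For context, a gridpoint subset like the one timed above can be done with xarray’s nearest-neighbour selection; a minimal sketch, where the file name and the Montreal coordinates are assumptions rather than the exact values used:

import xarray as xr

# Assumed input file and coordinates (roughly Montreal); depending on the source
# files, lon may need to be converted to the 0-360 convention first.
ds = xr.open_dataset("tasmax_day_BCCAQv2_sub.nc")
subset = ds.sel(lat=45.5, lon=-73.6, method="nearest")
subset.to_netcdf("subset_tasmax_montreal.nc")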

Ah, good catch. You’re right, I see that on our side as well.

I don’t see a problem with us independently updating that single timestep in the metadata and comparing md5sums.

By the way, I’d be OK if you closed this in favor of #217. I might revisit this and try to lend a hand when we get to computations involving bounding boxes and polygons in Finch.

We have no problem redeploying Finch several times in the next few days. Including today.

Also, here are the subsetted files I used to check the results. You need to remove the .zip extension; GitHub doesn’t allow 7-zip attachments.

subset_tasmax_montreal.7z.zip subset_tasmin_montreal.7z.zip

Edit: retrying with a fake .zip extension

Note that the heat wave calculations for the portal were done only for the annual frequency: ‘YS’.

Hmm… for the BCCAQv2_heat_wave_frequency_gridpoint job, the frequency is currently hard-coded as ‘MS’ in the frontend. Should that be changed?

Yes. Or at least make ‘YS’ the default, with users able to select another frequency?
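
For reference, switching the job to the annual frequency only means passing freq="YS" to the indicator; a minimal sketch, reusing the thresholds from the script further down (tasmax and tasmin are placeholders for the opened DataArrays):

from xclim.atmos import heat_wave_frequency

# Same call as in the script below, but resampling annually ('YS') instead of monthly ('MS').
out = heat_wave_frequency(
    tasmax=tasmax,
    tasmin=tasmin,
    thresh_tasmax="30 degC",
    thresh_tasmin="20 degC",
    window=2,
    freq="YS",
)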

@tlogan2000 If you had time to look into this today, that would be great. I think I found a faster way.

Here are the results; the first one uses the current function in xclim, and the second one my function. subset_tasmax_montreal_heat_wave - xclim.zip subset_tasmax_montreal_heat_wave - faster.zip

Running this for all the datasets finishes in 7-8 minutes, versus almost 2 hours using the current method.

Here is my code. Basically, I replaced the xclim.run_length.windowed_run_events function with rolling_window_events, so the metadata will be exactly the same. You can also see the function I used to compare the results.

import warnings
from pathlib import Path
from time import perf_counter

import xarray as xr
import numpy as np
from xclim.atmos import heat_wave_frequency
import xclim.run_length


def rolling_window_events(da, window, dim="time"):
    # `da` is a boolean series for a single grid point (scalar lat/lon).
    # Mark every day that ends a run of at least `window` consecutive True values.
    window_count = da.rolling(time=window).sum()
    w = window_count.values[window - 1 :] >= window

    # Each False -> True transition starts a new event; add 1 if the series
    # already starts inside an event.
    count = np.count_nonzero(w[1:] > w[:-1]) + w[0]

    # Return a 0-d DataArray carrying the point's coordinates, so it can stand
    # in for xclim.run_length.windowed_run_events.
    data = np.array([count], dtype=np.int64)
    data = data.reshape(())
    out = xr.DataArray(data, coords={"lon": da.lon, "lat": da.lat})

    return out


def compute_heat_wave(output_path, tasmax_file, tasmin_file):
    t = perf_counter()

    tasmax = xr.open_dataset(str(tasmax_file)).data_vars["tasmax"]
    tasmin = xr.open_dataset(str(tasmin_file)).data_vars["tasmin"]
    thresh_tasmax = "30 degC"
    thresh_tasmin = "20 degC"
    out = heat_wave_frequency(
        tasmax=tasmax,
        tasmin=tasmin,
        thresh_tasmax=thresh_tasmax,
        thresh_tasmin=thresh_tasmin,
        window=2,
        freq="MS",
    )
    output_name = output_path / tasmin_file.stem.replace(
        "tasmin", "heat_wave_frequency"
    )
    out.to_netcdf(str(output_name) + ".nc")

    print(f"{perf_counter() - t:.2f} seconds")


def compare_data():
    heat_wave_faster = "/d/xclim_data/subset_tasmax_montreal_heat_wave - faster"
    heat_wave_xclim = "/d/xclim_data/subset_tasmax_montreal_heat_wave - xclim"

    hw_faster_files = list(sorted(Path(heat_wave_faster).glob("*.nc")))
    hw_xclim_files = list(sorted(Path(heat_wave_xclim).glob("*.nc")))

    for f1, f2 in zip(hw_xclim_files, hw_faster_files):
        ds1 = xr.open_dataset(f1)
        ds2 = xr.open_dataset(f2)
        # the two versions should produce identical results
        assert np.allclose(ds1.heat_wave_frequency.values, ds2.heat_wave_frequency.values)


def main():
    # monkeypatch windowed_run_events with a faster version
    xclim.run_length.windowed_run_events = rolling_window_events

    tasmin_files = list(
        sorted(Path("/d/xclim_data/subset_tasmin_montreal").glob("*.nc"))
    )
    tasmax_files = list(
        sorted(Path("/d/xclim_data/subset_tasmax_montreal").glob("*.nc"))
    )

    output_path = Path("/d/xclim_data/subset_tasmax_montreal_heat_wave")
    output_path.mkdir(exist_ok=True)

    t = perf_counter()

    for tasmin_file, tasmax_file in zip(tasmin_files, tasmax_files):
        compute_heat_wave(output_path, tasmax_file, tasmin_file)

    print(f"{perf_counter() - t:.2f} seconds")


if __name__ == "__main__":
    warnings.filterwarnings("ignore", category=FutureWarning)
    warnings.filterwarnings("ignore", category=UserWarning)
    main()

I would say that if the data provider (PCIC) can apply the fix to their files and send us the exact procedure, we can apply it here and then compare md5 checksums; that would be as good as getting the official fixed dataset.
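
For the checksum comparison, a minimal sketch (the file names are placeholders for our locally fixed file and the provider’s):

import hashlib

def md5sum(path, chunk_size=2 ** 20):
    # Stream the file so large NetCDFs don't have to fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Placeholder names: the file we fixed locally vs. the one fixed by PCIC.
assert md5sum("corrected_file.nc") == md5sum("corrected_file_pcic.nc")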

Also, could you check the file tasmax_day_BCCAQv2+ANUSPLIN300_CESM1-CAM5_historical+rcp85_r1i1p1_19500101-21001231_sub.nc?

When running assert_daily on it, it raises an error saying the time is not monotonically increasing. The assertion passes for all the other files…

This file has an error in the timestamp data: the 2036-10-28 time step is coded as 1850-01-01.

I think we should rewrite the original BCCAQv2 file, copying the time coordinate from the tasmin file, which is fine.

This might work. We would need to rechunk as before, etc.

import xarray as xr

# tasmin has a valid time axis; read tasmax without its broken time variable.
ds_tasmin = xr.open_dataset('tasmin_day_BCCAQv2+ANUSPLIN300_CESM1-CAM5_historical+rcp85_r1i1p1_19500101-21001231_sub.nc', decode_times=False)
ds_tasmax = xr.open_dataset('tasmax_day_BCCAQv2+ANUSPLIN300_CESM1-CAM5_historical+rcp85_r1i1p1_19500101-21001231_sub.nc', mask_and_scale=False, drop_variables='time')
# Copy the valid time coordinate onto tasmax and write a corrected file.
ds_tasmax['time'] = ds_tasmin.time
ds_tasmax.to_netcdf('corrected_file.nc')
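
For the “rechunk as before” part, the last line could carry over the on-disk layout explicitly; a minimal sketch, assuming the relevant keys survive in the variable’s encoding after open_dataset:

# Carry over the original chunking/compression settings when writing the corrected file.
enc = ds_tasmax["tasmax"].encoding
keep = {k: enc[k] for k in ("chunksizes", "zlib", "complevel") if k in enc}
ds_tasmax.to_netcdf("corrected_file.nc", encoding={"tasmax": keep})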

Can you provide the test files?

See the link at the end of the first message.