xclim: Heat wave frequency is taking too long?
- xclim version: 8438b0e5001c4014d0a318b974863e2b4dc4cc41
- Python version: 3.6 and 3.7
- Operating System: ubuntu
Description
Trying to compute heat_wave_frequency on a small dataset (350Kb) takes 2-3 minutes. Is that normal? You can see the list of files I used attached to this message. So each paris of files takes 2-3 minutes. Processing all of them, a couple hours…
What I Did
Here is a rough script I’m using to process this folder:
from pathlib import Path
import xarray as xr
from xclim.atmos import heat_wave_frequency
from time import perf_counter
path = "/d/xclim_data/heat_wave_test"
files = list(sorted(Path(path).glob("*.nc")))
mid = len(files) // 2
pairs = list(zip(files[:mid], files[mid:]))
t = perf_counter()
for n, pair in enumerate(pairs):
tasmax = xr.open_dataset(str(pair[0])).data_vars["tasmax"]
tasmin = xr.open_dataset(str(pair[1])).data_vars["tasmin"]
out = heat_wave_frequency(tasmax=tasmax, tasmin=tasmin)
out.to_netcdf(Path(path) / "test.nc")
print(n)
print(f"{perf_counter() - t:.2f} seconds")
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 48 (19 by maintainers)
Yes, it’s working. Heat wave frequency process went from 3h15 to 50 minutes (40 minutes subsetting, 10 minutes to compute). The results are exactly like in the files I provided.
Ah good catch. You’re right I see that on our side as well.
I don’t see a problem with us independently updating that single timestep in the metadata and comparing md5sums.
By the way, I’d be ok if you closed this in favor of #217. I might revisit this and try to lend a hand when we get to computations involving bounding boxes and polygons in finch.
We have no problem redeploying Finch several times in the next few days. Including today.
Also here are the subsetted files I used to check the results. You need to remove the
.zip
extension, github doesn’t allow 7-zip files.subset_tasmax_montreal.7z.zip subset_tasmin_montreal.7z.zip
Edit: retrying with fake .zip extension
Yes. Or at least have YS as the default with users being able to select ?
@tlogan2000 If you had time to into this today, that would be great. I think I found a faster way.
Here are the results, the first one is using the current function in xclim, and the second one my function. subset_tasmax_montreal_heat_wave - xclim.zip subset_tasmax_montreal_heat_wave - faster.zip
Running this for all the datasets, it finishes in 7-8 minutes vs almost 2 hours using the current method.
Here is my code. Basically, I replaced the
xclim.run_length.windowed_run_events
function withrolling_window_events
. So the metadata will be exactly the same. You can also see the function I used to compare the results.I would say that if the data provider (PCIC) can apply the fix to their files, send us the exact procedure so we can apply it here and then run a md5 checksum, that would be as if we got the official fixed dataset.
This file has an error in the timestamp data. 2036-10-28 time step coded as 1850-01-01
I think we should probably really rewrite the original bccaqv2 file copying the time from the tasmin file which is ok.
This might work. We would need to rechunk as before etc
see the link at the end of the first message