pangeo-forge-recipes: Xarray-to-Zarr recipe runs out of memory
I’m following the Pangeo Forge tutorial "Xarray-to-Zarr Sequential Recipe: NOAA OISST" to create a recipe for the CEDA monthly daytime land surface temperature data, but I’m running into problems with pangeo-forge-recipes version 0.10.0 (installed from conda-forge).
Here’s my code in recipe.py (I’m using Python 3.11):
import os
from tempfile import TemporaryDirectory
import apache_beam as beam
import pandas as pd
import xarray as xr
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.transforms import (
    OpenURLWithFSSpec,
    OpenWithXarray,
    StoreToZarr,
)
url_pattern = (
    "https://dap.ceda.ac.uk/neodc/esacci/land_surface_temperature/data/"
    "MULTISENSOR_IRCDR/L3S/0.01/v2.00/monthly/{time:%Y}/{time:%m}/"
    "ESACCI-LST-L3S-LST-IRCDR_-0.01deg_1MONTHLY_DAY-{time:%Y%m}01000000-fv2.00.nc"
)
months = pd.date_range("1995-08", "2020-12", freq=pd.offsets.MonthBegin())
urls = tuple(url_pattern.format(time=month) for month in months)
# Prune to 1 element to minimize memory reqs for now
pattern = pattern_from_file_sequence(urls, "time", nitems_per_file=1).prune(1)
temp_dir = TemporaryDirectory()
target_root = temp_dir.name
store_name = "output.zarr"
target_store = os.path.join(target_root, store_name)
transforms = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        target_root=target_root,
        store_name=store_name,
        combine_dims=pattern.combine_dim_keys,
        target_chunks={"time": 1, "lat": 5, "lon": 5},
    )
)
print(f"{pattern=}")
print(f"{target_store=}")
print(f"{transforms=}")
with beam.Pipeline() as p:
    p | transforms  # type: ignore[reportUnusedExpression]

with xr.open_zarr(target_store) as ds:
    print(ds)
When I run this, the process is eventually killed after consuming an enormous amount of memory. I saw the Python process exceed 40 GB (on my 16 GB machine), but it may well have gone beyond that while I wasn’t watching; it ran for about 3.5 hours:
$ time python recipe.py
pattern=<FilePattern {'time': 1}>
target_store='/var/folders/v_/q9ql2x2n3dlg2td_b6xkcjzw0000gn/T/tmpozgcr3ng/output.zarr'
transforms=<_ChainedPTransform(PTransform) label=[Create|OpenURLWithFSSpec|OpenWithXarray|StoreToZarr] at 0x162819b90>
...
.../python3.11/site-packages/xarray/core/dataset.py:2461: SerializationWarning: saving variable None with floating point data as an integer dtype without any _FillValue to use for NaNs
return to_zarr( # type: ignore[call-overload,misc]
Killed: 9
real 216m31.108s
user 76m14.794s
sys 90m21.965s
.../python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
I’m going to downgrade pangeo-forge-recipes to a version prior to the recently introduced breaking API changes to see if I encounter the same problem with the old API, but in the meantime, is there anything glaringly wrong with what I’ve written above that would cause the memory issue?
About this issue
- State: open
- Created 10 months ago
- Reactions: 1
- Comments: 15 (15 by maintainers)
I finally got a successful run. I used target_chunks={"lat": 3600, "lon": 7200} and it took ~30 minutes to complete. I now see the float64 result you mentioned, even though the original data is float32. Is there any way for us to prevent this unwanted conversion?
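One untested possibility (a sketch, assuming the float32-to-float64 upcast happens during xarray's mask-and-scale decoding when the files are opened, and that OpenWithXarray forwards xarray_open_kwargs to xr.open_dataset) is to disable that decoding so variables keep their on-disk dtypes:

# Hedged sketch, not from the original thread. Assumes the upcast comes from
# xarray's mask_and_scale decoding and that OpenWithXarray passes
# xarray_open_kwargs through to xr.open_dataset.
from pangeo_forge_recipes.transforms import OpenWithXarray

open_files = OpenWithXarray(
    file_type=pattern.file_type,  # `pattern` is the FilePattern defined in the recipe above
    xarray_open_kwargs={"mask_and_scale": False},  # keep the packed on-disk dtype
)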
I suspect the problem may be on the CEDA server side. We’ve experienced a great deal of flakiness with the CEDA OPeNDAP server during some experimental work (which is what led me, on the advice of @sharkinsspatial, to look at writing this recipe).
This is the problem: target_chunks={"time": 1, "lat": 5, "lon": 5}.
You are trying to decimate this very large dataset into a vast number of tiny, tiny chunks. This is overwhelming Beam with one task per chunk.
The original dataset dimensions are {'time': 1, 'lat': 18000, 'lon': 36000, 'length_scale': 1, 'channel': 2}, so with 5 × 5 lat/lon chunks this pipeline creates (18000 / 5) × (36000 / 5) = 25,920,000 different tasks (one for each chunk). Beyond the inefficient pipeline, a Zarr store with such tiny chunks would be extremely hard to use; we generally aim for chunks of 1-100 MB. The following chunks seemed to work for me.
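The snippet that originally followed isn’t preserved in this excerpt; judging from the successful run reported earlier in the thread, the working setting was along these lines (chunking time by 1 is an assumption):

# Reconstructed, not verbatim from this comment: the lat/lon values match the
# chunking reported to work above; chunking time by 1 is an assumption.
# A single 3600 x 7200 float32 slice is ~100 MB, within the 1-100 MB target range.
target_chunks = {"time": 1, "lat": 3600, "lon": 7200}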
However, then I hit a new error, which I swear I have seen before but can’t remember where.