kerchunk: Failure to combine multiple JSON reference files via MultiZarrToZarr()
I have the following JSON reference files:
13G Nov 2 10:07 sarah3_sid_reference_1999.json
13G Nov 2 09:58 sarah3_sid_reference_2000.json
13G Nov 2 11:00 sarah3_sid_reference_2001.json
13G Nov 2 11:08 sarah3_sid_reference_2002.json
13G Nov 2 12:04 sarah3_sid_reference_2003.json
13G Nov 2 12:12 sarah3_sid_reference_2004.json
13G Nov 2 13:07 sarah3_sid_reference_2005.json
13G Nov 2 14:29 sarah3_sid_reference_2006.json
13G Nov 2 15:27 sarah3_sid_reference_2007.json
13G Nov 2 16:45 sarah3_sid_reference_2008.json
13G Nov 2 17:43 sarah3_sid_reference_2009.json
13G Nov 2 19:02 sarah3_sid_reference_2010.json
13G Nov 2 19:58 sarah3_sid_reference_2011.json
13G Nov 2 21:25 sarah3_sid_reference_2012.json
13G Nov 2 22:13 sarah3_sid_reference_2013.json
13G Nov 2 23:43 sarah3_sid_reference_2014.json
13G Nov 3 00:36 sarah3_sid_reference_2015.json
13G Nov 3 02:03 sarah3_sid_reference_2016.json
13G Nov 3 02:58 sarah3_sid_reference_2017.json
13G Nov 3 04:24 sarah3_sid_reference_2018.json
13G Nov 3 05:21 sarah3_sid_reference_2019.json
13G Nov 3 06:48 sarah3_sid_reference_2020.json
13G Nov 3 07:41 sarah3_sid_reference_2021.json
I am trying to combine them, essentially via:
from pathlib import Path

import fsspec
import ujson
from kerchunk.combine import MultiZarrToZarr

mzz = MultiZarrToZarr(
    reference_file_paths,
    concat_dims=['time'],
    identical_dims=['lat', 'lon'],
)
multifile_kerchunk = mzz.translate()

combined_reference_filename = Path(combined_reference)
local_fs = fsspec.filesystem('file')
with local_fs.open(combined_reference_filename, 'wb') as f:
    f.write(ujson.dumps(multifile_kerchunk).encode())
(with reference_file_paths and combined_reference standing in for the actual input file paths and the output filename) on an HPC system with
❯ free -hm
total used free shared buff/cache available
Mem: 503Gi 4.7Gi 495Gi 2.8Gi 3.1Gi 494Gi
Swap: 0B 0B 0B
and it fails, raising the following error:
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ cache_size = 128 │ │
│ │ dic = {'protocol': None} │ │
│ │ f = <fsspec.implementations.local.LocalFileOpener object at 0x148d7e531270> │ │
│ │ fo = '/project/home/p200206/data/sarah3_sid_reference_1999.json' │ │
│ │ fo2 = '/project/home/p200206/data/sarah3_sid_reference_1999.json' │ │
│ │ fs = None │ │
│ │ kwargs = {} │ │
│ │ max_block = 256000000 │ │
│ │ max_gap = 64000 │ │
│ │ ref_fs = <fsspec.implementations.local.LocalFileSystem object at 0x14f59fd1db10> │ │
│ │ ref_storage_args = None │ │
│ │ remote_options = {} │ │
│ │ remote_protocol = None │ │
│ │ self = <fsspec.implementations.reference.ReferenceFileSystem object at │ │
│ │ 0x148d7e531120> │ │
│ │ simple_templates = True │ │
│ │ target = None │ │
│ │ target_options = None │ │
│ │ target_protocol = None │ │
│ │ template_overrides = None │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
JSONDecodeError: Could not reserve memory block
Any hints?
About this issue
- State: closed
- Created 8 months ago
- Comments: 23 (10 by maintainers)
Yes, probably! But still, to prevent you from making reference sets that are too big to handle, I really do think it should be done in parquet.
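For illustration, writing the combined references to a parquet store rather than a single giant JSON can look roughly like the sketch below. This is only a sketch: the output directory name is made up, and the exact LazyReferenceMapper.create signature and a sensible record_size depend on the installed fsspec/kerchunk versions, so check the kerchunk parquet-storage docs for your setup.

# Sketch: stream the combined references into a parquet store instead of JSON.
# Assumes a recent fsspec/kerchunk; verify LazyReferenceMapper.create's exact
# signature for your versions. The output directory name is hypothetical.
import fsspec
from fsspec.implementations.reference import LazyReferenceMapper
from kerchunk.combine import MultiZarrToZarr

fs = fsspec.filesystem('file')
out_dir = 'sarah3_sid_combined.parq'      # hypothetical output directory
fs.makedirs(out_dir, exist_ok=True)
out = LazyReferenceMapper.create(out_dir, fs=fs, record_size=100_000)

mzz = MultiZarrToZarr(
    reference_file_paths,                 # same list of yearly JSON files
    concat_dims=['time'],
    identical_dims=['lat', 'lon'],
    out=out,                              # write references as they are produced
)
mzz.translate()
out.flush()                               # flush any remaining records to disk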
Of course, but you are not at that point yet.
Yes, you can adopt a pair-wise tree to do combining, but the exception sounds like you cannot load any reference set into memory (I note it fails on the first file).
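For illustration only, a pair-wise (tree) combine could look like the sketch below. It still has to hold two reference sets in memory at once, so it does not by itself fix the memory error reported above; the helper name and loop structure are my own, not a kerchunk API.

# Sketch of a pair-wise (tree) reduction over the yearly reference files.
# Purely illustrative: each step still loads two reference sets into memory.
from kerchunk.combine import MultiZarrToZarr

def combine_pair(refs_a, refs_b):
    # refs_a / refs_b may be file paths or in-memory reference dicts
    return MultiZarrToZarr(
        [refs_a, refs_b],
        concat_dims=['time'],
        identical_dims=['lat', 'lon'],
    ).translate()

level = list(reference_file_paths)        # start from the yearly files
while len(level) > 1:
    pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
    level = [combine_pair(*p) if len(p) == 2 else p[0] for p in pairs]

multifile_kerchunk = level[0]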
@rsignell-usgs, do you have the time to go through making a big parquet reference set?
The limiting factor for the size of the reference sets is not the total number of bytes but the total number of references, so the chunking scheme is perhaps more important here.
Was this a choice made specifically with later kerchunking in mind, or was there another motivation? Small chunks allow random access to single values, but they of course mean many more references and larger reference sets, as well as worse data throughput when loading contiguous data.
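As a rough illustration of why the chunking scheme dominates the reference count, the arithmetic below counts chunk references for one variable; the array and chunk shapes are made-up placeholders, not the actual SARAH-3 layout.

import math

# Hypothetical numbers only, to show how chunk shape drives reference count.
array_shape = (17520, 2600, 2600)   # (time, lat, lon) for one year, assumed
chunk_shape = (1, 100, 100)         # small spatial chunks, assumed

chunks_per_dim = (math.ceil(s / c) for s, c in zip(array_shape, chunk_shape))
n_refs = math.prod(chunks_per_dim)
print(f"{n_refs:,} chunk references for one variable, one year")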