kerchunk: Failure to combine multiple JSON reference files via MultiZarrToZarr()

I have the following JSON reference files:

13G Nov  2 10:07 sarah3_sid_reference_1999.json
13G Nov  2 09:58 sarah3_sid_reference_2000.json
13G Nov  2 11:00 sarah3_sid_reference_2001.json
13G Nov  2 11:08 sarah3_sid_reference_2002.json
13G Nov  2 12:04 sarah3_sid_reference_2003.json
13G Nov  2 12:12 sarah3_sid_reference_2004.json
13G Nov  2 13:07 sarah3_sid_reference_2005.json
13G Nov  2 14:29 sarah3_sid_reference_2006.json
13G Nov  2 15:27 sarah3_sid_reference_2007.json
13G Nov  2 16:45 sarah3_sid_reference_2008.json
13G Nov  2 17:43 sarah3_sid_reference_2009.json
13G Nov  2 19:02 sarah3_sid_reference_2010.json
13G Nov  2 19:58 sarah3_sid_reference_2011.json
13G Nov  2 21:25 sarah3_sid_reference_2012.json
13G Nov  2 22:13 sarah3_sid_reference_2013.json
13G Nov  2 23:43 sarah3_sid_reference_2014.json
13G Nov  3 00:36 sarah3_sid_reference_2015.json
13G Nov  3 02:03 sarah3_sid_reference_2016.json
13G Nov  3 02:58 sarah3_sid_reference_2017.json
13G Nov  3 04:24 sarah3_sid_reference_2018.json
13G Nov  3 05:21 sarah3_sid_reference_2019.json
13G Nov  3 06:48 sarah3_sid_reference_2020.json
13G Nov  3 07:41 sarah3_sid_reference_2021.json

Trying to combine them, essentially via:

    import fsspec
    import ujson
    from pathlib import Path

    from kerchunk.combine import MultiZarrToZarr

    # Combine the per-year reference files along the time dimension
    mzz = MultiZarrToZarr(
        reference_file_paths,
        concat_dims=['time'],
        identical_dims=['lat', 'lon'],
    )
    multifile_kerchunk = mzz.translate()

    # Serialise the combined references to a single JSON file
    combined_reference_filename = Path(combined_reference)
    local_fs = fsspec.filesystem('file')
    with local_fs.open(combined_reference_filename, 'wb') as f:
        f.write(ujson.dumps(multifile_kerchunk).encode())

(with the self-explanatory variables bound to the actual file paths and an output filename) on an HPC system with

❯ free -hm
              total        used        free      shared  buff/cache   available
Mem:          503Gi       4.7Gi       495Gi       2.8Gi       3.1Gi       494Gi
Swap:            0B          0B          0B

and it fails, raising the following error:

│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │         cache_size = 128                                                                     │ │
│ │                dic = {'protocol': None}                                                      │ │
│ │                  f = <fsspec.implementations.local.LocalFileOpener object at 0x148d7e531270> │ │
│ │                 fo = '/project/home/p200206/data/sarah3_sid_reference_1999.json'             │ │
│ │                fo2 = '/project/home/p200206/data/sarah3_sid_reference_1999.json'             │ │
│ │                 fs = None                                                                    │ │
│ │             kwargs = {}                                                                      │ │
│ │          max_block = 256000000                                                               │ │
│ │            max_gap = 64000                                                                   │ │
│ │             ref_fs = <fsspec.implementations.local.LocalFileSystem object at 0x14f59fd1db10> │ │
│ │   ref_storage_args = None                                                                    │ │
│ │     remote_options = {}                                                                      │ │
│ │    remote_protocol = None                                                                    │ │
│ │               self = <fsspec.implementations.reference.ReferenceFileSystem object at         │ │
│ │                      0x148d7e531120>                                                         │ │
│ │   simple_templates = True                                                                    │ │
│ │             target = None                                                                    │ │
│ │     target_options = None                                                                    │ │
│ │    target_protocol = None                                                                    │ │
│ │ template_overrides = None                                                                    │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
JSONDecodeError: Could not reserve memory block

Any hints?

Most upvoted comments

Yes, probably! But still, to prevent you from making reference sets that are too big to handle, I really do think it should be done in parquet.
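
A minimal sketch of the parquet route, assuming a kerchunk/fsspec recent enough to ship `LazyReferenceMapper` (the output directory name is illustrative): instead of materialising the combined references as one giant in-memory dict and dumping it to JSON, write them incrementally into a parquet store.

    import fsspec
    from fsspec.implementations.reference import LazyReferenceMapper
    from kerchunk.combine import MultiZarrToZarr

    fs = fsspec.filesystem('file')
    out_dir = 'combined.parq'  # illustrative output directory
    fs.makedirs(out_dir, exist_ok=True)

    # Parquet-backed reference store: references are flushed to disk in
    # batches of `record_size` rows rather than held in one dict
    out = LazyReferenceMapper.create(root=out_dir, fs=fs, record_size=100_000)

    mzz = MultiZarrToZarr(
        reference_file_paths,  # the per-year JSON files from above
        concat_dims=['time'],
        identical_dims=['lat', 'lon'],
        out=out,  # write into the parquet store instead of a dict
    )
    mzz.translate()
    out.flush()

The resulting directory can then be opened with, e.g., `fsspec.filesystem('reference', fo='combined.parq', remote_protocol='file')`, which loads the parquet records lazily instead of parsing one huge JSON blob.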

https://github.com/fsspec/kerchunk/issues/240 is about opening the datasets with zarr/xarray, not relevant here.

Well, that would be the main goal then in the end

Of course, but you are not at that point yet

Yes, you can adopt a pair-wise tree to do the combining, but the exception suggests you cannot load even a single reference set into memory (I note it fails on the first file).
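
For completeness, a rough sketch of such a pair-wise (batched) tree reduction, with made-up intermediate file names and batch size; note it only helps when the individual reference sets do fit in memory:

    import ujson
    from kerchunk.combine import MultiZarrToZarr

    def combine_batch(paths, out_path):
        """Combine a small batch of reference files into one JSON file."""
        mzz = MultiZarrToZarr(
            paths, concat_dims=['time'], identical_dims=['lat', 'lon']
        )
        with open(out_path, 'wb') as f:
            f.write(ujson.dumps(mzz.translate()).encode())
        return out_path

    def tree_combine(paths, batch_size=4, level=0):
        """Recursively combine batches until a single file remains."""
        if len(paths) == 1:
            return paths[0]
        next_level = [
            combine_batch(
                paths[i:i + batch_size],
                f'combined_l{level}_{i // batch_size}.json',
            )
            for i in range(0, len(paths), batch_size)
        ]
        return tree_combine(next_level, batch_size, level + 1)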

@rsignell-usgs , do you have the time to go through making a big parquet reference set?

how do others then crunch large time series which occupy hundreds of TBs on disk?

The limiting factor for the size of the reference sets is not the total number of bytes but the total number of references, so the chunking scheme is perhaps more important here.
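
A back-of-the-envelope count shows why; the array shape below is an assumption for illustration (a 0.05-degree grid of roughly 2600 x 2600 cells with 48 half-hourly time steps per day), not a figure taken from the issue:

    import math

    # Assumed daily array shape (time, lat, lon) -- illustrative only
    nt, nlat, nlon = 48, 2600, 2600
    ct, clat, clon = 1, 32, 32  # the 1 x 32 x 32 chunking discussed below

    chunks_per_day = (
        math.ceil(nt / ct) * math.ceil(nlat / clat) * math.ceil(nlon / clon)
    )
    print(chunks_per_day)        # 48 * 82 * 82 = 322,752 references per day
    print(chunks_per_day * 365)  # ~118 million references per year

At roughly a hundred bytes per JSON reference entry, that order of magnitude is consistent with the 13 GB yearly files above.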

The daily NetCDF files are rechunked to 1 x 32 x 32

Was this a choice made specifically with later kerchunking in mind, or was there another motivation? Small chunks allow random access to single values, but they of course mean many, many more references and far bigger reference sets, as well as worse throughput when reading contiguous data.