arrow: [C++][Python] Large strings cause ArrowInvalid: offset overflow while concatenating arrays

When working with medium-sized datasets that contain very long strings, pyarrow fails when trying to operate on the strings. The failure is triggered by the combine_chunks function.

Here is a minimal reproducible example:

import numpy as np
import pyarrow as pa

# Create a large string
x = str(np.random.randint(low=0,high=1000, size=(30000,)).tolist())
t = pa.chunked_array([x]*20_000)
# Combine the chunks into large string array - fails
combined = t.combine_chunks()

I get the following error:

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
/var/folders/x6/00594j4s2yv3swcn98bn8gxr0000gn/T/ipykernel_95780/4128956270.py in <module>
----> 1 z=t.combine_chunks()

~/.venv/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.ChunkedArray.combine_chunks()

~/.venv/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.concat_arrays()

~/Documents/Github/dataquality/.venv/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/.venv/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: offset overflow while concatenating arrays

With smaller strings or smaller arrays this works fine:

x = str(np.random.randint(low=0,high=1000, size=(10,)).tolist())
t = pa.chunked_array([x]*1000)
combined = t.combine_chunks()

The failing example above takes a few minutes to run. If you’d like a faster reproduction for experimentation, you can use vaex to generate the chunked array. This throws the identical error and runs in about one second.

import vaex
import numpy as np

n = 50_000
x = str(np.random.randint(low=0,high=1000, size=(30_000,)).tolist())
df = vaex.from_arrays(
    id=list(range(n)),
    y=np.random.randint(low=0,high=1000,size=n)
)
df["text"] = vaex.vconstant(x, len(df))
# text_chunk_array is now a pyarrow.lib.ChunkedArray
text_chunk_array = df.text.values
x = text_chunk_array.combine_chunks() 

Reporter: Ben Epstein

Note: This issue was originally created as ARROW-17828. Please see the migration documentation for further details.

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 24 (4 by maintainers)

Most upvoted comments

“which only occurs when vaex is forced to do a df.take on rows which contain a string column whose unfiltered in-memory representation is larger than 2GB”

That sounds like #25822

“It looked to me very much like a 32-bit pointer limitation issue.”

Yes, it essentially is. We use 32-bit offsets for the default string and binary types. This means you always have to be careful when generating the arrays to chunk them if they get too large. Or otherwise use the LargeString variant, which uses 64-bit offsets. Polars, for example, takes the latter approach; they always use LargeString.
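
A minimal sketch of the LargeString workaround described above, assuming you control where the ChunkedArray is created: cast the column to pa.large_string() (64-bit offsets) before combining, so the concatenated offsets cannot overflow. The stand-in data here is illustrative, not the original reproduction.

import pyarrow as pa

# Small stand-in data; the same pattern applies to the failing example above.
t = pa.chunked_array([["x" * 100] * 1000] * 10, type=pa.string())

# Cast the default 32-bit-offset string type to large_string (64-bit offsets)
# before combining the chunks into a single array.
t_large = t.cast(pa.large_string())
combined = t_large.combine_chunks()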

Basically the bug with take and concatenation is that we aren’t automatically handling the chunking for you. But you are right that essentially the same issue shows up in multiple places.
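
For completeness, a sketch of what handling the chunking manually can look like while keeping the default 32-bit-offset string type: concatenate chunks in groups whose buffers stay under the 2 GB offset limit and return a ChunkedArray instead of one giant array. The concat_in_groups helper and the nbytes-based size check are illustrative assumptions, not Arrow API.

import pyarrow as pa

def concat_in_groups(chunked, limit=2**31 - 1):
    # Illustrative helper: merge chunks greedily, starting a new group
    # whenever the next chunk would push the group's buffers past the
    # 32-bit offset limit (nbytes is a rough proxy for string data size).
    groups, current, size = [], [], 0
    for chunk in chunked.chunks:
        if current and size + chunk.nbytes > limit:
            groups.append(pa.concat_arrays(current))
            current, size = [], 0
        current.append(chunk)
        size += chunk.nbytes
    if current:
        groups.append(pa.concat_arrays(current))
    return pa.chunked_array(groups)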

I’ll ping @maartenbreddels to get a deeper opinion.

@Ben-Epstein yup its really only a workaround that’s applicable to cases where you need the data in memory after filtering, and since vaex is meant to be lazy execution memory map as required a lot of folks will be encountering the problem in other contexts. Its certainly possible its a take explosion; the context in this case is that I’m working with filterset resulting from a df.take on an already potentially filtered set of indices of rows identified as having matching report ids due to the lack of indexing and a many to one record relationship between two tables that prevents a join from being used. Column failing are free text reports relating to leiomyosarcoma cases, so there’s less than 100 scattered throughout this table of >2M reports that get filtered via a regex query on a MedDRA term. Its possible the take is being multiplied across different tables/arrays from the different hdf5 files in the dataset multiplied by the separate chunks of those files and just creating a polynomial complexity, but I’m not familiar enough yet with the vaex internals to confirm that. As you figured out there the Dataframe take vs arrow take and the code complexity makes it a little challenging to debug. I’ll be able to look more under the hood of whats going on in a couple of days.

@leprechaunt33 thanks for checking back in. Since this same issue happened in huggingface datasets, it leads me to believe that it really is the .take explosion.

That’s certainly an interesting workaround, but not one I can use in production 😃

@leprechaunt33 I was working on a PR in vaex to remove the use of .take because in fact that is the issue coming from arrow (huggingface datasets faced the same issue). Details are here https://github.com/vaexio/vaex/pull/2336

See the related issue referenced in the PR for more details

Current workaround I’ve developed for vaex in general with this pyarrow-related error, on dataframes for which the technique mentioned above does not work (materialisation of a pandas array from a joined multi-file data frame where I was unable to set the arrow data type on the column); a code sketch follows the list below:

  • Catch the ArrowInvalid exception, create a blank pandas data frame with the required columns, and iterate the columns in the vaex data frame to materialise them one by one within the pandas df.
  • If ArrowInvalid is caught again, evaluate the rogue column with evaluate_iterator(), using prefetch and a suitable chunk_size that will not exceed the bounds of the string (working off the maximum expected record size), and collate the pyarrow StringArray/ChunkedArray data.
  • Continue iterating columns; typically only one or two columns will need the chunked treatment.
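
A minimal sketch of that column-by-column fallback, assuming vaex’s DataFrame.evaluate_iterator(chunk_size=..., prefetch=...) and Expression.values behave as in current releases; the materialize_to_pandas helper and the chunk_size value are illustrative, not part of either library’s API:

import numpy as np
import pandas as pd
import pyarrow as pa

def materialize_to_pandas(df, chunk_size=10_000):
    # Illustrative helper: materialise a vaex DataFrame column by column,
    # falling back to chunked evaluation for columns whose string data
    # overflows the 32-bit offsets.
    out = pd.DataFrame()
    for name in df.get_column_names():
        try:
            out[name] = np.asarray(df[name].values)
        except pa.ArrowInvalid:
            # Evaluate the rogue column in chunks small enough that no single
            # concatenation exceeds the 2 GB offset limit, then collate the pieces.
            pieces = []
            for _, _, chunk in df.evaluate_iterator(name, chunk_size=chunk_size, prefetch=True):
                pieces.append(pd.Series(np.asarray(chunk)))
            out[name] = pd.concat(pieces, ignore_index=True).values
    return out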