arrow: [C++][Python] Large strings cause ArrowInvalid: offset overflow while concatenating arrays
When working with medium-sized datasets that have very long strings, Arrow fails when trying to operate on the strings. The root cause is the combine_chunks function.
Here is a minimal reproducible example:
```python
import numpy as np
import pyarrow as pa

# Create a large string
x = str(np.random.randint(low=0, high=1000, size=(30_000,)).tolist())
t = pa.chunked_array([x] * 20_000)

# Combine the chunks into a single large string array - fails
combined = t.combine_chunks()
```
I get the following error:
```
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
/var/folders/x6/00594j4s2yv3swcn98bn8gxr0000gn/T/ipykernel_95780/4128956270.py in <module>
----> 1 z=t.combine_chunks()

~/.venv/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.ChunkedArray.combine_chunks()

~/.venv/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.concat_arrays()

~/Documents/Github/dataquality/.venv/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/.venv/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: offset overflow while concatenating arrays
```
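A rough back-of-the-envelope check suggests why this overflows: the default pa.string() type stores offsets as 32-bit signed integers, so a single array cannot hold more than 2**31 - 1 bytes of string data, and the combined example above needs several gigabytes.

```python
# Each chunk's string is roughly 150 KB (about 5 characters per random
# integer, 30,000 integers), and there are 20,000 chunks, so the combined
# array needs roughly 3 GB of string data - past the int32 offset range.
print(len(x))           # length of one chunk's string, roughly 150_000
print(len(x) * 20_000)  # rough total, well over 2**31 - 1 bytes
print(t.nbytes)         # pyarrow's own count of the buffer sizes
```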
With smaller strings or smaller arrays this works fine.
```python
x = str(np.random.randint(low=0, high=1000, size=(10,)).tolist())
t = pa.chunked_array([x] * 1000)
combined = t.combine_chunks()
```
The first example that fails takes a few minutes to run. If you'd like a faster example for experimentation, you can use vaex to generate the chunked array much more quickly. It throws the identical error and runs in about 1 second.
```python
import vaex
import numpy as np

n = 50_000
x = str(np.random.randint(low=0, high=1000, size=(30_000,)).tolist())
df = vaex.from_arrays(
    id=list(range(n)),
    y=np.random.randint(low=0, high=1000, size=n),
)
df["text"] = vaex.vconstant(x, len(df))

# text_chunk_array is now a pyarrow.lib.ChunkedArray
text_chunk_array = df.text.values
x = text_chunk_array.combine_chunks()
```
Reporter: Ben Epstein
Related issues:
- https://github.com/apache/arrow/issues/25822 (relates to)
- https://github.com/apache/arrow/issues/28850 (relates to)
- https://github.com/apache/arrow/issues/23539 (relates to)
- https://github.com/apache/arrow/issues/26180 (supersedes)
Note: This issue was originally created as ARROW-17828. Please see the migration documentation for further details.
About this issue
- State: open
- Created 2 years ago
- Comments: 24 (4 by maintainers)
That sounds like #25822
Yes, it essentially is. We use 32-bit offsets for the default string and binary types. This means you always have to be careful when generating the arrays to chunk them if they get too large, or otherwise use the LargeString variant, which uses 64-bit offsets. Polars, for example, takes the latter approach; they always use LargeString. Basically, the bug with take and concatenation is that we aren't automatically handling the chunking for you. But you are right that essentially the same issue shows up in multiple places.
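As a sketch of that workaround (assuming the t and x from the reproduction above), casting the chunks to the 64-bit-offset type before combining should avoid the overflow:

```python
import pyarrow as pa

# Cast the 32-bit-offset string chunks to large_string (64-bit offsets)
# chunk by chunk, then combine; the concatenated buffer can exceed 2 GiB.
t_large = t.cast(pa.large_string())
combined = t_large.combine_chunks()

# Or build the data as large_string from the start:
t2 = pa.chunked_array([pa.array([x], type=pa.large_string())] * 20_000)
combined2 = t2.combine_chunks()
```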
I'll ping @maartenbreddels to get a deeper opinion.
@Ben-Epstein yup its really only a workaround that’s applicable to cases where you need the data in memory after filtering, and since vaex is meant to be lazy execution memory map as required a lot of folks will be encountering the problem in other contexts. Its certainly possible its a take explosion; the context in this case is that I’m working with filterset resulting from a df.take on an already potentially filtered set of indices of rows identified as having matching report ids due to the lack of indexing and a many to one record relationship between two tables that prevents a join from being used. Column failing are free text reports relating to leiomyosarcoma cases, so there’s less than 100 scattered throughout this table of >2M reports that get filtered via a regex query on a MedDRA term. Its possible the take is being multiplied across different tables/arrays from the different hdf5 files in the dataset multiplied by the separate chunks of those files and just creating a polynomial complexity, but I’m not familiar enough yet with the vaex internals to confirm that. As you figured out there the Dataframe take vs arrow take and the code complexity makes it a little challenging to debug. I’ll be able to look more under the hood of whats going on in a couple of days.
@leprechaunt33 thanks for checking back in. Since this same issue happened in huggingface datasets, it leads me to believe that it really is the .take explosion.
That’s certainly an interesting workaround, but not one I can use in production 😃
@leprechaunt33 I was working on a PR in vaex to remove the use of .take because in fact that is the issue coming from arrow (huggingface datasets faced the same issue). Details are here: https://github.com/vaexio/vaex/pull/2336. See the related issue referenced in the PR for more details.
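For illustration, a hedged sketch of that take explosion (assuming the behaviour described in #25822, where take on a ChunkedArray may gather the chunks into a single contiguous array): calling take on the large chunked string array from the first example can hit the same limit as combine_chunks.

```python
import pyarrow as pa

# Even an identity set of indices can overflow the 32-bit offsets,
# because take may gather the result into one contiguous string array
# (see #25822).
indices = pa.array(range(len(t)))
t.take(indices)  # may raise ArrowInvalid: offset overflow while concatenating arrays
```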
Current workaround I've developed for vaex in general with this pyarrow-related error, on dataframes where the technique mentioned above does not work (materialisation of a pandas array from a joined multi-file dataframe where I was unable to set the arrow data type on the column):
Could you add a comment here that contains only "take", like https://github.com/apache/arrow/issues/33849#issuecomment-1423372616? See also: https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment