arrow: Reading FixedSizeList from parquet is slower than reading values into more rows

Describe the bug, including details regarding any error messages, version, and platform.

I’m not 100% sure whether this is a bug, but I don’t understand the performance difference between the following two cases.

Nested arrays:

table with 30 columns, 2 level "index" (so 2 columns) and 29 FixedSizeList<double, 80> columns, 100k rows

Exploded:

table with 31 columns, 3 level "index" (so 3 columns) and 29 double columns, 8000k rows

Reading the first table from Parquet (format version 2.6, zstd compression, single file) is surprisingly slower than reading the second one. I’d assume it’s essentially the same task; a few of the first table’s columns are even shorter, and the file sizes are almost equal.

I used pyarrow 11 from conda and a local SSD.
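For reference, the two layouts look roughly like this (a scaled-down sketch with made-up column names, only 2 value columns and 1k rows, and the index columns left out):

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

rng = np.random.default_rng()
rows, width = 1_000, 80
flat = rng.standard_normal(rows * width)

# "Nested" layout: every row holds a FixedSizeList<double, 80>
nested = pa.table({
    'a': pa.FixedSizeListArray.from_arrays(pa.array(flat), width),
    'b': pa.FixedSizeListArray.from_arrays(pa.array(flat), width),
})

# "Exploded" layout: the same values as plain double columns, 80x the rows
exploded = pa.table({'a': flat, 'b': flat})

pq.write_table(nested, 'nested.parquet', version='2.6', compression='zstd')
pq.write_table(exploded, 'exploded.parquet', version='2.6', compression='zstd')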

Component(s)

Parquet, Python

Most upvoted comments

why DELTA_BINARY_PACKED is deeply flawed

The paper they link to actually explains why the approach is problematic: http://arxiv.org/pdf/1209.2137v5.pdf. The whole paper is about why not to implement delta compression this way 😂

@alamb @tustvold I saw your blog post about this for arrow-rs. Do you handle this differently in Rust?

We don’t support FixedSizeList in arrow-rs AFAIK. Parquet to my knowledge does not have an equivalent logical construct, and so it isn’t particularly clear to me what support would mean other than implicitly casting between a regular list and a fixed size list.
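For intuition, a minimal pyarrow sketch (not arrow-rs): a FixedSizeList is logically just a List whose offsets are the implied sequence 0, N, 2N, ...; Parquet only stores the generic repeated form, so a reader that wants a FixedSizeList back has to rebuild that shape:

import numpy as np
import pyarrow as pa

values = pa.array(np.arange(240, dtype='float64'))
fixed = pa.FixedSizeListArray.from_arrays(values, 80)  # 3 lists of 80 doubles

# The equivalent variable-length list just spells out the implied offsets
offsets = pa.array(np.arange(0, 240 + 80, 80, dtype=np.int32))
as_list = pa.ListArray.from_arrays(offsets, values)

assert fixed.flatten().equals(as_list.flatten())
print(fixed.type)    # fixed_size_list<item: double>[80]
print(as_list.type)  # list<item: double>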

Calculating the def and rep levels for 100k rows with all non-null values takes 2x the time of reading 8M doubles

Assuming the doubles are PLAIN encoded, this is not surprising to me: you are comparing the performance of what is effectively a memcpy that runs at memory bandwidth against a fairly complex bit-packing scheme used for the definition and repetition levels.
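For what it’s worth, the encodings actually used are easy to check from the column chunk metadata (a quick sketch using the t2.parquet file written in the repro further down):

import pyarrow.parquet as pq

meta = pq.ParquetFile('t2.parquet').metadata
col = meta.row_group(0).column(0)
# Typically shows PLAIN (or a dictionary encoding) for the double values
# plus RLE for the definition/repetition levels
print(col.path_in_schema, col.encodings, col.compression)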

In the Rust implementation we have a couple of tricks that help here, but it is still relatively expensive (at least compared to primitive decoding):

  • We decode definition levels directly to the null buffer if there are only nulls at the leaf level (i.e. no lists or nested nulls), allowing us to preserve the bit-packing (see the sketch after this list)
  • We have vectorised unpack implementations specialised for each bit width (I believe arrow C++ does this also)
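A rough numpy illustration of the first point above (not the actual arrow-rs code): with a single nullable leaf, the validity buffer is just each definition level compared against the maximum and re-packed into bits, so ideally a reader goes straight from the bit-packed levels to the null buffer without ever materializing an int16 array:

import numpy as np

# Hypothetical decoded definition levels, max level 1: 1 = value present, 0 = null
def_levels = np.array([1, 1, 0, 1, 1, 1, 0, 1], dtype=np.int16)

# Arrow's validity bitmap is these comparisons packed LSB-first
validity = np.packbits(def_levels == 1, bitorder='little')
print(validity)  # [187] -> 0b10111011, i.e. elements 2 and 6 are null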

Definition level data is all 1 and is supposed to be RLE encoded

It will actually all be 2, unless the doubles are themselves not nullable

Repetition level data is a vector of 0 followed by 79x1, repeated 100k times in our case. I’m not sure if RLE will help here; it sounds like an unnecessarily complex structure for fixed-size lists

These repetition levels will be RLE encoded. Theoretically a reader could preserve this, but the record shredding logic is extremely fiddly, so this might run the risk of adding complexity to an already very complex piece of code. At least in arrow-rs we always decode repetition levels to an array of i16.
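To make the sizes concrete, this is what the decoded int16 level arrays look like for the 100k-row FixedSizeList<double, 80> case discussed above (plain arithmetic, not a measurement):

import numpy as np

rows, width = 100_000, 80
n_leaves = rows * width  # 8M doubles

# Repetition levels: 0 starts each list, 1 continues it
rep = np.tile(np.array([0] + [1] * (width - 1), dtype=np.int16), rows)
# Definition levels: constant 2 when neither the lists nor the doubles are null
defs = np.full(n_leaves, 2, dtype=np.int16)

print(rep[:4], defs[:4])  # [0 1 1 1] [2 2 2 2]
print((rep.nbytes + defs.nbytes) / 1e6, 'MB of levels vs', n_leaves * 8 / 1e6, 'MB of doubles')  # 32.0 vs 64.0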

and it could be optimized to use a single static value, right?

Yes, a developer may optimize this in the future. If the FixedSizeList column is non-nullable (all the way down to the values), Parquet can use a single static value, but if it is nullable, it cannot.
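This is visible in the Parquet schemas of the files written in the repro below: whether the doubles are declared nullable determines the leaf column’s max definition level (a quick sketch, assuming t2.parquet and t3.parquet already exist):

import pyarrow.parquet as pq

for name in ('t2.parquet', 't3.parquet'):
    leaf = pq.ParquetFile(name).schema.column(0)
    # t2 declares the item non-nullable, t3 leaves it nullable; the max
    # definition level should come out as 1 vs 2, and only in the former
    # could the levels collapse to a single static value
    print(name, leaf.path, leaf.max_definition_level, leaf.max_repetition_level)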

other reasons or rep levels cascade somehow into even worse perf

I’ve profiled the C++ part on my macOS machine with a release build (O2):

  1. Decoding the doubles is fast
  2. Decoding the levels takes nearly as much time as decoding the doubles
  3. Constructing the List costs a little time

The benchmark uses list rather than FixedSizeList, but I think the behaviour is similar.

I’m not so familiar with the Python part; maybe someone can profile that path.

The same happens with non-null values (I’m not sure how to define the non-nullable list correctly, but it looks like it doesn’t matter):

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

arr_random = np.random.default_rng().standard_normal(size=[8000000], dtype='float64')
arr1 = pa.array(arr_random)                               # 8M flat doubles
arr2 = pa.FixedSizeListArray.from_arrays(arr_random, 80)  # the same values as 100k FixedSizeList<double, 80>
t1 = pa.Table.from_arrays([arr1], schema=pa.schema([('A', pa.float64(), False)]))
t2 = pa.Table.from_arrays([arr2], schema=pa.schema([('A', pa.list_(pa.field('A', pa.float64(), False), 80), False)]))  # non-nullable item field
t3 = pa.Table.from_arrays([arr2], schema=pa.schema([pa.field('A', pa.list_(pa.float64(), 80), False)]))  # default (nullable) item field

pq.write_table(t1, 't1.parquet')
pq.write_table(t2, 't2.parquet')
pq.write_table(t3, 't3.parquet')

%%timeit

t1 = pq.read_table('t1.parquet') # 30ms

%%timeit

t2 = pq.read_table('t2.parquet') # 100ms

%%timeit

t3 = pq.read_table('t3.parquet') # 100ms
print(t1.get_total_buffer_size(), t2.get_total_buffer_size(), t3.get_total_buffer_size()) # (64000000, 64000000, 64000000)
print(t1.schema, t2.schema, t3.schema)
# (A: double not null,
# A: fixed_size_list<A: double not null>[80] not null
#   child 0, A: double not null,
# A: fixed_size_list<item: double>[80] not null
#   child 0, item: double)