arrow: [C++][Parquet] `write_to_dataset` performance regression
### Describe the bug, including details regarding any error messages, version, and platform
On Linux, `pyarrow.parquet.write_to_dataset` shows a large performance regression in Arrow 12.0 versus 11.0.
The following results were collected on Ubuntu 22.04.2 LTS (kernel 5.15.0-71-generic), an Intel Haswell 4-core CPU @ 3.6 GHz, 16 GB RAM, and a Samsung 840 Pro SSD. Times are elapsed seconds to write a single int64 column of integers [0, …, length-1] with no compression and no multi-threading:
| Array length | Arrow 11 (s) | Arrow 12 (s) |
|---:|---:|---:|
| 1,000,000 | 0.1 | 0.1 |
| 2,000,000 | 0.2 | 0.4 |
| 4,000,000 | 0.3 | 1.6 |
| 8,000,000 | 0.8 | 6.2 |
| 16,000,000 | 2.3 | 24.4 |
| 32,000,000 | 6.5 | 94.1 |
| 64,000,000 | 13.5 | 371.7 |
The output directory was deleted before each run.
"""check.py"""
import sys
import time
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
def main():
path = '/tmp/test.parquet'
length = 10_000_000 if len(sys.argv) < 2 else int(sys.argv[1])
table = pa.Table.from_arrays([pa.array(np.arange(length))], names=['A'])
t0 = time.perf_counter()
pq.write_to_dataset(
table, path, schema=table.schema, use_legacy_dataset=False, use_threads=False, compression=None
)
duration = time.perf_counter() - t0
print(f'{duration:.2f}s')
if __name__ == '__main__':
main()
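Not part of the original report, but since the analysis below points at buffer alignment, a quick numpy-only probe can show how the benchmark's input arrays happen to be aligned. The helper name `alignment_probe` and the 64-byte boundary are assumptions based on Acero's `EnsureAlignment` check; numpy itself makes no 64-byte alignment guarantee:

```python
import numpy as np


def alignment_probe(length, boundary=64):
    """Return the data-pointer offset of np.arange(length) from `boundary`.

    Acero's EnsureAlignment check in Arrow 12.0 expects 64-byte-aligned
    buffers; numpy's allocator guarantees no such alignment, so a nonzero
    offset here means the zero-copy-wrapped buffer would get realigned.
    """
    arr = np.arange(length, dtype=np.int64)
    return arr.ctypes.data % boundary


# Small and large allocations often land on different offsets, since
# large buffers are typically served by mmap rather than the small heap.
for n in (1_000, 1_000_000):
    print(n, alignment_probe(n))
```

The offsets vary run to run and platform to platform, which is consistent with the regression depending on how numpy happens to place its buffers.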
Running `git bisect` on local builds leads me to this commit: 660d259f525d301f7ff5b90416622698fa8a5e9c ([C++] Add ordered/segmented aggregation Substrait extension, #34627). Following that change, flame graphs show a lot of additional time spent in `arrow::util::EnsureAlignment` calling glibc `memcpy`:
- Before: ~1.3 s (ddd0a337174e57cdc80b1ee30dc7e787acfc09f6)
- After: ~9.6 s (660d259f525d301f7ff5b90416622698fa8a5e9c)
Reading and `pyarrow.parquet.write_table` appear unaffected.
### Component(s)
C++, Parquet
### About this issue
- State: closed
- Created a year ago
- Reactions: 2
- Comments: 17 (17 by maintainers)
### Commits related to this issue
- GH-35498: [C++] Fix source node batch realignment — committed to rtpsw/arrow by rtpsw a year ago
- GH-35498: [C++] Remove alignment in source node — committed to jorisvandenbossche/arrow by jorisvandenbossche a year ago
- GH-35498: [C++] Relax EnsureAlignment check in Acero from requiring 64-byte aligned buffers to requiring value-aligned buffers (#35565) — committed to apache/arrow by westonpace a year ago
Looking at the code, I suspect the degraded performance is because the source table wraps misaligned numpy arrays, and each batch sliced from these arrays gets realigned by `EnsureAlignment`; since the default batch size is itself aligned, batch-slicing preserves the misalignment. This would explain why the degradation worsens with larger arrays, which get sliced into more batches. One way to test this theory is to increase the batch size in line with the array size: the degradation should then shrink.

As for the cause of the problem, it looks like

`pa.Table.from_arrays([pa.array(np.arange(length))], names=['A'])`

results in arrays that are misaligned by Arrow's standards, due to zero-copy wrapping of misaligned numpy arrays, which the Arrow spec forbids. However, since this code is natural and has likely been accepted since the beginning, the realignment should probably be done within Arrow (maybe with a warning), or be possible via Arrow configuration. Realigning the full arrays has a performance cost too, of course, but it should be much lower than that of realigning many batches of each array. Looking further out, I'd suggest considering adding facilities for obtaining per-Arrow-aligned numpy arrays and documenting them accordingly. If possible, better yet would be for numpy to support memory-alignment configuration, so that Arrow user code would not need to change.
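To illustrate the "per-Arrow-aligned numpy arrays" facility suggested above, here is a minimal user-side sketch; the helper `aligned_empty` is hypothetical, not an existing numpy or Arrow API. It over-allocates a byte buffer and takes a view starting at the first 64-byte boundary:

```python
import numpy as np


def aligned_empty(length, dtype=np.int64, boundary=64):
    """Allocate a 1-D array whose data pointer is `boundary`-aligned.

    numpy exposes no alignment knob, so over-allocate by `boundary`
    bytes and slice the buffer at the first aligned offset.
    """
    itemsize = np.dtype(dtype).itemsize
    raw = np.empty(length * itemsize + boundary, dtype=np.uint8)
    offset = (-raw.ctypes.data) % boundary  # bytes to the next boundary
    return raw[offset:offset + length * itemsize].view(dtype)


# An array built this way satisfies a 64-byte alignment check, so
# zero-copy wrapping it should avoid the per-batch realignment memcpy.
a = aligned_empty(1_000_000)
a[:] = np.arange(len(a))  # same data as the benchmark input
assert a.ctypes.data % 64 == 0
```

Wrapping such an array with `pa.array` would then be both zero-copy and aligned, which, if the theory above holds, sidesteps the realignment cost without any change inside Arrow.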