arrow-rs: Bug?: Writer with EnabledStatistics::Page on large arrays consumes 10x more memory

Describe the bug When writing a Parquet file using arrow-rs with large array columns, memory consumption is about 10x higher when EnabledStatistics is set to EnabledStatistics::Page. The schema of the Parquet file has the following fields (a sketch of the writer configuration follows the list):

  • timestamp, UInt64
  • num_points, UInt32
  • x, Array[Float32], size is 250 000
  • y, Array[Float32], size is 250 000
  • z, Array[Float32], size is 250 000
  • intensity, Array[UInt8], size is 250 000
  • ring, Array[UInt8], size is 250 000
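For context, page-level statistics are enabled through WriterProperties. Below is a minimal sketch of how such a writer is typically set up; this is my assumption about the shape of the reproduction code, not a verbatim excerpt from parquet-example-rs:

```rust
use std::fs::File;

use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::{EnabledStatistics, WriterProperties};

fn write_with_page_stats(batch: &RecordBatch, path: &str) -> Result<(), Box<dyn std::error::Error>> {
    // Page mode tracks min/max per data page; this is the mode that shows
    // the ~10x memory blow-up compared to None and Chunk.
    let props = WriterProperties::builder()
        .set_statistics_enabled(EnabledStatistics::Page)
        .build();
    let file = File::create(path)?;
    let mut writer = ArrowWriter::try_new(file, batch.schema(), Some(props))?;
    writer.write(batch)?;
    writer.close()?;
    Ok(())
}
```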

To Reproduce

  • Fork https://github.com/REASY/parquet-example-rs
  • Run cargo build --release && /usr/bin/time -pv target/release/parquet-example-rs --output-parquet-folder output --rows 8000 --statistics-mode page
  • Check Maximum resident set size (kbytes) from /usr/bin/time

Curiously, if I run the DHAT memory profiler, I do not see much difference in memory consumption: https://github.com/REASY/parquet-example-rs#memory-profiler

I also ran valgrind to check for memory-related issues; it takes ~4 hours, but nothing was reported:

➜  parquet-example-rs git:(main) ✗ cargo install cargo-valgrind
➜  parquet-example-rs git:(main) ✗ cargo valgrind run -- --output-parquet-folder output --rows 1000 --statistics-mode page
   Compiling parquet-example-rs v0.1.0 (/home/artavazd.balaian/work/github/REASY/parquet-example-rs)
    Finished dev [unoptimized + debuginfo] target(s) in 2.14s
     Running `/home/artavazd.balaian/.cargo/bin/cargo-valgrind target/debug/parquet-example-rs --output-parquet-folder output --rows 1000 --statistics-mode page`
Received args: AppArgs { output_parquet_folder: "output", rows: 1000, statistics_mode: Page }
Processed 500 msgs with throughput 0.073 msg/s
Processed 1000 msgs with throughput 0.073 msg/s
Wrote 1000 Lidar Point Cloud to parquet in 13759.010 seconds, average throughput 0.073 msg/s

When I trace the code, the only place where EnabledStatistics::Page is used is https://github.com/apache/arrow-rs/blob/1d6feeacebb8d0d659d493b783ba381940973745/parquet/src/column/writer/encoder.rs#L139-L144, and it is not clear how that could cause so much allocation.
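My reading of that span (a conceptual paraphrase under my own naming, not the actual arrow-rs code) is that it only gates whether per-page min/max values are captured, roughly:

```rust
use parquet::file::properties::EnabledStatistics;

// Conceptual paraphrase of the linked span: when page statistics are
// enabled, snapshot the min/max of the values buffered for the current
// page; otherwise skip the work entirely.
fn page_min_max<T: PartialOrd + Copy>(
    statistics_enabled: EnabledStatistics,
    page_values: &[T],
) -> Option<(T, T)> {
    if statistics_enabled != EnabledStatistics::Page {
        return None;
    }
    let first = *page_values.first()?;
    let (mut min, mut max) = (first, first);
    for &v in &page_values[1..] {
        if v < min { min = v; }
        if v > max { max = v; }
    }
    // The resulting (min, max) pair is retained for every page until the
    // row group is flushed, so statistics accumulate per page in memory.
    Some((min, max))
}
```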

Expected behavior Memory consumption with EnabledStatistics::Page should be comparable to the None and Chunk modes.

Additional context Dependencies (from Cargo.toml):

arrow = "47"
clap = { version = "4", features = ["derive"] }
dhat = "0.3.2"
once_cell = "1.18.0"
parquet = "47"
rand = "0.8"

A comparison between the three statistics modes is documented at https://github.com/REASY/parquet-example-rs#page-statistics-consume-10x-more-memory-when-write-8000-rows; for the same number of rows, see the table below:

| Statistics mode | Number of rows | Total time, seconds | CPU usage, % | Average throughput, rows/s | Maximum resident set size, Mbytes | Output Parquet size, Mbytes |
|---|---|---|---|---|---|---|
| None | 8000 | 113.124 | 96 | 70.719 | 752.67 | 38148.34 |
| Chunk | 8000 | 128.318 | 97 | 62.345 | 790.96 | 38148.37 |
| Page | 8000 | 130.53 | 98 | 61.301 | 8516.36 | 38148.88 |

Even though writing with Page stats uses 10x more memory, in terms of file size (the file is not compressed), the Page output is only 548.2 Kbytes larger than None.

About this issue

  • Original URL
  • State: open
  • Created 8 months ago
  • Comments: 20 (8 by maintainers)

Most upvoted comments

I can’t seem to reproduce this using heaptrack:

$ heaptrack ./target/release/parquet-example-rs --output-parquet-folder output --rows 1500 --statistics-mode page

[heaptrack memory profile, statistics-mode page]

vs

$ heaptrack ./target/release/parquet-example-rs --output-parquet-folder output --rows 1500 --statistics-mode none

[heaptrack memory profile, statistics-mode none]

Both show the sawtooth pattern I would expect as it creates batches and then flushes them to disk.

Yes, it will accumulate statistics per page, per row group. You could also try increasing the page size for that column, but I can’t help feeling this data is ill-suited to a general-purpose analytics data format…
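For anyone who wants to try the page-size suggestion, here is a hedged sketch of both mitigations (the column path "x.list.element" is my assumption about how the list column is addressed internally; adjust it to the actual schema):

```rust
use parquet::file::properties::{EnabledStatistics, WriterProperties};
use parquet::schema::types::ColumnPath;

fn mitigated_props() -> WriterProperties {
    WriterProperties::builder()
        // The default data page size limit is 1 MiB; a single row of x is
        // already ~1 MiB of Float32, so every row starts a new page. A
        // larger limit means fewer pages, hence fewer per-page statistics
        // held in memory per row group.
        .set_data_page_size_limit(8 * 1024 * 1024)
        // Alternatively, keep only chunk-level statistics for the heavy
        // array columns ("x.list.element" is an assumed column path).
        .set_column_statistics_enabled(
            ColumnPath::from(vec![
                "x".to_string(),
                "list".to_string(),
                "element".to_string(),
            ]),
            EnabledStatistics::Chunk,
        )
        .build()
}
```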