arrow-rs: Bug?: Writer with EnabledStatistics::Page on large arrays consumes 10x more memory
Describe the bug When writing Parquet files using arrow-rs with large array columns, memory consumption is about 10x higher when EnabledStatistics is set to EnabledStatistics::Page. The Parquet file has the following schema (a sketch of how it might be declared with arrow-rs follows the list):
- timestamp, UInt64
- num_points, UInt32
- x, Array[Float32], size is 250 000
- y, Array[Float32], size is 250 000
- z, Array[Float32], size is 250 000
- intensity, Array[UInt8], size is 250 000
- ring, Array[UInt8], size is 250 000
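For illustration, a minimal sketch of how such a schema might be declared with arrow-rs. The use of FixedSizeList, the nullability, and the helper name `point_cloud_schema` are assumptions for this sketch and may differ from the actual example repository:

```rust
use std::sync::Arc;
use arrow::datatypes::{DataType, Field, Schema};

// Hypothetical schema matching the description above; the real example at
// https://github.com/REASY/parquet-example-rs may use List instead of
// FixedSizeList, different nullability, or different field names.
const POINTS: i32 = 250_000;

fn point_cloud_schema() -> Schema {
    let float_item = Arc::new(Field::new("item", DataType::Float32, false));
    let byte_item = Arc::new(Field::new("item", DataType::UInt8, false));
    Schema::new(vec![
        Field::new("timestamp", DataType::UInt64, false),
        Field::new("num_points", DataType::UInt32, false),
        Field::new("x", DataType::FixedSizeList(float_item.clone(), POINTS), false),
        Field::new("y", DataType::FixedSizeList(float_item.clone(), POINTS), false),
        Field::new("z", DataType::FixedSizeList(float_item, POINTS), false),
        Field::new("intensity", DataType::FixedSizeList(byte_item.clone(), POINTS), false),
        Field::new("ring", DataType::FixedSizeList(byte_item, POINTS), false),
    ])
}
```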
To Reproduce
- Fork https://github.com/REASY/parquet-example-rs
- Run `cargo build --release && /usr/bin/time -pv target/release/parquet-example-rs --output-parquet-folder output --rows 8000 --statistics-mode page`
- Check `Maximum resident set size (kbytes)` reported by /usr/bin/time
Curiously, if I run the DHAT memory profiler, I do not see much of a difference in memory consumption: https://github.com/REASY/parquet-example-rs#memory-profiler
I also ran valgrind to check for memory-related issues; it took ~4 hours, but nothing was reported:
➜ parquet-example-rs git:(main) ✗ cargo install cargo-valgrind
➜ parquet-example-rs git:(main) ✗ cargo valgrind run -- --output-parquet-folder output --rows 1000 --statistics-mode page
Compiling parquet-example-rs v0.1.0 (/home/artavazd.balaian/work/github/REASY/parquet-example-rs)
Finished dev [unoptimized + debuginfo] target(s) in 2.14s
Running `/home/artavazd.balaian/.cargo/bin/cargo-valgrind target/debug/parquet-example-rs --output-parquet-folder output --rows 1000 --statistics-mode page`
Received args: AppArgs { output_parquet_folder: "output", rows: 1000, statistics_mode: Page }
Processed 500 msgs with throughput 0.073 msg/s
Processed 1000 msgs with throughput 0.073 msg/s
Wrote 1000 Lidar Point Cloud to parquet in 13759.010 seconds, average throughput 0.073 msg/s
When I trace the code, the only place where EnabledStatistics::Page is used is https://github.com/apache/arrow-rs/blob/1d6feeacebb8d0d659d493b783ba381940973745/parquet/src/column/writer/encoder.rs#L139-L144, and it is not clear how it can cause so much allocation.
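For context, a minimal sketch of how the writer is presumably configured to enable page-level statistics. The ArrowWriter setup and file handling here are assumptions for illustration; only `set_statistics_enabled` and `EnabledStatistics::Page` are taken from the parquet crate:

```rust
use std::fs::File;
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::{EnabledStatistics, WriterProperties};

// Sketch: write a batch with page-level statistics enabled.
// Per-page min/max statistics and page index entries are collected for
// every data page and accumulate until the row group is flushed.
fn write_with_page_stats(file: File, batch: &RecordBatch) -> parquet::errors::Result<()> {
    let props = WriterProperties::builder()
        .set_statistics_enabled(EnabledStatistics::Page)
        .build();
    let mut writer = ArrowWriter::try_new(file, batch.schema(), Some(props))?;
    writer.write(batch)?;
    writer.close()?;
    Ok(())
}
```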
Expected behavior
Additional context Dependencies:
arrow = "47"
clap = { version = "4", features = ["derive"] }
dhat = "0.3.2"
once_cell = "1.18.0"
parquet = "47"
rand = "0.8"
The comparison between the three statistics modes is documented at https://github.com/REASY/parquet-example-rs#page-statistics-consume-10x-more-memory-when-write-8000-rows; for the same number of rows, see the table below:
Statistics mode | Number of rows | Total time, seconds | CPU usage, % | Average throughput, rows/s | Maximum resident set size, Mbytes | Output Parquet size, Mbytes |
---|---|---|---|---|---|---|
None | 8000 | 113.124 | 96 | 70.719 | 752.67 | 38148.34 |
Chunk | 8000 | 128.318 | 97 | 62.345 | 790.96 | 38148.37 |
Page | 8000 | 130.53 | 98 | 61.301 | 8516.36 | 38148.88 |
Even though writing with Page statistics uses 10x more memory, in terms of file size (the output is not compressed) the Page output is only 548.2 Kbytes larger than None.
About this issue
- State: open
- Created 8 months ago
- Comments: 20 (8 by maintainers)
I can’t seem to reproduce this using heaptrack:
$ heaptrack ./target/release/parquet-example-rs --output-parquet-folder output --rows 1500 --statistics-mode page
vs
$ heaptrack ./target/release/parquet-example-rs --output-parquet-folder output --rows 1500 --statistics-mode none
Both show the sawtooth pattern I would expect as it creates batches, and then flushes them to disk
Perhaps try https://github.com/KDE/heaptrack
Yes, it will accumulate per page per row group. You could also try increasing the page size for that column, but I can’t help feeling this data is ill-suited for a general-purpose analytics data format…
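To illustrate the two workarounds mentioned above, here is a minimal sketch using the parquet crate's WriterProperties. The column path "x.list.item" and the concrete size limit are assumptions (the leaf path for a list column depends on how the schema is declared), not values from the original repository:

```rust
use parquet::file::properties::{EnabledStatistics, WriterProperties};
use parquet::schema::types::ColumnPath;

// Sketch of the two suggested mitigations; column paths and limits are
// illustrative, not taken from the original repository.
fn mitigated_props() -> WriterProperties {
    WriterProperties::builder()
        // Keep page-level statistics as the default...
        .set_statistics_enabled(EnabledStatistics::Page)
        // Option 1: larger data pages -> fewer pages per row group,
        // so less per-page statistics/index state accumulates in memory.
        .set_data_page_size_limit(8 * 1024 * 1024)
        // Option 2: ...but drop page-level statistics (keeping chunk-level)
        // only for the large array columns.
        .set_column_statistics_enabled(
            ColumnPath::from("x.list.item"),
            EnabledStatistics::Chunk,
        )
        .build()
}
```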