arrow-rs: Bug?: Writer with EnabledStatistics::Page on large arrays consumes 10x more memory

Describe the bug When writing a Parquet file using arrow-rs with large array columns, memory consumption is about 10x higher when EnabledStatistics is set to EnabledStatistics::Page. The schema of the Parquet file has the following fields (a sketch of the writer configuration follows the list):

  • timestamp, UInt64
  • num_points, UInt32
  • x, Array[Float32], size is 250 000
  • y, Array[Float32], size is 250 000
  • z, Array[Float32], size is 250 000
  • intensity, Array[UInt8], size is 250 000
  • ring, Array[UInt8], size is 250 000
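For context, page-level statistics are enabled through WriterProperties. Below is a minimal sketch of how such a writer is typically set up; this is my assumption about the shape of the reproduction code, not a verbatim excerpt from parquet-example-rs:

```rust
use std::fs::File;

use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::{EnabledStatistics, WriterProperties};

fn write_with_page_stats(batch: &RecordBatch, path: &str) -> Result<(), Box<dyn std::error::Error>> {
    // Page mode tracks min/max per data page; this is the mode that shows
    // the ~10x memory blow-up compared to None and Chunk.
    let props = WriterProperties::builder()
        .set_statistics_enabled(EnabledStatistics::Page)
        .build();
    let file = File::create(path)?;
    let mut writer = ArrowWriter::try_new(file, batch.schema(), Some(props))?;
    writer.write(batch)?;
    writer.close()?;
    Ok(())
}
```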

To Reproduce

  • Fork https://github.com/REASY/parquet-example-rs
  • Run cargo build --release && /usr/bin/time -pv target/release/parquet-example-rs --output-parquet-folder output --rows 8000 --statistics-mode page
  • Check Maximum resident set size (kbytes) from /usr/bin/time

Curiously, if I run the DHAT memory profiler, I do not see much difference in memory consumption: https://github.com/REASY/parquet-example-rs#memory-profiler

I also ran valgrind to check for memory-related issues; it takes ~4 hours, but nothing was reported:

➜  parquet-example-rs git:(main) ✗ cargo install cargo-valgrind
➜  parquet-example-rs git:(main) ✗ cargo valgrind run -- --output-parquet-folder output --rows 1000 --statistics-mode page
   Compiling parquet-example-rs v0.1.0 (/home/artavazd.balaian/work/github/REASY/parquet-example-rs)
    Finished dev [unoptimized + debuginfo] target(s) in 2.14s
     Running `/home/artavazd.balaian/.cargo/bin/cargo-valgrind target/debug/parquet-example-rs --output-parquet-folder output --rows 1000 --statistics-mode page`
Received args: AppArgs { output_parquet_folder: "output", rows: 1000, statistics_mode: Page }
Processed 500 msgs with throughput 0.073 msg/s
Processed 1000 msgs with throughput 0.073 msg/s
Wrote 1000 Lidar Point Cloud to parquet in 13759.010 seconds, average throughput 0.073 msg/s

When I trace the code, the only place where EnabledStatistics::Page is used is https://github.com/apache/arrow-rs/blob/1d6feeacebb8d0d659d493b783ba381940973745/parquet/src/column/writer/encoder.rs#L139-L144, and it is not clear how that could cause so much allocation.
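My reading of that span (a conceptual paraphrase under my own naming, not the actual arrow-rs code) is that it only gates whether per-page min/max values are captured, roughly:

```rust
use parquet::file::properties::EnabledStatistics;

// Conceptual paraphrase of the linked span: when page statistics are
// enabled, snapshot the min/max of the values buffered for the current
// page; otherwise skip the work entirely.
fn page_min_max<T: PartialOrd + Copy>(
    statistics_enabled: EnabledStatistics,
    page_values: &[T],
) -> Option<(T, T)> {
    if statistics_enabled != EnabledStatistics::Page {
        return None;
    }
    let first = *page_values.first()?;
    let (mut min, mut max) = (first, first);
    for &v in &page_values[1..] {
        if v < min { min = v; }
        if v > max { max = v; }
    }
    // The resulting (min, max) pair is retained for every page until the
    // row group is flushed, so statistics accumulate per page in memory.
    Some((min, max))
}
```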

Expected behavior Memory consumption with EnabledStatistics::Page should be comparable to the None and Chunk modes.

Additional context Dependencies (from Cargo.toml):

arrow = "47"
clap = { version = "4", features = ["derive"] }
dhat = "0.3.2"
once_cell = "1.18.0"
parquet = "47"
rand = "0.8"

A comparison between the three statistics modes is documented at https://github.com/REASY/parquet-example-rs#page-statistics-consume-10x-more-memory-when-write-8000-rows; for the same number of rows, see the table below:

| Statistics mode | Number of rows | Total time, seconds | CPU usage, % | Average throughput, rows/s | Maximum resident set size, Mbytes | Output Parquet size, Mbytes |
|---|---|---|---|---|---|---|
| None | 8000 | 113.124 | 96 | 70.719 | 752.67 | 38148.34 |
| Chunk | 8000 | 128.318 | 97 | 62.345 | 790.96 | 38148.37 |
| Page | 8000 | 130.53 | 98 | 61.301 | 8516.36 | 38148.88 |

Even though writing with Page stats uses 10x more memory, in terms of file size (the file is not compressed), the Page output is only 548.2 Kbytes larger than None.

About this issue

  • Original URL
  • State: open
  • Created 8 months ago
  • Comments: 20 (8 by maintainers)

Most upvoted comments

I can’t seem to reproduce this using heaptrack:

$ heaptrack ./target/release/parquet-example-rs --output-parquet-folder output --rows 1500 --statistics-mode page

[heaptrack memory profile, statistics-mode page]

vs

$ heaptrack ./target/release/parquet-example-rs --output-parquet-folder output --rows 1500 --statistics-mode none

[heaptrack memory profile, statistics-mode none]

Both show the sawtooth pattern I would expect as it creates batches and then flushes them to disk.

Yes, it will accumulate statistics per page, per row group. You could also try increasing the page size for that column, but I can’t help feeling this data is ill-suited to a general-purpose analytics data format…
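For anyone who wants to try the page-size suggestion, here is a hedged sketch of both mitigations (the column path "x.list.element" is my assumption about how the list column is addressed internally; adjust it to the actual schema):

```rust
use parquet::file::properties::{EnabledStatistics, WriterProperties};
use parquet::schema::types::ColumnPath;

fn mitigated_props() -> WriterProperties {
    WriterProperties::builder()
        // The default data page size limit is 1 MiB; a single row of x is
        // already ~1 MiB of Float32, so every row starts a new page. A
        // larger limit means fewer pages, hence fewer per-page statistics
        // held in memory per row group.
        .set_data_page_size_limit(8 * 1024 * 1024)
        // Alternatively, keep only chunk-level statistics for the heavy
        // array columns ("x.list.element" is an assumed column path).
        .set_column_statistics_enabled(
            ColumnPath::from(vec![
                "x".to_string(),
                "list".to_string(),
                "element".to_string(),
            ]),
            EnabledStatistics::Chunk,
        )
        .build()
}
```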