arrow-rs: AsyncArrowWriter doesn't limit underlying ArrowWriter to respect buffer-size
An AsyncArrowWriter created with the default WriterProperties gets max_row_group_size = DEFAULT_MAX_ROW_GROUP_SIZE = 1024 * 1024. This means the underlying ArrowWriter won't flush to disk until that many rows have been buffered, which leads to enormous memory consumption: the writer caches up to DEFAULT_MAX_ROW_GROUP_SIZE rows (1048576 by default) and ignores the buffer_capacity config entirely.
This is because the flushing condition of the sync writer is:
if in_progress.buffered_rows >= self.max_row_group_size {
self.flush()?
}
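For example, a smaller max_row_group_size can be set via WriterProperties so that row groups (and therefore the rows buffered in memory) stay small. A minimal sketch using the synchronous ArrowWriter; the 10_000-row limit and file name are illustrative, not from this issue:

```rust
use std::sync::Arc;

use arrow_array::{Int64Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from_iter_values(0..1024))],
    )?;

    // Lowering max_row_group_size bounds how many rows are buffered
    // before a row group is flushed to the underlying writer.
    let props = WriterProperties::builder()
        .set_max_row_group_size(10_000) // default is 1024 * 1024 rows
        .build();

    let file = std::fs::File::create("limited.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, Some(props))?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```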
To Reproduce
Try writing many large rows to Parquet with AsyncArrowWriter; you will see that memory consumption does not respect the buffer size.
UPD: an MRE was created: https://github.com/apache/arrow-rs/issues/5450#issuecomment-1973921969
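For reference, a rough sketch of that kind of reproduction (this is not the linked MRE; it assumes an arrow-rs version in which AsyncArrowWriter::try_new still takes a buffer-capacity argument, and the file name, row sizes, and counts are made up):

```rust
use std::sync::Arc;

use arrow_array::{RecordBatch, StringArray};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::AsyncArrowWriter;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("payload", DataType::Utf8, false)]));
    let file = tokio::fs::File::create("repro.parquet").await?;

    // Buffer capacity of 10 MiB; memory use still grows far beyond this,
    // because rows are cached until max_row_group_size (1024 * 1024) is hit.
    let mut writer = AsyncArrowWriter::try_new(file, schema.clone(), 10 * 1024 * 1024, None)?;

    let row = "x".repeat(1024); // ~1 KiB per row
    for _ in 0..1000 {
        let batch = RecordBatch::try_new(
            schema.clone(),
            vec![Arc::new(StringArray::from(vec![row.as_str(); 1000]))],
        )?;
        writer.write(&batch).await?;
    }
    writer.close().await?;
    Ok(())
}
```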
Expected behavior
Ideally, it should respect the buffer config, i.e. flush when either the buffer size or the max row group size is reached. But even if the current behavior is intentional for some reason, the documentation should clearly highlight it.
Additional context
By the way, why is the default 1024 * 1024? It looks like a byte count, even though the setting is a row count.
About this issue
- State: closed
- Created 4 months ago
- Comments: 18 (10 by maintainers)
Commits related to this issue
- Document parquet writer memory limiting (#5450) — committed to tustvold/arrow-rs by tustvold 4 months ago
- Document parquet writer memory limiting (#5450) (#5457) * Document parquet writer memory limiting (#5450) * Review feedback * Review feedback — committed to apache/arrow-rs by tustvold 4 months ago
I think we should start by documenting the current state of play and go from there; I'll try to get something up later today. It may be we can get by with just an example showing how to limit memory usage.
Yeah, that totally makes sense.
Well, it definitely was AsyncArrowWriter. Once I decreased the max row group size, usage became normal, but it still caches up to the row-count limit, ignoring any buffer limits. I also tried skipping the Arrow writer entirely and writing to disk directly in a streaming manner, and there were no issues.
I think I can provide an MRE easily. It will be easier to profile.
The structure of Parquet forces us to buffer an entire row group before we can flush it. The async writer should do a better job of calling this out.
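One possible way to call it out is a memory-limiting example. Here is a sketch that assumes ArrowWriter::in_progress_size() and ArrowWriter::flush() are available; the memory_limit threshold is a hypothetical parameter chosen by the caller:

```rust
use arrow_array::RecordBatch;
use parquet::arrow::ArrowWriter;

/// Writes a batch and flushes the in-progress row group early once the
/// buffered data exceeds `memory_limit` bytes, trading smaller row groups
/// for bounded memory usage.
fn write_bounded<W: std::io::Write + Send>(
    writer: &mut ArrowWriter<W>,
    batch: &RecordBatch,
    memory_limit: usize, // e.g. 10 * 1024 * 1024
) -> parquet::errors::Result<()> {
    writer.write(batch)?;
    if writer.in_progress_size() > memory_limit {
        // Ends the current row group; subsequent writes start a new one.
        writer.flush()?;
    }
    Ok(())
}
```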
Something is wrong here; it should only consume up to 10 MB. Perhaps you could use a memory profiler to identify where the usage is coming from.