arrow-rs: AsyncArrowWriter doesn't limit underlying ArrowWriter to respect buffer-size

An AsyncArrowWriter created with default WriterProperties gets a max_row_group_size of DEFAULT_MAX_ROW_GROUP_SIZE = 1024 * 1024 rows.

This means the underlying ArrowWriter won't flush to disk until that row count is reached. The result is enormous memory consumption: the writer buffers up to DEFAULT_MAX_ROW_GROUP_SIZE rows (1,048,576 by default) and ignores the buffer_capacity config entirely.
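
For concreteness, the row-group limit lives in WriterProperties and is counted in rows, not bytes; a minimal sketch checking the default (assuming the parquet crate's public DEFAULT_MAX_ROW_GROUP_SIZE constant and max_row_group_size getter):

    use parquet::file::properties::{WriterProperties, DEFAULT_MAX_ROW_GROUP_SIZE};

    fn main() {
        // Default properties: there is no byte-based flush threshold for
        // the in-progress row group, only this row-count limit.
        let props = WriterProperties::builder().build();
        assert_eq!(props.max_row_group_size(), DEFAULT_MAX_ROW_GROUP_SIZE); // 1024 * 1024 rows
    }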

Because the sync writer's flushing condition checks only the buffered row count, never the buffered bytes:

    if in_progress.buffered_rows >= self.max_row_group_size {
        self.flush()?
    }
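
For comparison, a condition that also honored a byte budget might look like the sketch below; memory_size and max_buffer_size are made-up names for illustration, not actual arrow-rs fields:

    // Hypothetical: flush when either the row-count limit or a byte-size
    // budget is reached. Both names below are invented for illustration
    // and do not exist in arrow-rs.
    if in_progress.buffered_rows >= self.max_row_group_size
        || in_progress.memory_size >= self.max_buffer_size
    {
        self.flush()?
    }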

To Reproduce

Try to write many large rows to parquet with AsyncArrowWriter; you will see that memory consumption does not respect the buffer size.

Update: an MRE was created: https://github.com/apache/arrow-rs/issues/5450#issuecomment-1973921969
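
For reference, a reproduction along these lines (a sketch, not the linked MRE; it assumes the arrow-rs 50.x AsyncArrowWriter::try_new signature, whose third argument is the buffer capacity in bytes):

    use std::sync::Arc;

    use arrow_array::{ArrayRef, RecordBatch, StringArray};
    use arrow_schema::{DataType, Field, Schema};
    use parquet::arrow::AsyncArrowWriter;
    use tokio::fs::File;

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        let schema = Arc::new(Schema::new(vec![Field::new("s", DataType::Utf8, false)]));
        let file = File::create("big.parquet").await?;

        // 10 MB buffer_capacity with default WriterProperties, so the
        // default max_row_group_size of 1024 * 1024 rows still applies.
        let mut writer = AsyncArrowWriter::try_new(file, schema.clone(), 10 * 1024 * 1024, None)?;

        // ~10 KB per row, 1024 * 1024 rows in total. The inner writer
        // buffers all of them before the first flush, so resident memory
        // grows far past the 10 MB buffer_capacity.
        let row = "x".repeat(10 * 1024);
        for _ in 0..1024 {
            let col: ArrayRef = Arc::new(StringArray::from(vec![row.as_str(); 1024]));
            writer.write(&RecordBatch::try_new(schema.clone(), vec![col])?).await?;
        }
        writer.close().await?;
        Ok(())
    }

With ~10 KB rows and the default 1,048,576-row limit, this buffers roughly 10 GB before the first flush, matching the numbers reported in the comments below.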

Expected behavior

Ideally, it should respect the buffer config, i.e. flush when either the buffer size or the max row group size is reached.

But even if the current behavior is intended for some reason, the documentation should clearly highlight it.
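
Until the writer accounts for bytes, the practical mitigation is to shrink max_row_group_size so the inner writer flushes sooner. A sketch (make_writer and both thresholds are arbitrary choices, and the arrow-rs 50.x try_new signature with a buffer_capacity argument is assumed):

    use std::sync::Arc;

    use arrow_schema::Schema;
    use parquet::arrow::AsyncArrowWriter;
    use parquet::file::properties::WriterProperties;
    use tokio::fs::File;

    async fn make_writer(
        schema: Arc<Schema>,
    ) -> Result<AsyncArrowWriter<File>, Box<dyn std::error::Error>> {
        let props = WriterProperties::builder()
            // Bound buffered rows: with ~10 KB rows this caps the
            // in-progress row group near 100 MB instead of ~10 GB.
            .set_max_row_group_size(10_000)
            .build();
        let file = File::create("bounded.parquet").await?;
        Ok(AsyncArrowWriter::try_new(file, schema, 10 * 1024 * 1024, Some(props))?)
    }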

Additional context

Btw, why is the default 1024 * 1024? It reads like byte units, but the value counts rows.

About this issue

  • Original URL
  • State: closed
  • Created 4 months ago
  • Comments: 18 (10 by maintainers)

Most upvoted comments

I think we should start by documenting the current state of play and go from there; I'll try to get something up later today. It may be we can get by with just an example showing how to limit memory usage.
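
One shape such an example could take: the sync ArrowWriter exposes flush(), which finalizes the in-progress row group early, so a caller can cap buffered rows manually. A sketch (the 10,000-row threshold and file name are arbitrary):

    use std::fs::File;
    use std::sync::Arc;

    use arrow_array::{ArrayRef, Int64Array, RecordBatch};
    use arrow_schema::{DataType, Field, Schema};
    use parquet::arrow::ArrowWriter;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
        let mut writer = ArrowWriter::try_new(File::create("out.parquet")?, schema.clone(), None)?;

        let mut buffered_rows = 0usize;
        for chunk in 0..100i64 {
            let col: ArrayRef = Arc::new(Int64Array::from_iter_values(chunk * 1_000..(chunk + 1) * 1_000));
            let batch = RecordBatch::try_new(schema.clone(), vec![col])?;
            writer.write(&batch)?;
            buffered_rows += batch.num_rows();
            // flush() ends the current row group regardless of
            // max_row_group_size, trading smaller row groups for
            // bounded memory.
            if buffered_rows >= 10_000 {
                writer.flush()?;
                buffered_rows = 0;
            }
        }
        writer.close()?;
        Ok(())
    }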

> The structure of parquet forces us to buffer an entire row group before we can flush it.

Yeah, that totally makes sense.

> Something is wrong here; it should only consume up to 10 MB. Perhaps you could use a memory profiler to identify where the usage is coming from.

Well, it definitely was AsyncArrowWriter. Once I decreased the max row group size, usage became normal, but it still caches up to the row-count limit, ignoring any buffer limits.

I also tried not using the Arrow writer at all and instead wrote to disk directly in a streaming manner; there were no issues.

I think I can provide an MRE easily; that will make it easier to profile.

The structure of parquet forces us to buffer an entire row group before we can flush it. The async writer should do a better job of calling this out.

> It consumed 10 GB of memory accordingly.

Something is wrong here; it should only consume up to 10 MB. Perhaps you could use a memory profiler to identify where the usage is coming from.