arrow: [Python] [Parquet] Compression degradation when column type changed from INT64 to INT32
Describe the bug, including details regarding any error messages, version, and platform.
Within a CSV of ~17M rows, I have a column of unique integers that are fairly uniformly distributed between 0 and 200,000,000. I am reading the CSV as follows:
```python
from pyarrow import csv, parquet

def file_to_data_frame_to_parquet(local_file: str, parquet_file: str) -> None:
    table = csv.read_csv(
        local_file,
        convert_options=csv.ConvertOptions(strings_can_be_null=True),
    )
    parquet.write_table(table, parquet_file, compression='zstd')
```
When I read the column without any type specification, the uncompressed size is 133.1 MB and the compressed size is 18.0 MB.
When I add an explicit type mapping of either `uint32` or `int32` for that column in the `read_csv` step, the total uncompressed size shrinks to 67.0 MB, but the compressed size expands to 55.8 MB. (I'm getting these statistics from the Parquet schema metadata functions in DuckDB, but I've validated that the difference is real from the total file size.)
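For concreteness, the override looks roughly like this (a sketch, with `id` and `input.csv` standing in for the real column and file names); the size figures can also be cross-checked with pyarrow's own metadata reader instead of DuckDB:

```python
import pyarrow as pa
from pyarrow import csv, parquet

# Force the column to a 32-bit type; "id" is a placeholder for the real column name.
convert_options = csv.ConvertOptions(
    strings_can_be_null=True,
    column_types={"id": pa.int32()},
)
table = csv.read_csv("input.csv", convert_options=convert_options)
parquet.write_table(table, "output_int32.parquet", compression="zstd")

# Sum compressed/uncompressed byte counts across all row groups and columns.
meta = parquet.read_metadata("output_int32.parquet")
compressed = uncompressed = 0
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        compressed += chunk.total_compressed_size
        uncompressed += chunk.total_uncompressed_size
print(compressed, uncompressed)
```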
This degradation persists across a variety of changes to settings and environments (a rough reproduction sketch follows the list):
- pyarrow 11.0.0 and 12.0.0
- macOS 13.4 and Ubuntu 20.04
- ZSTD and GZIP compression (GZIP performs better than ZSTD but the degradation is still there)
- explicitly expanding row groups/write batches to 1GB
- Parquet format versions 1.0, 2.4, and 2.6; data page versions 1.0 and 2.0
- Dictionaries enabled/disabled
- Dictionary page sizes of 1KB, 1MB, 1GB
- Sorting the table by the column versus leaving it randomly ordered
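Roughly, the sweep looks like the following (a simplified sketch along the lines of what I ran, not the exact script; the dictionary-page-size and sort-order variations are left out):

```python
import itertools
import os
import pyarrow.parquet as parquet

def sweep_writer_settings(table, out_dir: str) -> None:
    """Write the same table under several writer configurations and print file sizes."""
    configs = itertools.product(
        ["zstd", "gzip"],          # compression codecs
        ["1.0", "2.4", "2.6"],     # Parquet format versions
        ["1.0", "2.0"],            # data page versions
        [True, False],             # dictionary encoding on/off
    )
    for codec, version, page_version, use_dict in configs:
        path = os.path.join(
            out_dir, f"{codec}_v{version}_dp{page_version}_dict{use_dict}.parquet"
        )
        parquet.write_table(
            table,
            path,
            compression=codec,
            version=version,
            data_page_version=page_version,
            use_dictionary=use_dict,
            row_group_size=len(table),  # a single large row group
        )
        print(path, os.path.getsize(path), "bytes")
```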
Component(s)
Parquet, Python
By the way, with DELTA_BINARY_PACKED, I guess INT64 is better because of my patch in 12.0: https://github.com/apache/arrow/pull/34632
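For anyone who wants to compare encodings directly, the delta encoding can be requested per column when writing (a sketch; `id` is a placeholder column name, and `column_encoding` can only be used with dictionary encoding disabled):

```python
import pyarrow as pa
import pyarrow.parquet as parquet

# A small stand-in table; in the report the column holds ~17M unique integers.
table = pa.table({"id": pa.array(range(1_000_000), type=pa.int32())})

# Request DELTA_BINARY_PACKED for the column; dictionary encoding must be
# turned off for column_encoding to take effect.
parquet.write_table(
    table,
    "delta_packed.parquet",
    compression="zstd",
    use_dictionary=False,
    column_encoding={"id": "DELTA_BINARY_PACKED"},
)
```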