arrow: [Python] [Parquet] Compression degradation when column type changed from INT64 to INT32
Describe the bug, including details regarding any error messages, version, and platform.
Within a CSV of ~17M rows, I have a column of unique integers that are fairly uniformly distributed between 0 and 200,000,000. I am reading the CSV as follows:
```python
from pyarrow import csv, parquet

def file_to_data_frame_to_parquet(local_file: str, parquet_file: str) -> None:
    table = csv.read_csv(
        local_file,
        convert_options=csv.ConvertOptions(strings_can_be_null=True),
    )
    parquet.write_table(table, parquet_file, compression='zstd')
```
When I read the column without any type specification, the uncompressed size is 133.1 MB and the compressed size is 18.0 MB.
When I add an explicit type mapping of either `uint32` or `int32` for that column in the `read_csv` step, the total uncompressed size shrinks to 67.0 MB, but the compressed size expands to 55.8 MB. (I'm getting these statistics from the Parquet schema metadata functions in DuckDB, but I've validated that the difference is real from the total file size.)
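For concreteness, the override looks roughly like this (a sketch, with `id` and `input.csv` standing in for the real column and file names); the size figures can also be cross-checked with pyarrow's own metadata reader instead of DuckDB:

```python
import pyarrow as pa
from pyarrow import csv, parquet

# Force the column to a 32-bit type; "id" is a placeholder for the real column name.
convert_options = csv.ConvertOptions(
    strings_can_be_null=True,
    column_types={"id": pa.int32()},
)
table = csv.read_csv("input.csv", convert_options=convert_options)
parquet.write_table(table, "output_int32.parquet", compression="zstd")

# Sum compressed/uncompressed byte counts across all row groups and columns.
meta = parquet.read_metadata("output_int32.parquet")
compressed = uncompressed = 0
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        compressed += chunk.total_compressed_size
        uncompressed += chunk.total_uncompressed_size
print(compressed, uncompressed)
```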
This degradation persists across a variety of changes to settings and environments (a rough reproduction sketch follows the list):
- pyarrow 11.0.0 and 12.0.0
- macOS 13.4 and Ubuntu 20.04
- ZSTD and GZIP compression (GZIP performs better than ZSTD but the degradation is still there)
- explicitly expanding row groups/write batches to 1GB
- Parquet format versions 1.0, 2.4, and 2.6; data page versions 1.0 and 2.0
- Dictionaries enabled/disabled
- Dictionary page sizes of 1KB, 1MB, 1GB
- Sorting the table by the column versus leaving it randomly ordered
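Roughly, the sweep looks like the following (a simplified sketch along the lines of what I ran, not the exact script; the dictionary-page-size and sort-order variations are left out):

```python
import itertools
import os
import pyarrow.parquet as parquet

def sweep_writer_settings(table, out_dir: str) -> None:
    """Write the same table under several writer configurations and print file sizes."""
    configs = itertools.product(
        ["zstd", "gzip"],          # compression codecs
        ["1.0", "2.4", "2.6"],     # Parquet format versions
        ["1.0", "2.0"],            # data page versions
        [True, False],             # dictionary encoding on/off
    )
    for codec, version, page_version, use_dict in configs:
        path = os.path.join(
            out_dir, f"{codec}_v{version}_dp{page_version}_dict{use_dict}.parquet"
        )
        parquet.write_table(
            table,
            path,
            compression=codec,
            version=version,
            data_page_version=page_version,
            use_dictionary=use_dict,
            row_group_size=len(table),  # a single large row group
        )
        print(path, os.path.getsize(path), "bytes")
```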
Component(s)
Parquet, Python
By the way, with DELTA_BINARY_PACKED, I guess INT64 is better because of my patch in 12.0: https://github.com/apache/arrow/pull/34632
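For anyone who wants to compare encodings directly, the delta encoding can be requested per column when writing (a sketch; `id` is a placeholder column name, and `column_encoding` can only be used with dictionary encoding disabled):

```python
import pyarrow as pa
import pyarrow.parquet as parquet

# A small stand-in table; in the report the column holds ~17M unique integers.
table = pa.table({"id": pa.array(range(1_000_000), type=pa.int32())})

# Request DELTA_BINARY_PACKED for the column; dictionary encoding must be
# turned off for column_encoding to take effect.
parquet.write_table(
    table,
    "delta_packed.parquet",
    compression="zstd",
    use_dictionary=False,
    column_encoding={"id": "DELTA_BINARY_PACKED"},
)
```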