cudf: [BUG] Compressing a table with large strings using ZSTD can result in little or no compression
Describe the bug
A Spark internal customer tried using ZSTD compression on the GPU in a 22.12 snapshot release and reported that they were getting no compression, while on the CPU they were getting very good compression. Using the first 100,000 rows of one of their tables, I got:
255M 2022-01-02-cpu-none
18M 2022-01-02-cpu-zstd
255M 2022-01-02-gpu-zstd
I was also able to reproduce this with 100 rows, and with parquet-tools I could see that most of the columns were uncompressed in the GPU version, in particular this one:
ColumnChunk
  meta_data = ColumnMetaData
    type = 6
    encodings = list
      0
      3
    path_in_schema = list
      data
    num_values = 100
    total_uncompressed_size = 250217
    total_compressed_size = 250217
That data column contained variable-length strings, each around 2,500 characters long. Each string was a JSON structure with the same set of fields but differing values, so there were a lot of common characters. The problem is that this column's data was going over the 64KB limit for ZSTD, so the Parquet writer was falling back to uncompressed.
Snappy does not appear to have this problem.
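For reference, the same per-column check can also be done programmatically. This is just a sketch using pyarrow's metadata API (equivalent to the parquet-tools output above); the file name is a placeholder for whichever file the GPU writer produced:

```python
# Sketch: inspect per-column compression of a written Parquet file with pyarrow.
# "output.parquet" is a placeholder file name.
import pyarrow.parquet as pq

meta = pq.ParquetFile("output.parquet").metadata
for rg in range(meta.num_row_groups):
    for c in range(meta.num_columns):
        col = meta.row_group(rg).column(c)
        print(col.path_in_schema, col.compression,
              col.total_compressed_size, col.total_uncompressed_size)
```

A column that fell back to uncompressed shows total_compressed_size equal to total_uncompressed_size, as in the dump above.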
Steps/Code to reproduce bug
I was able to reproduce this by generating a table with 32 rows of strings, where each string consisted of a random 64-character string followed by an 8-character string repeated 248 times (64 + 8 × 248 = 2,048 characters per row, so 32 rows add up to exactly 64KB). I will attach a Parquet file that reproduces the problem if you read it in and then write it out with ZSTD compression; a sketch of the generation step is included after the listings below. These were the results I got with 32 rows:
80K test-data-32-cpu-none
16K test-data-32-cpu-zstd
84K test-data-32-gpu-zstd
And this is what I got with 31 rows (which keeps it under the 64KB limit):
76K test-data-31-cpu-none
16K test-data-31-cpu-zstd
20K test-data-31-gpu-zstd
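Here is a minimal sketch of how such data can be generated and written with ZSTD through the cuDF Python API. It assumes a cuDF build whose Parquet writer accepts compression="ZSTD"; the repeated token and output file name are illustrative, and the actual repro went through the Spark plugin:

```python
# Sketch of the repro data described above, written with the cuDF Python API.
import random
import string

import cudf

def make_row() -> str:
    # 64 random characters followed by an 8-character token repeated 248 times:
    # 64 + 8 * 248 = 2048 characters per row.
    prefix = "".join(random.choices(string.ascii_letters + string.digits, k=64))
    return prefix + "ABCDEFGH" * 248

# 32 rows * 2048 characters = 64KB of string data, which trips the limit;
# 31 rows stays under it.
df = cudf.DataFrame({"data": [make_row() for _ in range(32)]})
df.to_parquet("test-data-32-gpu-zstd.parquet", compression="ZSTD")
```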
Expected behavior
When you compress a file with ZSTD using the GPU, it should provide some compression, ideally comparable to the CPU.
Environment overview (please complete the following information)
I tested this with Spark using a snapshot of the spark-rapids plugin running on a 22.12 cuDF snapshot.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 18 (9 by maintainers)
I generated a similar sample with 10000 rows. Sizes are:
repro-data-10000-rows.tgz
The idea is to derive the fragment size per column based on that column's data size. But it will take a bit to implement 😃
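To illustrate the idea only (this is not the actual cuDF implementation), here is a rough sketch of deriving a per-column fragment size from the column's data size; the default fragment size and the helper name are hypothetical, and the 64KB figure is the ZSTD limit discussed above:

```python
# Rough sketch of the idea, not the cuDF implementation.
ZSTD_MAX_UNCOMP_CHUNK = 64 * 1024   # uncompressed bytes per compressible chunk
DEFAULT_FRAGMENT_SIZE = 5000        # rows per fragment (hypothetical default)

def fragment_size_for_column(column_bytes: int, num_rows: int) -> int:
    """Shrink the fragment size for wide columns so an average-sized
    fragment stays under the ZSTD chunk limit."""
    if num_rows == 0 or column_bytes == 0:
        return DEFAULT_FRAGMENT_SIZE
    avg_row_bytes = max(1, column_bytes // num_rows)
    rows_per_chunk = max(1, ZSTD_MAX_UNCOMP_CHUNK // avg_row_bytes)
    return min(DEFAULT_FRAGMENT_SIZE, rows_per_chunk)
```

With the roughly 2KB rows from the repro above, this would cap the fragment at about 32 rows instead of the default, keeping each compressed chunk within the ZSTD size limit.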
I have tested this with #12211, and it did not resolve it. In the customer case, the files are much larger, and the resulting row groups are much bigger as well.