cudf: [BUG] Compressing a table with large strings using ZSTD can result in little or no compression

Describe the bug An internal Spark customer tried zstd compression on the GPU with a 22.12 snapshot release and reported that they were getting no compression, while on the CPU they were getting very good compression. Using the first 100,000 rows of one of their tables, I got:

255M	2022-01-02-cpu-none
18M	2022-01-02-cpu-zstd
255M	2022-01-02-gpu-zstd

I was also able to reproduce it with 100 rows, and with parquet-tools I could see that most of the columns were uncompressed in the GPU version, in particular this one:

                ColumnChunk
                    meta_data = ColumnMetaData
                        type = 6
                        encodings = list
                            0
                            3
                        path_in_schema = list
                            data
                        num_values = 100
                        total_uncompressed_size = 250217
                        total_compressed_size = 250217

That data column contained strings of variable length, all around 2500 characters long. Each string was a JSON structure with the same set of fields but differing values, so there were a lot of common characters. The problem is that these columns were going over the 64KB limit for zstd (at roughly 2500 characters per string, a couple dozen rows is already enough to exceed it), so the parquet writer was falling back to writing the data uncompressed.

Snappy does not appear to have this problem.
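
For reference, the per-column check above can also be done without parquet-tools. A minimal sketch using pyarrow (the file name here is just a placeholder):

    import pyarrow.parquet as pq

    # Print the compression codec and sizes for every column chunk,
    # similar to the parquet-tools dump above.
    md = pq.ParquetFile("gpu-zstd-output.parquet").metadata
    for rg in range(md.num_row_groups):
        for ci in range(md.num_columns):
            col = md.row_group(rg).column(ci)
            print(col.path_in_schema, col.compression,
                  col.total_uncompressed_size, col.total_compressed_size)

A column that fell back to uncompressed shows equal total_uncompressed_size and total_compressed_size, as in the dump above.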

Steps/Code to reproduce bug I was able to reproduce this by generating a table with 32 rows of strings, where each string consisted of a random 64-character prefix followed by an 8-character string repeated 248 times. I will attach a parquet file that reproduces the problem when you read it in and then write it out with zstd compression (a rough sketch of the data generation is included after the size listings below). These were the results I got with 32 rows:

80K	test-data-32-cpu-none
16K	test-data-32-cpu-zstd
84K	test-data-32-gpu-zstd

And this is what I got with 31 rows (which keeps it under the 64KB limit):

76K	test-data-31-cpu-none
16K	test-data-31-cpu-zstd
20K	test-data-31-gpu-zstd
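
For anyone who wants to regenerate similar data rather than use the attached file, here is a rough sketch using the cuDF Python API (the original repro went through Spark; the exact repeated token, the file name, and the availability of ZSTD writing in your cuDF build are assumptions):

    import random
    import string
    import cudf

    def make_string():
        # 64 random characters followed by an 8-character token repeated
        # 248 times: 64 + 8 * 248 = 2048 characters (~2 KB) per row.
        prefix = "".join(random.choices(string.ascii_letters + string.digits, k=64))
        return prefix + "ABCD1234" * 248

    # 32 rows * ~2 KB pushes the column's page data past the 64KB zstd
    # limit, while 31 rows stays just under it.
    df = cudf.DataFrame({"data": [make_string() for _ in range(32)]})
    df.to_parquet("test-data-32-gpu-zstd.parquet", compression="ZSTD")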

Expected behavior When you compress a file with zstd on the GPU, it should provide some compression, ideally comparable to what the CPU achieves.

Environment overview I tested this with Spark using a snapshot of the spark-rapids plugin running on a 22.12 cuDF snapshot.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 18 (9 by maintainers)

Most upvoted comments

I generated a similar sample with 10000 rows. Sizes are:

 20M	cpu-10000-data.parquet
504K	cpu-10000-data.zstd.parquet
 20M	gpu-10000-data.zstd.parquet

repro-data-10000-rows.tgz

The idea is to derive the fragment size per column based on the column's data size. But it will take a bit to implement 😃
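
A hypothetical sketch of that sizing arithmetic (not the actual cuDF implementation; the default fragment size below is a placeholder, and only the 64KB limit comes from this issue):

    # Choose how many rows go into a page fragment for a column so that
    # a fragment's data stays under the compressor's chunk limit.
    CHUNK_LIMIT_BYTES = 64 * 1024     # zstd limit discussed above
    DEFAULT_FRAGMENT_ROWS = 5000      # placeholder default

    def fragment_rows_for_column(column_bytes: int, num_rows: int) -> int:
        if num_rows == 0 or column_bytes == 0:
            return DEFAULT_FRAGMENT_ROWS
        avg_row_bytes = max(1, column_bytes // num_rows)
        rows_per_chunk = max(1, CHUNK_LIMIT_BYTES // avg_row_bytes)
        return min(DEFAULT_FRAGMENT_ROWS, rows_per_chunk)

The key point is that columns of wide strings get far fewer rows per fragment than narrow columns, so each fragment stays small enough to compress.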

I have tested this with #12211, and it did not resolve the issue. In the customer's case, the files are much larger, and the resulting row groups are much bigger as well.