duckdb: FSST string compression failed due to incorrect size calculation

What happens?

When trying to create a table like this:

CREATE TABLE xxx AS
SELECT tbl.*, '12345' AS dedup_group
FROM read_parquet('path/glob/*.snappy.parquet') AS tbl;

I get the following error after a few dozen seconds:

InternalException: INTERNAL Error: FSST string compression failed due to incorrect size calculation

To Reproduce

I’m guessing it’s somehow dependent on the Parquet file I’m trying to load, but sadly I can’t share the data for privacy reasons. I’m happy to try to generate artificial data that exhibits the same problem, but I need some help brainstorming what the issue might be so that I don’t waste time trying every possible combination.

The Parquet files are about 80 MB each and were generated by Spark (Scala).

Unfortunately, that’s the only extra information I have available at the moment, but again, I’m happy to keep digging.
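As a first stab at synthetic data, something along these lines might be worth trying: round-trip random-looking, variable-length strings through a Snappy Parquet file and re-run the failing query on it. This is purely a sketch — the table name, file name, and string shape are made up, and the real trigger probably depends on the actual string distribution in the private data:

CREATE TABLE synthetic AS
SELECT md5(i::VARCHAR) || repeat('x', (i % 50)::INT) AS payload  -- random-looking strings of varying length
FROM range(1000000) t(i);

COPY synthetic TO 'synthetic.snappy.parquet' (FORMAT PARQUET);

CREATE TABLE xxx AS
SELECT tbl.*, '12345' AS dedup_group
FROM read_parquet('synthetic.snappy.parquet') AS tbl;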

OS:

macOS

DuckDB Version:

0.6.1, 0.6.2dev447

DuckDB Client:

Python

Full Name:

Rik Nauta

Affiliation:

LMU AB

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 39 (12 by maintainers)

Most upvoted comments

🥇 Can confirm, that’s working. It also completed a lot more quickly?! I’ll try loading the other data that was giving me the UTF-8 invalid asserts to see if that’s fixed as well.

@RXminuS sent you an email, let’s move this discussion there to reduce the noise a bit 😃

Oh derp! I cloned your repo but forgot to switch branches 🤦 I’ll rebuild now.

@RXminuS ah, could you run it with the DuckDB CLI instead? That would be:

./build/debug/duckdb <some path to where the db will be created>

Perhaps try running pip uninstall duckdb multiple times until it returns WARNING: Skipping duckdb as it is not installed., and then building from source? pip tends to keep multiple versions of the package around, which can lead to the wrong version being used by accident.

@RXminuS I haven’t managed to reproduce this with a bunch of random data, so I would propose the following:

I made a branch at https://github.com/samansmink/duckdb/tree/instrumented-fsst-compression where I added a bunch of print statements and some extra checks on the relevant variables. Could you rerun your query on the offending column and send me the output? If you have any questions, feel free to reach out through the DuckDB Discord or to me directly: ‘Sam Ansmink#3611’

If that still fails, I think a pair-debugging session would be our best bet to catch this.

Yeah, I know which top-level column it is; however, it’s a nested struct, so there are a bunch of differently sized arrays and other things in there. I can try selecting only sub-columns and see if I can narrow it down further (a sketch of what I have in mind is below).
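Something like this could probe one struct sub-field at a time — nested_col and some_field here are placeholders for the real column and field names:

CREATE TABLE probe AS
SELECT tbl.nested_col.some_field AS f  -- extract a single struct sub-field
FROM read_parquet('path/glob/*.snappy.parquet') AS tbl;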

Just FYI, I’m still investigating; I’ve just been a bit busy around the holidays.

Thanks for the report, but can you please try to make it reproducible by creating a dataset you can share?

Absolutely! I felt really bad opening the issue with so little information, but I also hoped that having at least the error message up here might help other people who are unknowingly experiencing the same thing and Googling for it. The message first presented itself through SQLAlchemy / ibis, and I’ve had issues with Parquet in the past as well. So, since the error message is very generic, it took me a while to trace it back to DuckDB rather than one of the other components involved.

But I’ll do my best to isolate the data point per the suggestions in this thread, and I really appreciate any help homing in on the issue, and the patience to get there.

P.S. I’d be remiss if I didn’t at least give a massive shoutout to DuckDB… it’s quacking awesome! ❤️