duckdb: Slow conversion when loading from CSV and saving as Parquet
What happens?
I am loading several large datasets (20GB+ of CSV data). I would like to export them to Parquet format for subsequent processing.
When I load the data, the CSV is read into a local on-disk database in roughly 30 seconds, followed by about another two minutes to finish loading or do some internal indexing. When I then export the data to a Parquet file, it takes upwards of 3 hours to write the file.
I am on an Apple Silicon M1 iMac with 16GB of RAM and 1TB of SSD storage.
Is there a better way to do this than the steps I am following below?
To Reproduce
From the shell
rm -f edw.duck # remove any prior database file
duckdb_cli edw.duck # create new db and start the console
From the duckdb console
create or replace table ccaed182 as select * from read_csv_auto('ccaed182.csv.gz', all_varchar=true, header=true);
copy ccaed182 to 'ccaed182.parquet' (format parquet);
Approximate times
~ 3 minutes - to load the CSV file from disk
~ 3 hours - to save the Parquet file to disk
OS:
macOS Ventura 13.2.1
DuckDB Version:
0.7.1
DuckDB Client:
duckdb_cli
Full Name:
Steve Shreeve
Affiliation:
Independent developer
Have you tried this on the latest master branch?
- I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- I agree
About this issue
- State: closed
- Created a year ago
- Comments: 45 (32 by maintainers)
Thanks again for the investigative work and sharing the files - this has been incredibly helpful! The high memory usage leading to the OOM killer should be fixed by #7253.
Just to close the loop on this… here are the final numbers. I would call this a slam dunk by @pdet and @Mytherin:
For reference, this is on-the-fly decompression of CSV files, reading them with a parallelized CSV reader, and then converting the output to compressed Parquet files (which will save directly to S3)… all in ONE pass! Amazing.
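For anyone landing here later, a minimal sketch of what a one-pass conversion like this could look like from the duckdb console. It reuses the file names and reader options from the original report and skips the intermediate table entirely. This is an assumption about the shape of the statement, not necessarily the exact one used for the final numbers, and it writes to a local file rather than S3:

```sql
-- Sketch only: read the gzipped CSV and write Parquet in a single statement,
-- without first materializing a table in the on-disk database.
-- File names and options are taken from the original report above.
COPY (
    SELECT *
    FROM read_csv_auto('ccaed182.csv.gz', all_varchar=true, header=true)
) TO 'ccaed182.parquet' (FORMAT PARQUET);
```

Because the query feeds the Parquet writer directly, the data never has to be loaded into edw.duck first, so the separate load step from the original reproduction goes away.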
An absolutely astonishing improvement!
duckdb, ftw!