duckdb: Slow conversion when loading from CSV and saving as Parquet

What happens?

I am loading several large datasets (20GB+ of CSV data). I would like to export them to Parquet format for subsequent processing.

When I load the data, the CSV imports into a local on-disk database in something like 30 seconds, followed by another two minutes or so to finish loading and do some internal indexing. When I then export that data to Parquet, it takes upwards of 3 hours to write the file.

I am on an Apple Silicon M1 iMac with 16GB of RAM and 1TB of SSD storage.

Is there a better way to do this than the steps I am following below?

To Reproduce

From the shell

rm -f edw.duck # remove any prior database file
duckdb_cli edw.duck # create new db and start the console

From the duckdb console

create or replace table ccaed182 as select * from read_csv_auto('ccaed182.csv.gz', all_varchar=true, header=true); -- load the gzipped CSV into a table, keeping every column as VARCHAR
copy ccaed182 to 'ccaed182.parquet' (format parquet); -- export the table to a Parquet file

Approximate times

~ 3 minutes - to load the CSV file from disk
~ 3 hours - to save the Parquet file to disk
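
For comparison (a sketch only, anticipating the single-pass approach the maintainers demonstrate in the comments below), the intermediate table can be skipped entirely by copying straight from the CSV reader to Parquet:

copy (select * from read_csv_auto('ccaed182.csv.gz', all_varchar=true, header=true)) to 'ccaed182.parquet' (format parquet);

This avoids materializing the 20GB+ of data in the database before writing it back out.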

OS:

macOS Ventura 13.2.1

DuckDB Version:

0.7.1

DuckDB Client:

duckdb_cli

Full Name:

Steve Shreeve

Affiliation:

Independent developer

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

About this issue

  • State: closed
  • Created a year ago
  • Comments: 45 (32 by maintainers)

Most upvoted comments

Thanks again for the investigative work and sharing the files - this has been incredibly helpful! The high memory usage leading to the OOM killer should be fixed by #7253.

Just to close the loop on this… here are the final numbers. I would call this a slam dunk by @pdet and @Mytherin:

stats

# compressed source file on disk
time ~/bin/duckdb <<<"set preserve_insertion_order=true ; copy (select * from read_csv_auto('ccaed182.csv.gz', all_varchar=true)) to 'ccaed182-26.parquet' (format 'parquet', codec 'zstd');" # 4.14 GB  794.47s user  35.77s system  99% cpu 13:52.37 total
time ~/bin/duckdb <<<"set preserve_insertion_order=true ; copy                              'ccaed182.csv.gz'                     to 'ccaed182-22.parquet' (format 'parquet', codec 'zstd');" # 4.79 GB  632.53s user  29.95s system  99% cpu 11:03.74 total
time ~/bin/duckdb <<<"set preserve_insertion_order=false; copy (select * from read_csv_auto('ccaed182.csv.gz', all_varchar=true)) to 'ccaed182-24.parquet' (format 'parquet', codec 'zstd');" # 4.22 GB 1145.16s user  26.05s system 576% cpu  3:23.13 total
time ~/bin/duckdb <<<"set preserve_insertion_order=false; copy                              'ccaed182.csv.gz'                     to 'ccaed182-20.parquet' (format 'parquet', codec 'zstd');" # 4.82 GB  899.18s user  26.96s system 522% cpu  2:57.35 total

For reference, this is on-the-fly decompression of gzipped CSV files, reading them with a parallelized CSV reader, and then converting the output to compressed Parquet files (which we will save directly to S3)… all in ONE pass! Amazing.
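
A minimal sketch of that last step (not part of the commands above), assuming a DuckDB build with S3 write support via the httpfs extension, and with a placeholder bucket name, region, and credentials:

install httpfs;                      -- one-time install of the HTTP/S3 filesystem extension
load httpfs;
set s3_region='us-east-1';           -- placeholder region
set s3_access_key_id='KEY_ID';       -- placeholder credentials
set s3_secret_access_key='SECRET';
set preserve_insertion_order=false;  -- allow out-of-order parallel writes, as in the timings above
copy (select * from read_csv_auto('ccaed182.csv.gz', all_varchar=true)) to 's3://my-bucket/ccaed182.parquet' (format 'parquet', codec 'zstd');

Setting preserve_insertion_order=false is what lets the writer run in parallel; in the timings above it cuts the wall-clock time from roughly 11-14 minutes to about 3.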

An absolutely astonishing improvement!

v0.7.2-dev1322 4161f39ca1
Enter ".help" for usage hints.
D set preserve_insertion_order=false;
D .timer on
D create or replace table ccaed182 as select * from read_csv_auto('ccaed182.csv.gz', all_varchar=true, header=true);
100% ▕████████████████████████████████████████████████████████████▏
Run Time (s): real 962.424 user 952.173166 sys 7.611139
$ ls -l
-r--r--r--@ 1 shreeve  staff  6506608484 Jan 17 08:21 ccaed182.csv.gz
-rw-r--r--@ 1 shreeve  staff  4167902044 Apr  2 11:29 ccaed182.parquet
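
As a quick sanity check (not in the original transcript), the resulting Parquet file can be queried in place from the same console:

select count(*) from 'ccaed182.parquet';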

duckdb, ftw!