duckdb: Slow conversion when loading from CSV and saving as Parquet

What happens?

I am loading several large datasets (20GB+ of CSV data). I would like to export them to Parquet format for subsequent processing.

When I load the data, the CSV imports into a local on-disk database in something like 30 seconds, followed by another two minutes or so to finish loading and do some internal indexing. When I then export that data to Parquet, it takes upwards of 3 hours to write the file.

I am on an Apple Silicon M1 iMac with 16GB of RAM and 1TB of SSD storage.

Is there a better way to do this than the steps I am following below?

To Reproduce

From the shell

rm -f edw.duck # remove any prior database file
duckdb_cli edw.duck # create new db and start the console

From the duckdb console

create or replace table ccaed182 as select * from read_csv_auto('ccaed182.csv.gz', all_varchar=true, header=true); -- load the gzipped CSV into a table, keeping every column as VARCHAR
copy ccaed182 to 'ccaed182.parquet' (format parquet); -- export the table to a Parquet file

Approximate times

~ 3 minutes - to load the CSV file from disk
~ 3 hours - to save the Parquet file to disk
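
For comparison (a sketch only, anticipating the single-pass approach the maintainers demonstrate in the comments below), the intermediate table can be skipped entirely by copying straight from the CSV reader to Parquet:

copy (select * from read_csv_auto('ccaed182.csv.gz', all_varchar=true, header=true)) to 'ccaed182.parquet' (format parquet);

This avoids materializing the 20GB+ of data in the database before writing it back out.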

OS:

macOS Ventura 13.2.1

DuckDB Version:

0.7.1

DuckDB Client:

duckdb_cli

Full Name:

Steve Shreeve

Affiliation:

Independent developer

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

About this issue

  • State: closed
  • Created a year ago
  • Comments: 45 (32 by maintainers)

Most upvoted comments

Thanks again for the investigative work and sharing the files - this has been incredibly helpful! The high memory usage leading to the OOM killer should be fixed by #7253.

Just to close the loop on this… here are the final numbers. I would call this a slam dunk by @pdet and @Mytherin:

stats

# compressed source file on disk
time ~/bin/duckdb <<<"set preserve_insertion_order=true ; copy (select * from read_csv_auto('ccaed182.csv.gz', all_varchar=true)) to 'ccaed182-26.parquet' (format 'parquet', codec 'zstd');" # 4.14 GB  794.47s user  35.77s system  99% cpu 13:52.37 total
time ~/bin/duckdb <<<"set preserve_insertion_order=true ; copy                              'ccaed182.csv.gz'                     to 'ccaed182-22.parquet' (format 'parquet', codec 'zstd');" # 4.79 GB  632.53s user  29.95s system  99% cpu 11:03.74 total
time ~/bin/duckdb <<<"set preserve_insertion_order=false; copy (select * from read_csv_auto('ccaed182.csv.gz', all_varchar=true)) to 'ccaed182-24.parquet' (format 'parquet', codec 'zstd');" # 4.22 GB 1145.16s user  26.05s system 576% cpu  3:23.13 total
time ~/bin/duckdb <<<"set preserve_insertion_order=false; copy                              'ccaed182.csv.gz'                     to 'ccaed182-20.parquet' (format 'parquet', codec 'zstd');" # 4.82 GB  899.18s user  26.96s system 522% cpu  2:57.35 total

For reference, this is on-the-fly decompression of gzipped CSV files, reading them with a parallelized CSV reader, and then converting the output to compressed Parquet files (which we will save directly to S3)… all in ONE pass! Amazing.
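
A minimal sketch of that last step (not part of the commands above), assuming a DuckDB build with S3 write support via the httpfs extension, and with a placeholder bucket name, region, and credentials:

install httpfs;                      -- one-time install of the HTTP/S3 filesystem extension
load httpfs;
set s3_region='us-east-1';           -- placeholder region
set s3_access_key_id='KEY_ID';       -- placeholder credentials
set s3_secret_access_key='SECRET';
set preserve_insertion_order=false;  -- allow out-of-order parallel writes, as in the timings above
copy (select * from read_csv_auto('ccaed182.csv.gz', all_varchar=true)) to 's3://my-bucket/ccaed182.parquet' (format 'parquet', codec 'zstd');

Setting preserve_insertion_order=false is what lets the writer run in parallel; in the timings above it cuts the wall-clock time from roughly 11-14 minutes to about 3.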

An absolutely astonishing improvement!

v0.7.2-dev1322 4161f39ca1
Enter ".help" for usage hints.
D set preserve_insertion_order=false;
D .timer on
D create or replace table ccaed182 as select * from read_csv_auto('ccaed182.csv.gz', all_varchar=true, header=true);
100% ▕████████████████████████████████████████████████████████████▏
Run Time (s): real 962.424 user 952.173166 sys 7.611139
$ ls -l
-r--r--r--@ 1 shreeve  staff  6506608484 Jan 17 08:21 ccaed182.csv.gz
-rw-r--r--@ 1 shreeve  staff  4167902044 Apr  2 11:29 ccaed182.parquet
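
As a quick sanity check (not in the original transcript), the resulting Parquet file can be queried in place from the same console:

select count(*) from 'ccaed182.parquet';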

duckdb, ftw!