duckdb: Too many open files error using temp directory on large CSV

What happens?

I have a Mac M1 Max with 32 GB of RAM.
I wrote the SQL below, which tries to do the following for each of 93 CSV files:

  1. create a temp table with the correct column names and types;
  2. copy a single CSV into that temp table, using the correct options; and
  3. copy the loaded table back out to a corresponding Parquet file.

The CSV files have the following characteristics:

  • 93 files totalling 750 GB uncompressed;
  • the smallest file is 125 MB; the largest file is 40 GB; and
  • about 20 files are larger than 10 GB.

For each CSV file, I run the following SQL code using the DuckDB CLI.

PRAGMA memory_limit = '24GB';
PRAGMA temp_directory = 'data/temp/';
  
CREATE TEMP TABLE t1 (reference_pool_id VARCHAR, loan_identifier VARCHAR, monthly_reporting_period DATE, <more columns>);
COPY t1 FROM 'data/stage/2000Q1.csv' (
  HEADER FALSE, DATEFORMAT '%m%Y', DELIM '|', SAMPLE_SIZE 500000);
COPY t1 TO 'data/parquet/2000Q1.parquet' (FORMAT PARQUET);
DROP TABLE t1;

I’m able to run this code successfully for 91 of the 93 CSV files. However, for the two largest files (33 GB and 39 GB), I get the following type of error:

Error: near line 83: IO Error: Cannot open file "data/temp//duckdb_temp_storage-252.tmp": Too many open files

So it appears that loading the largest CSVs via the COPY statement opens too many temp files at once.

I also noticed some other interesting behavior with the temp files. During the run, the temp directory grows to about 300 GB and 252 files before DuckDB crashes. I have no idea why a 30 GB CSV file generates over 300 GB of temp files, let alone a crash.

To Reproduce

Unfortunately, the data is proprietary. I don’t have time to figure out how to simulate a similarly sized CSV using faker or the like; my apologies. I just thought that I should report the “too many temp files” error, since most current use cases may not involve such large files. Thanks for all the hard work on DuckDB!
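
That said, for anyone who wants to approximate the shape of the workload without the proprietary data, here is a rough sketch that uses DuckDB itself to generate a headerless, pipe-delimited CSV. The column names mirror the snippet above, but the values, the row count, and the output path are entirely made up; scale the range() argument to control the file size.

COPY (
  SELECT
    -- made-up values; only the column shapes roughly match the real data
    CAST(i % 1000 AS VARCHAR) AS reference_pool_id,
    lpad(CAST(i AS VARCHAR), 12, '0') AS loan_identifier,
    -- emit dates in the same '%m%Y' layout the DATEFORMAT option expects
    strftime(DATE '2000-01-01' + to_months(CAST(i % 240 AS INTEGER)), '%m%Y') AS monthly_reporting_period
  FROM range(100000000) t(i)
) TO 'data/stage/synthetic.csv' (FORMAT CSV, HEADER FALSE, DELIMITER '|');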

OS:

macOS Ventura 13.4.1

DuckDB Version:

0.8.1

DuckDB Client:

DuckDB CLI

Full Name:

Kyle Lundstedt

Affiliation:

IndustryVault

Have you tried this on the latest main branch?

I have tested with a main build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have

About this issue

  • State: closed
  • Created 10 months ago
  • Comments: 34 (4 by maintainers)

Most upvoted comments

Wow! I re-tried the “single step” code, changing the syntax per @l1t1’s suggestion. Given I’m using the COLUMNS option, I changed the function from read_csv_auto to read_csv. Here’s the code:

copy (
  select * from read_csv('data/stage/*.csv',
    HEADER = FALSE, DATEFORMAT = '%m%Y', DELIM = '|',
    COLUMNS = {'reference_pool_id': 'VARCHAR', <more columns>} )
) to 'data/public/performance' (
    FORMAT PARQUET, PER_THREAD_OUTPUT TRUE);

Summarizing the results:

  • I’m using Mac M1 Max (10 cores) with 32 GB of RAM, running DuckDB 0.9.1.
  • I’m converting 93 CSV files (750 GB uncompressed) to Parquet, writing an output file per thread.
  • My code snippet runs in 22 minutes and produces a 43 GB Parquet folder (10 files, 1 per thread); see the query sketch below for reading it back.
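
Since PER_THREAD_OUTPUT writes a folder of Parquet files rather than a single file, downstream queries can simply glob the directory. A quick sanity check over the output above might look like this (the path is the COPY target from the snippet):

SELECT count(*) FROM read_parquet('data/public/performance/*.parquet');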

Many thanks to @l1t1, @szarnyasg, @Tishj, and @arjenpdevries for their help in sorting through the out-of-memory and conversion date issues here.

Kudos to the entire DuckDB team on v0.9.1. What an amazing tool!

Good news: v0.9.1 solves the remaining out-of-memory issues. (I’m still using ALL_VARCHAR = 1 to avoid the “Conversion error” issue.)

COPY (
  SELECT *
  FROM read_csv_auto('data/stage/*.csv',
    HEADER = FALSE, DATEFORMAT = '%m%Y', DELIM = '|', ALL_VARCHAR = 1)
) TO 'data/public/performance' (
    FORMAT PARQUET, PER_THREAD_OUTPUT TRUE);

With 0.9.1, there’s no need to set either THREADS or MEMORY_LIMIT. Based on some testing, it looks like the “no THREADS or MEMORY_LIMIT” configuration is equivalent to running with THREADS = 8, MEMORY_LIMIT = '4GB'. Both versions run in 26 minutes, use about 18 GB of RAM at peak, and do not use any temp files.
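
For reference, the explicit-settings run in that comparison just set those two options up front; the values below are the ones quoted above, not documented defaults:

-- explicit settings used only for the comparison run
PRAGMA threads = 8;
PRAGMA memory_limit = '4GB';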

Increasing the MEMORY_LIMIT has no impact.

Setting PER_THREAD_OUTPUT TRUE is critical: not using that option causes the out-of-memory issue to return.
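
For comparison, the variant that brought the out-of-memory issue back differs only in the output clause; the single-file target path below is just illustrative:

COPY (
  SELECT *
  FROM read_csv_auto('data/stage/*.csv',
    HEADER = FALSE, DATEFORMAT = '%m%Y', DELIM = '|', ALL_VARCHAR = 1)
) TO 'data/public/performance.parquet' (FORMAT PARQUET);  -- no PER_THREAD_OUTPUT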