duckdb: Too many open files error using temp directory on large CSV

What happens?

I have a Mac M1 Max with 32 GB of RAM.
I wrote the SQL below, which tries to do the following for each of 93 CSV files:

  1. create a temp table with the correct column names and types;
  2. copy a single CSV into that temp table, using the correct options; and
  3. copy the loaded table back out to a corresponding Parquet file.

The CSV files have the following characteristics:

  • 93 files totalling 750 GB uncompressed;
  • the smallest file is 125 MB; the largest file is 40 GB; and
  • about 20 files are larger than 10 GB.

For each CSV file, I run the following SQL code using the DuckDB CLI.

PRAGMA memory_limit = '24GB';
PRAGMA temp_directory = 'data/temp/';
  
CREATE TEMP TABLE t1 (reference_pool_id VARCHAR, loan_identifier VARCHAR, monthly_reporting_period DATE, <more columns>);
COPY t1 FROM 'data/stage/2000Q1.csv' (
  HEADER FALSE, DATEFORMAT '%m%Y', DELIM '|', SAMPLE_SIZE 500000);
COPY t1 TO 'data/parquet/2000Q1.parquet' (FORMAT PARQUET);
DROP TABLE t1;

I’m able to run this code successfully for 91 of the 93 CSV files. However, for the two largest files (33 GB and 39 GB), I get the following type of error:

Error: near line 83: IO Error: Cannot open file "data/temp//duckdb_temp_storage-252.tmp": Too many open files

So it appears that loading the largest CSVs via the COPY statement opens too many temp files at once.

I also noticed some other interesting behavior with the temp files. During the run, the temp directory grows to about 300 GB and 252 files before DuckDB crashes. I have no idea why a 30 GB CSV file generates over 300 GB of temp files, let alone a crash.

To Reproduce

Unfortunately, the data is proprietary. I don’t have time to figure out how to simulate a similarly sized CSV using faker or the like; my apologies. I just thought that I should report the “too many temp files” error, since most current use cases may not involve such large files. Thanks for all the hard work on DuckDB!
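
That said, for anyone who wants to approximate the shape of the workload without the proprietary data, here is a rough sketch that uses DuckDB itself to generate a headerless, pipe-delimited CSV. The column names mirror the snippet above, but the values, the row count, and the output path are entirely made up; scale the range() argument to control the file size.

COPY (
  SELECT
    -- made-up values; only the column shapes roughly match the real data
    CAST(i % 1000 AS VARCHAR) AS reference_pool_id,
    lpad(CAST(i AS VARCHAR), 12, '0') AS loan_identifier,
    -- emit dates in the same '%m%Y' layout the DATEFORMAT option expects
    strftime(DATE '2000-01-01' + to_months(CAST(i % 240 AS INTEGER)), '%m%Y') AS monthly_reporting_period
  FROM range(100000000) t(i)
) TO 'data/stage/synthetic.csv' (FORMAT CSV, HEADER FALSE, DELIMITER '|');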

OS:

macOS Ventura 13.4.1

DuckDB Version:

0.8.1

DuckDB Client:

DuckDB CLI

Full Name:

Kyle Lundstedt

Affiliation:

IndustryVault

Have you tried this on the latest main branch?

I have tested with a main build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have

About this issue

  • State: closed
  • Created 10 months ago
  • Comments: 34 (4 by maintainers)

Most upvoted comments

Wow! I re-tried the “single step” code, changing the syntax per @l1t1’s suggestion. Given I’m using the COLUMNS option, I changed the function from read_csv_auto to read_csv. Here’s the code:

copy (
  select * from read_csv('data/stage/*.csv',
    HEADER = FALSE, DATEFORMAT = '%m%Y', DELIM = '|',
    COLUMNS = {'reference_pool_id': 'VARCHAR', <more columns>} )
) to 'data/public/performance' (
    FORMAT PARQUET, PER_THREAD_OUTPUT TRUE);

Summarizing the results:

  • I’m using Mac M1 Max (10 cores) with 32 GB of RAM, running DuckDB 0.9.1.
  • I’m converting 93 CSV files (750 GB uncompressed) to Parquet, writing an output file per thread.
  • My code snippet runs in 22 minutes and produces a 43 GB Parquet folder (10 files, 1 per thread); see the query sketch below for reading it back.
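
Since PER_THREAD_OUTPUT writes a folder of Parquet files rather than a single file, downstream queries can simply glob the directory. A quick sanity check over the output above might look like this (the path is the COPY target from the snippet):

SELECT count(*) FROM read_parquet('data/public/performance/*.parquet');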

Many thanks to @l1t1, @szarnyasg, @Tishj, and @arjenpdevries for their help in sorting through the out-of-memory and conversion date issues here.

Kudos to the entire DuckDB team on v0.9.1. What an amazing tool!

Good news: v0.9.1 solves the remaining out-of-memory issues. (I’m still using ALL_VARCHAR = 1 to avoid the “Conversion error” issue.)

COPY (
  SELECT *
  FROM read_csv_auto('data/stage/*.csv',
    HEADER = FALSE, DATEFORMAT = '%m%Y', DELIM = '|', ALL_VARCHAR = 1)
) TO 'data/public/performance' (
    FORMAT PARQUET, PER_THREAD_OUTPUT TRUE);

With 0.9.1, there’s no need to set either THREADS or MEMORY_LIMIT. Based on some testing, it looks like the “no THREADS or MEMORY_LIMIT” configuration is equivalent to running with THREADS = 8, MEMORY_LIMIT = '4GB'. Both versions run in 26 minutes, use about 18 GB of RAM at peak, and do not use any temp files.
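
For reference, the explicit-settings run in that comparison just set those two options up front; the values below are the ones quoted above, not documented defaults:

-- explicit settings used only for the comparison run
PRAGMA threads = 8;
PRAGMA memory_limit = '4GB';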

Increasing the MEMORY_LIMIT has no impact.

Setting PER_THREAD_OUTPUT TRUE is critical: not using that option causes the out-of-memory issue to return.
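
For comparison, the variant that brought the out-of-memory issue back differs only in the output clause; the single-file target path below is just illustrative:

COPY (
  SELECT *
  FROM read_csv_auto('data/stage/*.csv',
    HEADER = FALSE, DATEFORMAT = '%m%Y', DELIM = '|', ALL_VARCHAR = 1)
) TO 'data/public/performance.parquet' (FORMAT PARQUET);  -- no PER_THREAD_OUTPUT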