duckdb: Too many open files error using temp directory on large CSV
What happens?
I have a Mac M1 Max with 32 GB of RAM.
I wrote code that does the following for each of 93 CSV files:
- create a temp table with the correct column names and types;
- copy a single CSV into that temp table, using the correct options; and
- copy the loaded table back out to a corresponding Parquet file.
The CSV files have the following characteristics:
- 93 files totalling 750 GB uncompressed;
- the smallest file is 125 MB; the largest file is 40 GB; and
- about 20 files are larger than 10 GB.
For each CSV file, I run the following SQL code using the DuckDB CLI.
```sql
PRAGMA memory_limit = '24GB';
PRAGMA temp_directory = 'data/temp/';
CREATE TEMP TABLE t1 (reference_pool_id VARCHAR, loan_identifier VARCHAR, monthly_reporting_period DATE, <more columns>);
COPY t1 FROM 'data/stage/2000Q1.csv' (
    HEADER FALSE, DATEFORMAT '%m%Y', DELIM '|', SAMPLE_SIZE 500000);
COPY t1 TO 'data/parquet/2000Q1.parquet' (FORMAT PARQUET);
DROP TABLE t1;
```
I'm able to run this code successfully for 91 of the 93 CSV files. However, for the two largest files (33 GB and 39 GB), I get the following error:
Error: near line 83: IO Error: Cannot open file "data/temp//duckdb_temp_storage-252.tmp": Too many open files
So it appears that loading the largest CSVs via the COPY statement opens too many temp files.
I also noticed some other interesting behavior with the temp files. During the run, the temp directory grows to about 300 GB and 252 files, and then DuckDB crashes. I have no idea why a 30 GB CSV file generates over 300 GB of temp files and then a crash.
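For reference, the effective limits for the session can be checked from SQL before running the COPY; a small sketch, assuming the standard current_setting() and duckdb_settings() introspection functions:

```sql
-- Double-check the limits the session is actually running with
SELECT current_setting('memory_limit')   AS memory_limit,
       current_setting('temp_directory') AS temp_directory,
       current_setting('threads')        AS threads;

-- Or list them via the settings table function
SELECT name, value
FROM duckdb_settings()
WHERE name IN ('memory_limit', 'temp_directory', 'threads');
```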
To Reproduce
Unfortunately, the data is proprietary. I don’t have time to figure out how to simulate a similar sized CSV using faker or the like; my apologies. I just thought that I should report the “too many temp files” error, since most current use cases may not use such large files. Thanks for all the hard work on DuckDB!
OS:
macOS Ventura 13.4.1
DuckDB Version:
0.8.1
DuckDB Client:
DuckDB CLI
Full Name:
Kyle Lundstedt
Affiliation:
IndustryVault
Have you tried this on the latest main branch?
I have tested with a main build
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- Yes, I have
About this issue
- State: closed
- Created 10 months ago
- Comments: 34 (4 by maintainers)
Wow! I re-tried the "single step" code, changing the syntax per @l1t1's suggestion. Since I'm using the `COLUMNS` option, I changed the function from `read_csv_auto` to `read_csv`. Here's the code, along with a summary of the results:
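A minimal sketch of what such a single-step read_csv-to-Parquet COPY can look like, reusing the column names from the CREATE TEMP TABLE above with VARCHAR types throughout; the column list and options here are placeholders, not the exact code from this comment:

```sql
-- Single-step load: stream the CSV straight into Parquet, no temp table
COPY (
    SELECT *
    FROM read_csv('data/stage/2000Q1.csv',
                  delim = '|',
                  header = false,
                  columns = {
                      'reference_pool_id': 'VARCHAR',
                      'loan_identifier': 'VARCHAR',
                      'monthly_reporting_period': 'VARCHAR'
                      -- ... remaining columns ...
                  })
) TO 'data/parquet/2000Q1.parquet' (FORMAT PARQUET);
```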
Many thanks to @l1t1, @szarnyasg, @Tishj, and @arjenpdevries for their help in sorting through the out-of-memory and date-conversion issues here.
Kudos to the entire DuckDB team on v0.9.1. What an amazing tool!
Good news: v0.9.1 solves the remaining out-of-memory issues. (I'm still using `ALL_VARCHAR = 1` to avoid the "Conversion error" issue.)
With 0.9.1, there's no need to set either THREADS or MEMORY_LIMIT. Based on some testing, leaving both unset appears to be equivalent to running with `THREADS = 8, MEMORY_LIMIT = '4GB'`. Both configurations run in 26 minutes, use about 18 GB of RAM at peak, and do not use any temp files. Increasing the MEMORY_LIMIT has no impact.
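For comparison, pinning those resources explicitly is just two settings at the top of the session; a sketch using DuckDB's SET syntax with the values quoted above:

```sql
-- Pin thread count and memory budget for the comparison run
SET threads = 8;
SET memory_limit = '4GB';
```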
The `PER_THREAD_OUTPUT TRUE` option is critical. Not using that option causes the out-of-memory issue to return.
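For context, a sketch of a Parquet export with that option enabled; the query and output path are placeholders. With PER_THREAD_OUTPUT, the target is treated as a directory and each thread writes its own Parquet file:

```sql
-- Each thread writes its own Parquet file into the target directory
COPY (SELECT * FROM t1)
TO 'data/parquet/2000Q1' (FORMAT PARQUET, PER_THREAD_OUTPUT TRUE);
```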