duckdb: Question, Importing data from Parquet does not work ?

I am importing the following parquet files, https://drive.google.com/drive/folders/1mZC3NuPBZC4mjP3_kH18c9fLrv8ME7RU

the size in Parquet is around 2.5 GB Duckdb DB size is 8.2 GB

I am using this python script

import duckdb 
con = duckdb.connect(database='db_import')


df =con.execute('''


CREATE TABLE partsupp AS SELECT * FROM 'C:/Users/mimoune.djouallah/Desktop/TPC-H-SF10/Parquet/partsupp.parquet';
CREATE TABLE part AS SELECT * FROM 'C:/Users/mimoune.djouallah/Desktop/TPC-H-SF10/Parquet/part.parquet';
CREATE TABLE supplier AS SELECT * FROM 'C:/Users/mimoune.djouallah/Desktop/TPC-H-SF10/Parquet/supplier.parquet';
CREATE TABLE nation AS SELECT * FROM 'C:/Users/mimoune.djouallah/Desktop/TPC-H-SF10/Parquet/nation.parquet';
CREATE TABLE region AS SELECT * FROM 'C:/Users/mimoune.djouallah/Desktop/TPC-H-SF10/Parquet/region.parquet';
CREATE TABLE lineitem AS SELECT * FROM 'C:/Users/mimoune.djouallah/Desktop/TPC-H-SF10/Parquet/lineitem.parquet';
CREATE TABLE orders AS SELECT * FROM 'C:/Users/mimoune.djouallah/Desktop/TPC-H-SF10/Parquet/orders.parquet';
CREATE TABLE customer AS SELECT * FROM 'C:/Users/mimoune.djouallah/Desktop/TPC-H-SF10/Parquet/customer.parquet';

''')

when I run PRAGMA show_tables;

it show nothing, how I check why the import is not working ?

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 24 (5 by maintainers)

Most upvoted comments

We generally don’t update existing releases, but we might consider releasing 0.4.1 soon which has the fix included.

The database does not get corrupted. The problem is that because the FileExists method is broken on the older version, DuckDB does not realize the database exists in the first place, so it creates a new database instead.

Imagine the code looks like this:

if (FileExists(database)) {
   OpenDatabase(database);
} else {
   CreateNewDatabase(database);
}

Now if FileExists returns false even though the database exists, the system will never open the existing database, and will instead always create a new one.

Hi, it seens this issue remains as of duckdb version 0.4.0. I will put some notes about what I have found so far:

  • It is related to the parquet file size: a file (either compressed or uncompressed) slightly above 2Gb shows the error IO Error: No files found that match the pattern "<file-path>", but a file slightly below this threshold does not show the error. Here is a dummy generated test file: duckdb_io_error_gzip.parquet.
  • It is a regression bug that was introduced after 0.3.2 release, because this release does not show this error but release 0.3.3-0.3.4 do.
  • It does not seems a memory issue, because it affects the schema and metadata queries too (see below) and I have tested in environments with more than 500Gb RAM.

What happens?

Querying Parquet files above 2Gb shows the error IO Error: No files found that match the pattern "<file-path>" in the Windows enviroment.

To Reproduce

  1. Generate dummy test file. Here is one example: duckdb_io_error_gzip.parquet
  2. Try one of those queries:
  • SELECT * FROM parquet_schema('M:\Temp\duckdb_io_error_gzip.parquet')
  • SELECT * FROM parquet_metadata('M:\Temp\duckdb_io_error_gzip.parquet')
  • SELECT * FROM parquet_scan('M:\Temp\duckdb_io_error_gzip.parquet') LIMIT 100
  • CREATE VIEW duckdb_io_error AS SELECT * FROM parquet_scan('M:\Temp\duckdb_io_error_gzip.parquet')

Environment:

  • OS: Windows 10 64bit
  • DuckDB Version: 0.4.0
  • DuckDB Client: Python, R, JDBC and CLI

Identity Disclosure:

  • Full Name: Fabio Monteiro Vaz
  • Affiliation: Ipea

Before Submitting

  • [v] Have you tried this on the latest master branch?
  • Python: pip install duckdb --upgrade --pre
  • R: install.packages("https://github.com/duckdb/duckdb/releases/download/v0.4.0/duckdb_r_src.tar.gz", type = "source", repos = NULL)
  • Other Platforms: You can find binaries here or compile from source.
  • [v] Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

I have the same issue as @djouallah. I’ve created a testcase (see below), which probably is redudant, but for the sake of future reference, this is what I did. After running the code below a 3Gb duckdb is created, but when queried, for example with DBeaver, it does not show the ‘example’-table in the database, and ‘PRAGMA SHOW_TABLES;’ is empty.

create dummy parquet-file (51Mb)

Create a dummy parquet-file that can be shared.

import pandas as pd
fpath = 'C:\exampledata\example.parquet'

df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/International-trade/International-trade-March-2022-quarter/Download-data/overseas-trade-indexes-March-2022-quarter-provisional-csv.csv', low_memory=False)
for i in range(8):
    df = pd.concat([df,df])
    
df.to_parquet(fpath, index=False)

import dummydata to duckdb

import duckdb
fpath = 'C:\exampledata\example.parquet'
duckdb_path = 'C:\exampledata\example.duckdb'
db_name = 'test_db2'

duckdb_con = duckdb.connect(duckdb_path, read_only=False)


q_drop_table = f"DROP TABLE IF EXISTS {db_name};"
duckdb_con.execute(q_drop_table)

create = f"""CREATE TABLE {db_name}(Series_reference     VARCHAR,
                                    Period              FLOAT,
                                    Data_value          FLOAT,
                                    STATUS               VARCHAR,
                                    UNITS                VARCHAR,
                                    MAGNTUDE             BIGINT,
                                    Subject              VARCHAR,
                                    "Group"              VARCHAR,
                                    Series_title_1       VARCHAR,
                                    Series_title_2       VARCHAR,
                                    Series_title_3       VARCHAR,
                                    Series_title_4       VARCHAR,
                                    Series_title_5       VARCHAR)"""
        
duckdb_con.execute(create)


copy = f"COPY {db_name} FROM '{fpath}' (FORMAT PARQUET);"

for i in range(8):
    print('run-number: ', i)
    duckdb_con.execute(copy)

I can’t reproduce this issue, this seems to work fine for me:

>>> import duckdb
>>> con = duckdb.connect(database='db_import')
>>> con.execute('create table  LINEITEM as select * FROM lineitem.parquet')
>>> exit()


>>> import duckdb
>>> con = duckdb.connect(database='db_import')
>>> con.execute('select count(*) from lineitem').df()
   count_star()
0      59986052

yes I missed the (FORMAT PARQUET); part

So the copy statement wasn’t working? This syntax should be equivalent:

COPY tbl FROM 'input.parquet' (FORMAT PARQUET)

https://duckdb.org/docs/guides/import/parquet_import

Hello! It is somewhat expected that DuckDB is less compressed than Parquet at the moment. We aspire to get there over time! The reason for the difference is that DuckDB needs to support more granular access to the data as well as individual updates and deletes. We need to be able to write a small piece of data to the DB rather than rewrite the whole file like parquet does.

Check this out to see the details:

Select * from pragma_storage_info('mytable')