duckdb: Question, Importing data from Parquet does not work ?
I am importing the following parquet files, https://drive.google.com/drive/folders/1mZC3NuPBZC4mjP3_kH18c9fLrv8ME7RU
the size in Parquet is around 2.5 GB Duckdb DB size is 8.2 GB
I am using this python script
import duckdb
con = duckdb.connect(database='db_import')
df =con.execute('''
CREATE TABLE partsupp AS SELECT * FROM 'C:/Users/mimoune.djouallah/Desktop/TPC-H-SF10/Parquet/partsupp.parquet';
CREATE TABLE part AS SELECT * FROM 'C:/Users/mimoune.djouallah/Desktop/TPC-H-SF10/Parquet/part.parquet';
CREATE TABLE supplier AS SELECT * FROM 'C:/Users/mimoune.djouallah/Desktop/TPC-H-SF10/Parquet/supplier.parquet';
CREATE TABLE nation AS SELECT * FROM 'C:/Users/mimoune.djouallah/Desktop/TPC-H-SF10/Parquet/nation.parquet';
CREATE TABLE region AS SELECT * FROM 'C:/Users/mimoune.djouallah/Desktop/TPC-H-SF10/Parquet/region.parquet';
CREATE TABLE lineitem AS SELECT * FROM 'C:/Users/mimoune.djouallah/Desktop/TPC-H-SF10/Parquet/lineitem.parquet';
CREATE TABLE orders AS SELECT * FROM 'C:/Users/mimoune.djouallah/Desktop/TPC-H-SF10/Parquet/orders.parquet';
CREATE TABLE customer AS SELECT * FROM 'C:/Users/mimoune.djouallah/Desktop/TPC-H-SF10/Parquet/customer.parquet';
''')
when I run PRAGMA show_tables;
it show nothing, how I check why the import is not working ?
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 24 (5 by maintainers)
We generally don’t update existing releases, but we might consider releasing 0.4.1 soon which has the fix included.
The database does not get corrupted. The problem is that because the
FileExistsmethod is broken on the older version, DuckDB does not realize the database exists in the first place, so it creates a new database instead.Imagine the code looks like this:
Now if
FileExistsreturns false even though the database exists, the system will never open the existing database, and will instead always create a new one.Hi, it seens this issue remains as of duckdb version
0.4.0. I will put some notes about what I have found so far:IO Error: No files found that match the pattern "<file-path>", but a file slightly below this threshold does not show the error. Here is a dummy generated test file: duckdb_io_error_gzip.parquet.0.3.2release, because this release does not show this error but release0.3.3-0.3.4do.What happens?
Querying Parquet files above 2Gb shows the error
IO Error: No files found that match the pattern "<file-path>"in the Windows enviroment.To Reproduce
SELECT * FROM parquet_schema('M:\Temp\duckdb_io_error_gzip.parquet')SELECT * FROM parquet_metadata('M:\Temp\duckdb_io_error_gzip.parquet')SELECT * FROM parquet_scan('M:\Temp\duckdb_io_error_gzip.parquet') LIMIT 100CREATE VIEW duckdb_io_error AS SELECT * FROM parquet_scan('M:\Temp\duckdb_io_error_gzip.parquet')Environment:
Identity Disclosure:
Before Submitting
masterbranch?pip install duckdb --upgrade --preinstall.packages("https://github.com/duckdb/duckdb/releases/download/v0.4.0/duckdb_r_src.tar.gz", type = "source", repos = NULL)I have the same issue as @djouallah. I’ve created a testcase (see below), which probably is redudant, but for the sake of future reference, this is what I did. After running the code below a 3Gb duckdb is created, but when queried, for example with DBeaver, it does not show the ‘example’-table in the database, and ‘PRAGMA SHOW_TABLES;’ is empty.
create dummy parquet-file (51Mb)
Create a dummy parquet-file that can be shared.
import dummydata to duckdb
I can’t reproduce this issue, this seems to work fine for me:
yes I missed the (FORMAT PARQUET); part
So the copy statement wasn’t working? This syntax should be equivalent:
https://duckdb.org/docs/guides/import/parquet_import
Hello! It is somewhat expected that DuckDB is less compressed than Parquet at the moment. We aspire to get there over time! The reason for the difference is that DuckDB needs to support more granular access to the data as well as individual updates and deletes. We need to be able to write a small piece of data to the DB rather than rewrite the whole file like parquet does.
Check this out to see the details: