ClickHouse: [Apache Iceberg] Failing to create table

Creating a table fails when using Apache Iceberg:

CREATE TABLE iceberg Engine=Iceberg(...)

2023.05.11 10:21:36.140819 [ 19080 ] {bc194292-7197-4ab9-9a30-b9461ab43ecd} <Error> TCPHandler: Code: 499. DB::Exception: Failed to get object info: No response body.. HTTP response code: 404. (S3_ERROR), Stack trace (when copying this message, always include the lines below):
0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0xe3b83d5 in /usr/bin/clickhouse
1. ? @ 0x9801d4d in /usr/bin/clickhouse
2. DB::S3::getObjectInfo(DB::S3::Client const&, String const&, String const&, String const&, DB::S3Settings::RequestSettings const&, bool, bool, bool) @ 0x126aa4fe in /usr/bin/clickhouse
3. DB::StorageS3Source::KeysIterator::KeysIterator(DB::S3::Client const&, String const&, std::vector<String, std::allocator<String>> const&, String const&, DB::S3Settings::RequestSettings const&, std::shared_ptr<DB::IAST>, DB::Block const&, std::shared_ptr<DB::Context const>, std::vector<DB::StorageS3Source::KeyWithInfo, std::allocator<DB::StorageS3Source::KeyWithInfo>>*) @ 0x1447e152 in /usr/bin/clickhouse
4. DB::StorageS3::createFileIterator(DB::StorageS3::Configuration const&, bool, std::shared_ptr<DB::Context const>, std::shared_ptr<DB::IAST>, DB::Block const&, std::vector<DB::StorageS3Source::KeyWithInfo, std::allocator<DB::StorageS3Source::KeyWithInfo>>*) @ 0x14486c4b in /usr/bin/clickhouse
5. DB::StorageS3::getTableStructureFromDataImpl(DB::StorageS3::Configuration const&, std::optional<DB::FormatSettings> const&, std::shared_ptr<DB::Context const>) @ 0x144861c6 in /usr/bin/clickhouse

We have looked at the codebase and I have the feeling that you're calling HeadObject on a folder to get the last modification time, but a folder on S3 never has a last modification time. We have run the following command against the same bucket and keys set on the Iceberg engine and we see the same exact error (aws s3api head-object --bucket mybucket --key mykey returns a 404).
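As an illustration (with a hypothetical bucket and keys, not the ones from this issue): a HEAD request on a real object returns metadata, while a HEAD on the trailing-slash "folder" key returns 404, because an S3 "folder" is only a key prefix with no object behind it.

# HEAD on an actual object: succeeds and returns LastModified, ContentLength, etc.
aws s3api head-object --bucket my-bucket --key my-prefix/data/file.parquet

# HEAD on the "folder" itself: 404 Not Found, since the prefix has no object behind it
aws s3api head-object --bucket my-bucket --key my-prefix/data/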

Version used: 23.4.1.1943

About this issue

  • State: closed
  • Created a year ago
  • Comments: 20 (12 by maintainers)

Most upvoted comments

Sure, schema inference works only if the user didn't specify the structure manually. You can always specify the structure in the CREATE statement as usual:

CREATE TABLE iceberg (column1 Type1, column2 Type2, ...) Engine=Iceberg(...)

The same works for the table function (as for the s3 table function):

SELECT * FROM iceberg(url, format, 'column1 Type1, column2 Type2')
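For this particular table, a minimal sketch of that workaround (assuming the same URL used later in this issue and that S3 credentials are already configured) would look like this; the Iceberg table created below in spark-sql has a single bigint column, which maps to Int64 in ClickHouse:

-- Engine form: the structure is given explicitly, so no schema inference is needed
CREATE TABLE test (a Int64)
Engine = Iceberg('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/')

-- Table function form, with the format and structure passed explicitly
SELECT * FROM iceberg('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/', 'Parquet', 'a Int64')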

What do you think about the enhancement I've shared?

It's a very good idea.

Why not use the metadata for schema inference?

It just wasn't implemented. The current schema inference was simply derived from the S3 table engine.

Actually, @kssenii it’s your code 😃 https://github.com/ClickHouse/ClickHouse/pull/43454

It was needed for calculating total_size for the progress bar. We use the same iterator for reading and for schema inference. For reading it's OK, since we will do all these HEAD requests anyway, but for schema inference we should not do it: we read only the first few files, and we don't need to calculate total_size because we don't send progress during schema inference. We can add a flag to KeysIterator so that, during schema inference, the HEAD request is done only when a new key is requested. I will create a PR for it.

UPD: https://github.com/ClickHouse/ClickHouse/pull/50203

Hey @kssenii

Let me give you a simple and easy-to-reproduce example.

How did I generate the data?

Run spark-sql with:

spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.2.1 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.iceberg.type=hadoop \
  --conf spark.sql.catalog.iceberg.warehouse=s3://cs-tmp/akilic/iceberg-catalog/ \
  --conf spark.sql.defaultCatalog=iceberg
spark-sql> create table iceberg.test (a bigint) TBLPROPERTIES('format-version'='2');
spark-sql> INSERT INTO iceberg.test values 1, 2, 3;
spark-sql> select * from iceberg.test;
1
2
3

As you can see, I'm able to read the data with spark-sql.

What do we have in AWS?

Data

$ aws s3 ls  s3://cs-tmp/akilic/iceberg-catalog/test/data/
2023-05-17 22:52:14        422 00000-0-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet
2023-05-17 22:52:15        425 00001-1-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet

Metadata

$ aws s3 ls s3://cs-tmp/akilic/iceberg-catalog/test/metadata/
2023-05-17 22:52:15       6629 0e150576-5308-4020-a66f-cdf8cc4cc2c2-m0.avro
2023-05-17 22:52:15       4263 snap-7532076000798921356-1-0e150576-5308-4020-a66f-cdf8cc4cc2c2.avro
2023-05-17 22:51:38        847 v1.metadata.json
2023-05-17 22:52:16       1891 v2.metadata.json
2023-05-17 22:52:16          1 version-hint.text

ClickHouse

I’m using the docker image clickhouse-server:23.4-alpine

The issue happens when executing this simple query:

$  SET send_logs_level = 'trace'; 
$ create table test Engine = Iceberg('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/')
[clickhouse1] 2023.05.17 21:07:03.317667 [ 9 ] {9cedf26e-46a4-4b0f-a616-995a1919f8d7} <Trace> ContextAccess (default): Access granted: CREATE TABLE ON default.test
[clickhouse1] 2023.05.17 21:07:03.319788 [ 9 ] {9cedf26e-46a4-4b0f-a616-995a1919f8d7} <Trace> ContextAccess (default): Access granted: S3 ON *.*
[clickhouse1] 2023.05.17 21:07:03.321678 [ 9 ] {9cedf26e-46a4-4b0f-a616-995a1919f8d7} <Trace> ContextAccess (default): Access granted: S3 ON *.*
[clickhouse1] 2023.05.17 21:07:03.337961 [ 9 ] {9cedf26e-46a4-4b0f-a616-995a1919f8d7} <Trace> NamedCollectionsUtils: Loaded 0 collections from SQL
[clickhouse1] 2023.05.17 21:07:04.549157 [ 9 ] {9cedf26e-46a4-4b0f-a616-995a1919f8d7} <Trace> DataLake: New configuration path: akilic/iceberg-catalog/test/, keys: s3://cs-tmp/akilic/iceberg-catalog/test/data/00000-0-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet, s3://cs-tmp/akilic/iceberg-catalog/test/data/00001-1-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet
[clickhouse1] 2023.05.17 21:07:06.354270 [ 9 ] {9cedf26e-46a4-4b0f-a616-995a1919f8d7} <Error> executeQuery: Code: 499. DB::Exception: Failed to get object info: No response body.. HTTP response code: 404. (S3_ERROR) (version 23.4.2.11 (official build)) (from 127.0.0.1:56116) (in query: create table test Engine = Iceberg('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/')), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0xbc87ee4 in /usr/bin/clickhouse
1. ? @ 0x86621f8 in /usr/bin/clickhouse
2. DB::S3::getObjectInfo(DB::S3::Client const&, String const&, String const&, String const&, DB::S3Settings::RequestSettings const&, bool, bool, bool) @ 0xf5f857c in /usr/bin/clickhouse
3. DB::StorageS3Source::KeysIterator::KeysIterator(DB::S3::Client const&, String const&, std::vector<String, std::allocator<String>> const&, String const&, DB::S3Settings::RequestSettings const&, std::shared_ptr<DB::IAST>, DB::Block const&, std::shared_ptr<DB::Context const>, std::vector<DB::StorageS3Source::KeyWithInfo, std::allocator<DB::StorageS3Source::KeyWithInfo>>*) @ 0x10e09e78 in /usr/bin/clickhouse
4. DB::StorageS3::createFileIterator(DB::StorageS3::Configuration const&, bool, std::shared_ptr<DB::Context const>, std::shared_ptr<DB::IAST>, DB::Block const&, std::vector<DB::StorageS3Source::KeyWithInfo, std::allocator<DB::StorageS3Source::KeyWithInfo>>*) @ 0x10e11170 in /usr/bin/clickhouse
5. DB::StorageS3::getTableStructureFromDataImpl(DB::StorageS3::Configuration const&, std::optional<DB::FormatSettings> const&, std::shared_ptr<DB::Context const>) @ 0x10e10910 in /usr/bin/clickhouse
6. DB::StorageS3::StorageS3(DB::StorageS3::Configuration const&, std::shared_ptr<DB::Context const>, DB::StorageID const&, DB::ColumnsDescription const&, DB::ConstraintsDescription const&, String const&, std::optional<DB::FormatSettings>, bool, std::shared_ptr<DB::IAST>) @ 0x10e0f794 in /usr/bin/clickhouse
7. ? @ 0x10ed2294 in /usr/bin/clickhouse
8. DB::StorageFactory::get(DB::ASTCreateQuery const&, String const&, std::shared_ptr<DB::Context>, std::shared_ptr<DB::Context>, DB::ColumnsDescription const&, DB::ConstraintsDescription const&, bool) const @ 0x10c7d7fc in /usr/bin/clickhouse
9. DB::InterpreterCreateQuery::doCreateTable(DB::ASTCreateQuery&, DB::InterpreterCreateQuery::TableProperties const&, std::unique_ptr<DB::DDLGuard, std::default_delete<DB::DDLGuard>>&) @ 0x104b89b0 in /usr/bin/clickhouse
10. DB::InterpreterCreateQuery::createTable(DB::ASTCreateQuery&) @ 0x104b2a9c in /usr/bin/clickhouse
11. DB::InterpreterCreateQuery::execute() @ 0x104bd204 in /usr/bin/clickhouse
12. ? @ 0x1097d198 in /usr/bin/clickhouse
13. DB::executeQuery(String const&, std::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum) @ 0x1097a5c4 in /usr/bin/clickhouse
14. DB::TCPHandler::runImpl() @ 0x11531b1c in /usr/bin/clickhouse
15. DB::TCPHandler::run() @ 0x115443e4 in /usr/bin/clickhouse
16. Poco::Net::TCPServerConnection::start() @ 0x121b0604 in /usr/bin/clickhouse
17. Poco::Net::TCPServerDispatcher::run() @ 0x121b1b20 in /usr/bin/clickhouse
18. Poco::PooledThread::run() @ 0x1235ac7c in /usr/bin/clickhouse
19. Poco::ThreadImpl::runnableEntry(void*) @ 0x12358544 in /usr/bin/clickhouse
20. start_thread @ 0x7624 in /lib/libpthread.so.0
21. ? @ 0xd149c in /lib/libc.so.6

In the trace you see two data files mentioned:

  • s3://cs-tmp/akilic/iceberg-catalog/test/data/00000-0-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet
  • s3://cs-tmp/akilic/iceberg-catalog/test/data/00001-1-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet

and we can see that this is consistent with:

$ aws s3 ls  s3://cs-tmp/akilic/iceberg-catalog/test/data/
2023-05-17 22:52:14        422 00000-0-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet
2023-05-17 22:52:15        425 00001-1-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet

Let’s check the head-object request for both files:

aws s3api head-object --bucket cs-tmp --key akilic/iceberg-catalog/test/data/00000-0-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet
{
    "AcceptRanges": "bytes",
    "Expiration": "expiry-date=\"Sat, 17 Jun 2023 00:00:00 GMT\", rule-id=\"1-month retention\"",
    "LastModified": "2023-05-17T20:52:14+00:00",
    "ContentLength": 422,
    "ETag": "\"9d27f6c2a869bf8424fc66076918b5d9\"",
    "ContentType": "binary/octet-stream",
    "ServerSideEncryption": "AES256",
    "Metadata": {}
}

aws s3api head-object --bucket cs-tmp --key akilic/iceberg-catalog/test/data/00001-1-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet
{
    "AcceptRanges": "bytes",
    "Expiration": "expiry-date=\"Sat, 17 Jun 2023 00:00:00 GMT\", rule-id=\"1-month retention\"",
    "LastModified": "2023-05-17T20:52:15+00:00",
    "ContentLength": 425,
    "ETag": "\"1c919896c4bfc3f46260c2d7baa9e55c\"",
    "ContentType": "binary/octet-stream",
    "ServerSideEncryption": "AES256",
    "Metadata": {}
}

If we try to read the data without Iceberg:

select * from s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/data/00001-1-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet')
┌─a─┐
│ 2 │
│ 3 │
└───┘

select * from s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/0e150576-5308-4020-a66f-cdf8cc4cc2c2-m0.avro')

SELECT *
FROM s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/0e150576-5308-4020-a66f-cdf8cc4cc2c2-m0.avro')

Query id: 4050b563-57f2-4b2f-ab12-f4aac87a9cff

[clickhouse1] 2023.05.17 21:14:14.402574 [ 9 ] {4050b563-57f2-4b2f-ab12-f4aac87a9cff} <Debug> executeQuery: (from 127.0.0.1:56116) select * from s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/0e150576-5308-4020-a66f-cdf8cc4cc2c2-m0.avro') (stage: Complete)
Exception on client:
Code: 92. DB::Exception: Tuple cannot be empty: while receiving packet from localhost:9000. (EMPTY_DATA_PASSED)

select * from s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/snap-7532076000798921356-1-0e150576-5308-4020-a66f-cdf8cc4cc2c2.avro') Format Vertical

Row 1:
──────
manifest_path:             s3://cs-tmp/akilic/iceberg-catalog/test/metadata/0e150576-5308-4020-a66f-cdf8cc4cc2c2-m0.avro
manifest_length:           6629
partition_spec_id:         0
content:                   0
sequence_number:           1
min_sequence_number:       1
added_snapshot_id:         7532076000798921356
added_data_files_count:    2
existing_data_files_count: 0
deleted_data_files_count:  0
added_rows_count:          3
existing_rows_count:       0
deleted_rows_count:        0
partitions:                []

SELECT *
FROM s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/v2.metadata.json')

Code: 117. DB::Exception: Expected field "meta" with columns names and types, found field format-version: Cannot extract table structure from JSON format file. You can specify the structure manually. (INCORRECT_DATA) (version 23.4.2.11 (official build)) (from 127.0.0.1:39778) (in query: select * from s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/v2.metadata.json')), Stack trace (when copying this message, always include the lines below):
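Not something tried in the thread, but if the goal is just to inspect that metadata file from ClickHouse, one option (a sketch, assuming the file is small) is to force the JSONAsString input format so the whole document comes back as a single String column instead of going through JSON schema inference:

-- Read the Iceberg metadata JSON as one raw string row
SELECT *
FROM s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/v2.metadata.json', 'JSONAsString')
FORMAT Vertical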

I've put in as much detail as possible to help determine whether the issue is in the way we generated the data or whether there is a real bug in the 23.4 release.