ClickHouse: S3 multi-threaded reading does not work

See https://clickhousedb.slack.com/archives/CU478UEQZ/p1694647285875419

Describe the situation
No matter the core count (in the example above), once max_threads exceeds the number of Parquet files in S3, queries do not perform any better. With ~100–250 MB files, ClickHouse should be able to fetch parts of each file concurrently (row group size is 112k, and row counts are in the millions for most if not all files, IIUC).

For example, a 32-core machine with a 12.5 Gbit NIC running 128 threads is no faster than a 128-core machine with a 170 Gbit NIC when reading from S3.

How to reproduce

  • Which ClickHouse server version to use: any 23.8
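A minimal sketch of the kind of query involved, assuming a hypothetical bucket and path (the bucket name, wildcard, and thread count below are placeholders, not taken from the original report):

```sql
-- Hypothetical repro: bucket and path are placeholders.
-- With N Parquet files matching the wildcard, raising max_threads beyond N
-- reportedly yields no further speedup, even though each ~100-250 MB file
-- contains multiple row groups that could be fetched concurrently.
SELECT count()
FROM s3('https://my-bucket.s3.amazonaws.com/data/*.parquet', 'Parquet')
SETTINGS max_threads = 128;
```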

Expected performance
When I go to 300 threads on a 128-core machine, I’d expect roughly half the query time.

This also seems to happen with s3Cluster (no performance increase even with more cores and more network interfaces).

About this issue

  • Original URL
  • State: closed
  • Created 9 months ago
  • Reactions: 1
  • Comments: 19 (6 by maintainers)

Most upvoted comments

Huh, parallel-reading from a single S3 file appears to be broken. I’ll fix it tomorrow, unless ~@Avogar~ @pufit fixes it first.

(The problem is much more obvious on one big file, like hits.parquet. Reading it with s3() is many times slower than url().)
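The comparison can be sketched with queries like the following; the URL is illustrative (hits.parquet here stands in for any single large Parquet object):

```sql
-- Single large file: with parallel reading broken, reading it via s3()
-- is reportedly many times slower than reading the same object via url().
SELECT count()
FROM s3('https://my-bucket.s3.amazonaws.com/hits.parquet', 'Parquet');

SELECT count()
FROM url('https://my-bucket.s3.amazonaws.com/hits.parquet', 'Parquet');
```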

I also noticed that we don’t reuse connections in s3() table function, so the query creates a new connection for each of the 2K reads. But I tried adding connection reuse, and the query didn’t get noticeably faster, weird. I’ll add it anyway, it’s easy, and surely it’ll be useful in other situations (especially cross-region).