ClickHouse: S3 multi-threaded reading does not work
See https://clickhousedb.slack.com/archives/CU478UEQZ/p1694647285875419
Describe the situation
No matter the core count (in the example above), once max_threads exceeds the number of Parquet files in S3, queries do not perform any better. With ~100-250 MB files, ClickHouse should be able to fetch parts of individual files concurrently (row group size is 112k, and row counts are in the millions for most if not all files, IIUC).
For example, a 32-core machine with a 12.5 Gbit NIC running 128 threads is no faster than a 128-core machine with a 170 Gbit NIC when reading from S3.
How to reproduce
- Which ClickHouse server version to use: any 23.8 release (see the sketch below)
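A minimal sketch of the kind of query involved, not taken from the original report — the bucket, path, file count, and unsigned (NOSIGN) access are all assumptions:

```sql
-- Hypothetical repro; bucket, path, and file count are placeholders, not from the report.
-- The data is many ~100-250 MB Parquet files, each containing multiple row groups,
-- so raising max_threads beyond the file count should keep helping, but it does not.
SELECT count()
FROM s3('https://my-bucket.s3.amazonaws.com/data/*.parquet', NOSIGN, 'Parquet')
SETTINGS max_threads = 32;

-- Same query with far more threads than files: expected to be faster, observed flat.
SELECT count()
FROM s3('https://my-bucket.s3.amazonaws.com/data/*.parquet', NOSIGN, 'Parquet')
SETTINGS max_threads = 128;
```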
Expected performance
When I go to 300 threads on a 128-core machine, I'd expect ~1/2 the query time.
This also seems to happen with s3Cluster (no performance increase even with more cores and more network interfaces).
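For reference, the s3Cluster variant of the same read — again a sketch, with the cluster name, path, and unsigned access as assumptions:

```sql
-- Hypothetical: cluster name and path are placeholders.
-- Fanning the file list out across cluster nodes still shows no speedup, per the report.
SELECT count()
FROM s3Cluster('default', 'https://my-bucket.s3.amazonaws.com/data/*.parquet', NOSIGN, 'Parquet')
SETTINGS max_threads = 128;
```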
About this issue
- State: closed
- Created 9 months ago
- Reactions: 1
- Comments: 19 (6 by maintainers)
Huh, parallel reading from a single S3 file appears to be broken. I’ll fix it tomorrow, unless ~~@Avogar~~ @pufit fixes it first.
(The problem is much more obvious on one big file, like hits.parquet. Reading it with s3() is many times slower than with url().)

I also noticed that we don’t reuse connections in the s3() table function, so the query creates a new connection for each of the ~2K reads. But I tried adding connection reuse, and the query didn’t get noticeably faster, which is odd. I’ll add it anyway; it’s easy, and it will surely be useful in other situations (especially cross-region).
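A quick way to see the single-file symptom described above — a sketch, with the URL as a placeholder for an S3-hosted copy of hits.parquet (any sufficiently large Parquet object would do):

```sql
-- Same object read through two table functions. While parallel reading of a
-- single S3 file is broken, the s3() read is many times slower than the url() read.
-- The URL below is a placeholder, not the actual dataset location.
SELECT count()
FROM s3('https://my-bucket.s3.amazonaws.com/hits.parquet', NOSIGN, 'Parquet');

SELECT count()
FROM url('https://my-bucket.s3.amazonaws.com/hits.parquet', 'Parquet');
```

Once single-file parallel reading works, the two queries should land in the same ballpark, with s3() ideally pulling ahead thanks to concurrent range reads.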