ClickHouse: Massive slowdown when simply including a (large) column in the result

Describe the situation
Simply adding a column to the result (even if it is only used for a final lookup) slows the whole query down massively, since as far as I can tell the column gets read completely.

Please note that I’m just evaluating ClickHouse and I might be doing something dumb when specifying the table layout or writing queries.

How to reproduce
Running on a single host with NVMe (read bandwidth ~2 GiB/s) and ClickHouse 19.15.2.2.

Dataset is publicly available: https://blob.dolansoft.org/datasets/boyter-10m-repos/clickhouse-columns.tar (⚠️ ~50GiB / 3.5B+ rows)

ATTACH TABLE repositories
(
    `id` UInt32 CODEC(DoubleDelta), 
    `source` LowCardinality(String) CODEC(LZ4), 
    `user` String CODEC(LZ4), 
    `name` String CODEC(LZ4)
)
ENGINE = MergeTree
ORDER BY id
SETTINGS index_granularity = 8192

ATTACH TABLE files
(
    `repository_id` UInt32 CODEC(DoubleDelta), 
    `location` String CODEC(ZSTD(1)), 
    `language` LowCardinality(String), 
    `possible_languages` Array(LowCardinality(String)), 
    `bytes` UInt32 CODEC(T64, LZ4), 
    `lines` UInt32 CODEC(T64, LZ4), 
    `code` UInt32 CODEC(T64, LZ4), 
    `comment` UInt32 CODEC(T64, LZ4), 
    `blank` UInt32 CODEC(T64, LZ4), 
    `complexity` UInt32 CODEC(T64, LZ4), 
    `weighted_complexity` UInt32 CODEC(T64, LZ4)
)
ENGINE = MergeTree
ORDER BY repository_id
SETTINGS index_granularity = 8192

Query A (fast):

SELECT 
    repository_id, 
    lines
FROM files
WHERE language = 'Plain Text'
ORDER BY lines DESC
LIMIT 1

┌─repository_id─┬─────lines─┐
│       3919430 │ 347671811 │
└───────────────┴───────────┘

1 rows in set. Elapsed: 3.587 sec. Processed 3.53 billion rows, 4.82 GB (983.87 million rows/s., 1.34 GB/s.) 

Query B (slow):

SELECT 
    repository_id, 
    location, 
    lines
FROM files
WHERE language = 'Plain Text'
ORDER BY lines DESC
LIMIT 1

┌─repository_id─┬─location────────────────┬─────lines─┐
│       3919430 │ data/1366100696temp.txt │ 347671811 │
└───────────────┴─────────────────────────┴───────────┘

1 rows in set. Elapsed: 25.234 sec. Processed 3.53 billion rows, 13.53 GB (139.87 million rows/s., 536.21 MB/s.) 

Expected performance
I would expect both of these queries to take approximately the same time, since ClickHouse could ignore the location column until it has found the single matching row and then read just that one chunk from the (much heavier) location column. Instead it appears to read the whole location column, which slows the query down by roughly 8x. I have also tried argMax() instead of ORDER BY x LIMIT 1, but it seems to suffer from the same issue.
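
A sketch of what such an argMax variant might look like (illustrative only, not necessarily the exact query I ran):

SELECT
    argMax(repository_id, lines),
    argMax(location, lines),
    max(lines)
FROM files
WHERE language = 'Plain Text'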

Originally I also had a join to repositories in there, but that did not change the performance of either query so I’ve removed it in the interest of a more minimal reproducer.

About this issue

  • State: open
  • Created 5 years ago
  • Reactions: 5
  • Comments: 21 (13 by maintainers)

Most upvoted comments

@filimonov the issue is not related to CODECs.

create table Xp(id Int64, a UInt8, b String) Engine = MergeTree order by id
settings index_granularity=1;

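-- b is deliberately heavy: each row stores 50 sipHash64 values serialized into one string (roughly 1 KB per row)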
insert into Xp select number,
        number%103, 
       toString(arrayMap(x->sipHash64(number,x), range(50)))
  from numbers(10000000);

set max_threads=1;

select id, a, b from Xp order by a limit 1;
Elapsed: 18.953 sec. 

select id, a, b from Xp where id = (select id from Xp order by a limit 1);
Elapsed: 3.014 sec.

repository_id UInt32 CODEC(DoubleDelta),

This should probably be CODEC(DoubleDelta, LZ4), because DoubleDelta does not do compression on its own.
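
For example, the codec chain could be changed in place with something like this (a sketch; depending on the ClickHouse version, existing parts may only be recompressed on later merges or an OPTIMIZE):

ALTER TABLE files
    MODIFY COLUMN repository_id UInt32 CODEC(DoubleDelta, LZ4)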

It seems the reason is the slow decompression of ZSTD(1): CH decompresses repository_id, location and lines before applying ORDER BY ... LIMIT.

Try this for a test (to check my guess):

SELECT 
    repository_id, 
    location, 
    lines
FROM files
PREWHERE repository_id in (
           SELECT repository_id FROM files
           PREWHERE language = 'Plain Text'
           ORDER BY lines DESC
           LIMIT 1
           )
ORDER BY lines DESC
LIMIT 1