ClickHouse: Queries on distributed web tables using parallel replicas duplicate work rather than distributing it
Describe what’s wrong
When running queries against a web table through parallel replicas, the work is duplicated rather than distributed across the replicas, which causes redundant processing and slows down the final query result.
Does it reproduce on recent release?
Yes
Version
23.6.2
How to reproduce
ATTACH TABLE github_events UUID '127f4241-4a9b-4ecd-8a84-846b88069cb5' ON CLUSTER '{cluster}'
(
`file_time` DateTime,
`event_type` Enum8('CommitCommentEvent' = 1, 'CreateEvent' = 2, 'DeleteEvent' = 3, 'ForkEvent' = 4, 'GollumEvent' = 5, 'IssueCommentEvent' = 6, 'IssuesEvent' = 7, 'MemberEvent' = 8, 'PublicEvent' = 9, 'PullRequestEvent' = 10, 'PullRequestReviewCommentEvent' = 11, 'PushEvent' = 12, 'ReleaseEvent' = 13, 'SponsorshipEvent' = 14, 'WatchEvent' = 15, 'GistEvent' = 16, 'FollowEvent' = 17, 'DownloadEvent' = 18, 'PullRequestReviewEvent' = 19, 'ForkApplyEvent' = 20, 'Event' = 21, 'TeamAddEvent' = 22),
`actor_login` LowCardinality(String),
`repo_name` LowCardinality(String),
`created_at` DateTime,
`updated_at` DateTime,
`action` Enum8('none' = 0, 'created' = 1, 'added' = 2, 'edited' = 3, 'deleted' = 4, 'opened' = 5, 'closed' = 6, 'reopened' = 7, 'assigned' = 8, 'unassigned' = 9, 'labeled' = 10, 'unlabeled' = 11, 'review_requested' = 12, 'review_request_removed' = 13, 'synchronize' = 14, 'started' = 15, 'published' = 16, 'update' = 17, 'create' = 18, 'fork' = 19, 'merged' = 20),
`comment_id` UInt64,
`body` String,
`path` String,
`position` Int32,
`line` Int32,
`ref` LowCardinality(String),
`ref_type` Enum8('none' = 0, 'branch' = 1, 'tag' = 2, 'repository' = 3, 'unknown' = 4),
`creator_user_login` LowCardinality(String),
`number` UInt32,
`title` String,
`labels` Array(LowCardinality(String)),
`state` Enum8('none' = 0, 'open' = 1, 'closed' = 2),
`locked` UInt8,
`assignee` LowCardinality(String),
`assignees` Array(LowCardinality(String)),
`comments` UInt32,
`author_association` Enum8('NONE' = 0, 'CONTRIBUTOR' = 1, 'OWNER' = 2, 'COLLABORATOR' = 3, 'MEMBER' = 4, 'MANNEQUIN' = 5),
`closed_at` DateTime,
`merged_at` DateTime,
`merge_commit_sha` String,
`requested_reviewers` Array(LowCardinality(String)),
`requested_teams` Array(LowCardinality(String)),
`head_ref` LowCardinality(String),
`head_sha` String,
`base_ref` LowCardinality(String),
`base_sha` String,
`merged` UInt8,
`mergeable` UInt8,
`rebaseable` UInt8,
`mergeable_state` Enum8('unknown' = 0, 'dirty' = 1, 'clean' = 2, 'unstable' = 3, 'draft' = 4),
`merged_by` LowCardinality(String),
`review_comments` UInt32,
`maintainer_can_modify` UInt8,
`commits` UInt32,
`additions` UInt32,
`deletions` UInt32,
`changed_files` UInt32,
`diff_hunk` String,
`original_position` UInt32,
`commit_id` String,
`original_commit_id` String,
`push_size` UInt32,
`push_distinct_size` UInt32,
`member_login` LowCardinality(String),
`release_tag_name` String,
`release_name` String,
`review_state` Enum8('none' = 0, 'approved' = 1, 'changes_requested' = 2, 'commented' = 3, 'dismissed' = 4, 'pending' = 5)
)
ENGINE = MergeTree
ORDER BY (event_type, repo_name, created_at)
SETTINGS disk = disk(type = web, endpoint = 'http://clickhouse-public-datasets.s3.amazonaws.com/web/')
CREATE TABLE github_events_all ON CLUSTER '{cluster}' AS github_events ENGINE = Distributed('{cluster}', default, github_events)
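Before benchmarking, a quick sanity check (a minimal sketch, not part of the original report; it assumes the same '{cluster}' macro and the default database used in the DDL above) confirms the web table is attached and readable on every replica:

SELECT
    hostName() AS host,
    count() AS row_count
FROM clusterAllReplicas('{cluster}', default.github_events)  -- assumes the 'default' database, as in the Distributed DDL above
GROUP BY host
ORDER BY host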
Single node:
SELECT uniq(repo_name)
FROM github_events
SETTINGS max_parallel_replicas = 1, allow_experimental_parallel_reading_from_replicas = 0, prefer_localhost_replica = 1
Query id: bdd4b17d-57b2-456e-ade5-3447e912c992
┌─uniq(repo_name)─┐
│       262270450 │
└─────────────────┘
1 row in set. Elapsed: 46.513 sec. Processed 5.06 billion rows, 10.09 GB (108.77 million rows/s., 216.98 MB/s.)
Distributed table, 6 replicas, 1 shard:
SELECT uniq(repo_name)
FROM github_events_all
SETTINGS max_parallel_replicas = 6, allow_experimental_parallel_reading_from_replicas = 1, prefer_localhost_replica = 0
Query id: 855452b0-ed4a-4742-a288-dcf6dcb4e9b6
┌─uniq(repo_name)─┐
│       262270450 │
└─────────────────┘
1 row in set. Elapsed: 71.592 sec. Processed 30.36 billion rows, 60.55 GB (424.02 million rows/s., 845.84 MB/s.)
You can see it processes 6× as many rows and bytes (30.36 billion vs. 5.06 billion rows) and takes substantially longer, consistent with each of the six replicas scanning the entire table instead of a share of it.
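To verify that the extra rows come from every replica scanning the whole table (a diagnostic sketch, assuming query_log is enabled on all nodes; the initial_query_id is the one from the run above):

SELECT
    hostName() AS host,
    read_rows,
    read_bytes
FROM clusterAllReplicas('{cluster}', system.query_log)
WHERE initial_query_id = '855452b0-ed4a-4742-a288-dcf6dcb4e9b6'  -- query id of the distributed run above
    AND type = 'QueryFinish'
ORDER BY host

If the work were distributed, each replica's read_rows would be roughly 5.06 billion / 6; the 30.36 billion total above suggests each replica instead reports the full ~5.06 billion.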
Expected behavior
The work should be divided among the replicas, and the query should be faster than on a single node.
Additional context
I had to use a Distributed table; parallel replicas were not engaged when querying the underlying table directly. I also had to provide those settings inline on every query, as setting them with SET did not cause the query to use all nodes.
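For reference, this is the SET variant that did not engage all the nodes for me (a sketch of the session; only the inline SETTINGS clause shown above had the intended effect):

SET max_parallel_replicas = 6;
SET allow_experimental_parallel_reading_from_replicas = 1;
SET prefer_localhost_replica = 0;

-- still executed on a single node despite the session settings
SELECT uniq(repo_name) FROM github_events_all;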
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 24 (20 by maintainers)
Also keep in mind that you have to create a VPC endpoint for S3 in AWS; otherwise, the total network throughput will be limited to 100 Gbit/sec.