ClickHouse: XID becomes negative and cannot be recovered
Environment: two replicas per shard, with two clickhouse-server instances and one clickhouse-keeper instance.
When I use clickhouse-keeper, the clickhouse-server log keeps reporting errors like this:
<Error> void DB::AsynchronousMetrics::run(): Code: 999. Coordination::Exception: XID overflow (Session expired). (KEEPER_EXCEPTION)
About this issue
- State: closed
- Created 2 years ago
- Comments: 24 (18 by maintainers)
Also: I thought that the ZooKeeper protocol was needed only to simplify A/B testing, and that we can now extend it as we want without following the main API. Otherwise we will keep importing all the problems and limitations of ZooKeeper into Keeper.
For example: we can change XID, or we can introduce some higher-level, ClickHouse-specific API calls (such as a single-operation call for things like findReplicaHavingCoveringPart, to avoid silly amounts of traffic between Keeper and ClickHouse; see https://github.com/ClickHouse/ClickHouse/issues/21338).
OK. Probably a false memory. I was under the impression that Keeper was advertised as “XID overflow” free.
@den-crane you confused it with `ZXID`, which is an internal counter and can be changed by Keeper to `int64`, while `XID` is part of the protocol and we have to use `int32` like ZK.
@helifu I agree that it’s not completely harmless, as it will still force some operations to fail. But background operations like `merges` will finish when a connection is established again, without user intervention. Inserts can be harder to handle correctly in those cases, but we introduced `insert_keeper_max_retries` to try to mitigate those problems. In all cases, CH should try to handle reconnects as gracefully as possible, the best-case scenario being the user not even noticing. If you have some specific issues, please feel free to tell us and create an issue.
No, it doesn’t. The error is produced by our internal ZK client, which will reconnect after some time. Also, `AsynchronousMetrics` simply collects some information in the background and doesn’t affect other operations.
From 22.10+ you should see more reactive behavior for those kinds of issues and not have 60 seconds of expired logs.
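To make the explanation above concrete, here is a minimal Python sketch of why a 32-bit `XID` eventually runs out and why reconnecting plus retrying recovers from it. It is a toy model, not ClickHouse's or Keeper's actual client code: `ToySession`, `ToyClient`, `SessionExpired`, `request_with_retries`, and `seconds_until_overflow` are hypothetical names, and the exact point at which the real client reports the overflow is an assumption.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit XID field can hold


class SessionExpired(Exception):
    """Hypothetical stand-in for 'XID overflow (Session expired)'."""


class ToySession:
    """Toy model of one client session with a per-request XID counter."""

    def __init__(self, start_xid=1):
        self.next_xid = start_xid

    def request(self, op):
        # Assumption: the session refuses to hand out further XIDs once
        # the counter reaches the int32 limit, instead of wrapping negative.
        if self.next_xid >= INT32_MAX:
            raise SessionExpired("XID overflow")
        xid = self.next_xid
        self.next_xid += 1
        return (xid, op)


class ToyClient:
    """Toy client that opens a fresh session (XID restarts) on expiry."""

    def __init__(self, start_xid=1):
        self.session = ToySession(start_xid)
        self.reconnects = 0

    def reconnect(self):
        self.session = ToySession()  # a new session starts counting again
        self.reconnects += 1

    def request_with_retries(self, op, max_retries=3):
        # Same idea as a bounded retry setting: retry the operation a
        # limited number of times, reconnecting between attempts.
        for attempt in range(max_retries + 1):
            try:
                return self.session.request(op)
            except SessionExpired:
                if attempt == max_retries:
                    raise
                self.reconnect()


def seconds_until_overflow(requests_per_second):
    """Time for one session to exhaust its XIDs at a steady request rate."""
    return INT32_MAX / requests_per_second
```

Under these assumptions, a session sending a steady 1,000 requests per second would exhaust its XIDs after roughly 25 days (`seconds_until_overflow(1000) / 86400`). With retries disabled (`max_retries=0`), the failure surfaces to the caller, which is the situation a bounded retry setting such as `insert_keeper_max_retries` is meant to soften for inserts.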