ClickHouse: XID becomes negative and cannot be recovered

Environment: two replicas per shard, with two clickhouse-server instances and one clickhouse-keeper.

When I use clickhouse-keeper, the clickhouse-server log keeps reporting errors like this: <Error> void DB::AsynchronousMetrics::run(): Code: 999. Coordination::Exception: XID overflow (Session expired). (KEEPER_EXCEPTION)

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 24 (18 by maintainers)

Most upvoted comments

Also: I thought that the ZooKeeper protocol was needed only to simplify A/B testing, and that we can now extend it as we want without following the main API. Otherwise we will keep importing all the problems and limitations of ZooKeeper into Keeper.

For example: we can change XID, or we can introduce some higher-level, ClickHouse-specific API calls (like a single-operation call to do things like findReplicaHavingCoveringPart, to avoid silly amounts of traffic exchanged between Keeper and ClickHouse - see https://github.com/ClickHouse/ClickHouse/issues/21338)

OK. Probably a false memory. I was under the impression that Keeper was advertised as “XID overflow” free.

@den-crane you confused it with ZXID, which is an internal counter and can be changed to int64 by Keeper, while XID is part of the protocol, so we have to use int32 like ZK.
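
To make the int32 limit concrete, here is a minimal sketch of signed 32-bit counter arithmetic. It is not the client's actual code: `next_xid`, `seconds_until_overflow`, and the 1000 requests per second rate are hypothetical names and numbers chosen for illustration, and it assumes one XID is consumed per request.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit XID can hold


def next_xid(xid: int) -> int:
    """Return the next request ID, wrapping like a signed 32-bit counter."""
    nxt = xid + 1
    if nxt > INT32_MAX:
        nxt -= 2**32  # wraps around to the most negative int32 value
    return nxt


def seconds_until_overflow(requests_per_second: float, start_xid: int = 1) -> float:
    """Time one session can run before its XID would pass INT32_MAX."""
    return (INT32_MAX - start_xid) / requests_per_second


# Incrementing past INT32_MAX yields a negative XID.
print(next_xid(INT32_MAX))  # -2147483648

# At an assumed, hypothetical rate of 1000 requests per second,
# a single session exhausts the int32 range in a bit under 25 days.
print(seconds_until_overflow(1000) / 86400)
```

A busier session reaches the limit proportionally sooner, which is why a long-lived connection eventually has to be re-established.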

@helifu I agree that it’s not completely harmless, as it will still force some operations to fail. But background operations like merges will finish once a connection is established again, without user intervention. Inserts can be harder to handle correctly in those cases, but we introduced insert_keeper_max_retries to try to mitigate those problems. In all cases, CH should try to handle reconnects as gracefully as possible, the best-case scenario being the user not even noticing. If you have some specific issues, please feel free to tell us and create an issue.
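
The general idea behind retrying an operation across a session reconnect can be sketched as follows. This is only an illustration of bounded retries with exponential backoff, not how ClickHouse implements insert_keeper_max_retries; `SessionExpired`, `with_retries`, and all parameter names and values are hypothetical.

```python
import time


class SessionExpired(Exception):
    """Hypothetical stand-in for a Keeper 'session expired' error."""


def with_retries(operation, max_retries=5, initial_backoff=0.1,
                 max_backoff=2.0, sleep=time.sleep):
    """Call operation(); on SessionExpired, wait and retry up to max_retries times."""
    backoff = initial_backoff
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except SessionExpired:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error to the caller
            sleep(backoff)
            backoff = min(backoff * 2, max_backoff)  # exponential backoff, capped


# Usage: an operation that fails twice (as if the session were being
# re-established) and then succeeds on the third attempt.
attempts = {"count": 0}


def flaky_insert():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise SessionExpired("XID overflow (Session expired)")
    return "inserted"


print(with_retries(flaky_insert, initial_backoff=0.01))  # inserted
```

With retries bounded this way, a short reconnect window is absorbed silently, while a prolonged outage still surfaces as an error to the caller.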

Does the error have any negative impact on the CH server?

No, it doesn’t, the error is produced by our internal ZK client which will reconnect after some time. Also, AsynchronousMetrics simply collects some information in the background and it doesn’t affect other operations.

From 22.10+ you should see a more reactive behavior to those kinds of issues and not have 60 seconds of expired logs.