ClickHouse: ClickHouse Keeper v22.7 crashing on every startup with segfault 11
Describe what’s wrong
ClickHouse Keeper crashes during initialization with a segfault. This repeats on every start: we’re unable to start the Keeper server, either from existing data or from fresh data; it always crashes after a few minutes of initializing. We’re running a standalone dockerized Keeper instance.
Does it reproduce on recent release?
Yes, we’re using Keeper v22.7.
How to reproduce
- Which ClickHouse server version to use:
We tested these versions (so far) and all crash with the same error:
- v22.7.1.2484
- v22.7.4.16
- v22.8.2.11
---
- v22.5.1.2079
- Non-default settings, if any:
We tweaked some settings that may be related, though it’s inconclusive on our side (a fuller sketch of where these settings live follows the snippet):
<coordination_settings>
<force_sync>false</force_sync>
<max_requests_batch_size>2000</max_requests_batch_size>
</coordination_settings>
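For context, the snippet above sits under <keeper_server> in the Keeper configuration. Below is a minimal standalone sketch, assuming default ports and paths; only the two coordination settings above differ from defaults, and the single-server raft_configuration is purely illustrative:

<clickhouse>
    <keeper_server>
        <tcp_port>9181</tcp_port>
        <server_id>1</server_id>
        <log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
        <snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>
        <coordination_settings>
            <force_sync>false</force_sync>
            <max_requests_batch_size>2000</max_requests_batch_size>
        </coordination_settings>
        <raft_configuration>
            <server>
                <id>1</id>
                <hostname>localhost</hostname>
                <port>9234</port>
            </server>
        </raft_configuration>
    </keeper_server>
</clickhouse>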
Expected behavior
We don’t expect the process to crash.
Error message and/or stacktrace
successfully receive a snapshot (idx 4149762924 term 335) from leader
Compact logs up to log index 4149762924, our max log id is 4148717923
Seems like this node recovers from leaders snapshot, removing all logs
Removing changelog /var/lib/clickhouse/coordination/log/changelog_4148562925_4148662924.bin.zstd because of compaction
Trying to remove log /var/lib/clickhouse/coordination/log/changelog_4148662925_4148762924.bin.zstd which is current active log for write. Possibly this node recovers from snapshot
Removing changelog /var/lib/clickhouse/coordination/log/changelog_4148662925_4148762924.bin.zstd because of compaction
Removed changelog /var/lib/clickhouse/coordination/log/changelog_4148562925_4148662924.bin.zstd because of compaction.
Removed changelog /var/lib/clickhouse/coordination/log/changelog_4148662925_4148762924.bin.zstd because of compaction.
Compaction up to 4149762924 finished new min index 4149762925, new max index 4149762924
successfully compact the log store, will now ask the statemachine to apply the snapshot
########################################
(version 22.7.1.2484 (official build), build id: BB14295F0BE31ECF) (from thread 66) (no query) Received signal Segmentation fault (11)
Address: NULL pointer. Access: read. Address not mapped to object.
Stack trace: 0xa5860d
0. ? @ 0xa5860d in /usr/bin/clickhouse-keeper
Integrity check of the executable skipped because the reference checksum could not be read. (calculated checksum: E4590F1FEA25C5B140060D818924BBD1)
About this issue
- State: closed
- Created 2 years ago
- Comments: 18 (18 by maintainers)
After a more detailed look, we concluded that the problem described in this issue is solved by https://github.com/ClickHouse/ClickHouse/pull/40627
A further problem we had was related to the performance of Keeper itself. Even with some performance improvements, the follower was still unable to catch up with the cluster (the cluster processes requests at a much faster pace than the follower can sync), which requires a closer look.
@gyfis It seems you got it right. The node is trying to commit local logs on startup, but while it does so the leader decides that the node needs to apply a snapshot. Applying the snapshot triggers compaction, which deletes logs. Committing is done in a background thread, so the background commit thread tries to fetch a log entry that was just deleted and gets a nullptr.
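In other words, it’s a race between snapshot-driven log compaction and the background commit loop. Below is a minimal C++ sketch of the pattern, with hypothetical names (Changelog, entry_at, compact); the real code lives in Keeper’s log store and NuRaft, not here:

#include <cstdint>
#include <iostream>
#include <map>
#include <memory>
#include <mutex>

struct LogEntry
{
    uint64_t index;
    // ... raft payload ...
};

// Hypothetical in-memory changelog keyed by log index.
class Changelog
{
public:
    void append(uint64_t index)
    {
        std::lock_guard lock(mutex);
        logs.emplace(index, std::make_shared<LogEntry>(LogEntry{index}));
    }

    // Compaction after installing a leader snapshot: drop everything up to
    // and including up_to_index, even the currently active log.
    void compact(uint64_t up_to_index)
    {
        std::lock_guard lock(mutex);
        logs.erase(logs.begin(), logs.upper_bound(up_to_index));
    }

    // Returns nullptr when the entry was already compacted away.
    std::shared_ptr<LogEntry> entry_at(uint64_t index)
    {
        std::lock_guard lock(mutex);
        auto it = logs.find(index);
        return it == logs.end() ? nullptr : it->second;
    }

private:
    std::mutex mutex;
    std::map<uint64_t, std::shared_ptr<LogEntry>> logs;
};

int main()
{
    Changelog changelog;
    changelog.append(4148717923);  // local log the node still wants to commit

    // Startup: the leader's snapshot covers a newer index, so the state
    // machine compacts the log store up to that index.
    changelog.compact(4149762924);

    // Meanwhile the background commit thread fetches the old entry:
    auto entry = changelog.entry_at(4148717923);
    std::cout << "entry is " << (entry ? "valid" : "nullptr") << '\n';

    // Dereferencing without a check is the NULL-pointer read from the
    // stack trace:
    // apply(entry->index);  // segfault: entry == nullptr here
}

Checking the fetched pointer, or preventing compaction of entries still pending commit, would avoid the crash; the actual fix is in the PR referenced above.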
I still can’t reproduce it locally, so this is a really good catch. Thanks for reporting! We need to figure out the best way to handle it.
@gyfis Can you tar/gzip the files with Keeper data (snapshots + logs) and send them to support@clickhouse.com?
Yep, just sent it over. Thanks a lot!