ClickHouse: ClickHouse Keeper v22.7 crashing on every startup with segfault (signal 11)

Describe what’s wrong

ClickHouse Keeper crashes during initialization with a segfault. This happens on every start: we’re unable to start the Keeper server, either from existing data or from a fresh data directory; it always crashes after a few minutes of initializing. We’re running a standalone dockerized Keeper instance.

Does it reproduce on recent release?

Yes, we’re using Keeper version v22.7.

How to reproduce

  • Which ClickHouse server version to use:

We have tested the following versions so far, and all crash with the same error:

v22.7.1.2484
v22.7.4.16
v22.8.2.11
---
v22.5.1.2079 
  • Non-default settings, if any:

We tweaked a few settings that may be related, though we haven’t been able to confirm a connection on our side (see the config sketch below the snippet for context):

    <coordination_settings>
      <force_sync>false</force_sync>
      <max_requests_batch_size>2000</max_requests_batch_size>
    </coordination_settings>
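
For context, in a standalone Keeper these settings sit under <keeper_server> in the Keeper configuration file. Below is a minimal sketch of that surrounding config; the values shown (server ID, ports, raft_configuration hosts) are illustrative placeholders, not our actual setup:

    <clickhouse>
      <keeper_server>
        <tcp_port>9181</tcp_port>
        <server_id>1</server_id>
        <log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
        <snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>
        <coordination_settings>
          <force_sync>false</force_sync>
          <max_requests_batch_size>2000</max_requests_batch_size>
        </coordination_settings>
        <raft_configuration>
          <!-- one <server> entry per Keeper node in the cluster -->
          <server>
            <id>1</id>
            <hostname>keeper-1</hostname>
            <port>9234</port>
          </server>
        </raft_configuration>
      </keeper_server>
    </clickhouse>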

Expected behavior

We don’t expect the process to crash.

Error message and/or stacktrace

successfully receive a snapshot (idx 4149762924 term 335) from leader
Compact logs up to log index 4149762924, our max log id is 4148717923
Seems like this node recovers from leaders snapshot, removing all logs
Removing changelog /var/lib/clickhouse/coordination/log/changelog_4148562925_4148662924.bin.zstd because of compaction
Trying to remove log /var/lib/clickhouse/coordination/log/changelog_4148662925_4148762924.bin.zstd which is current active log for write. Possibly this node recovers from snapshot
Removing changelog /var/lib/clickhouse/coordination/log/changelog_4148662925_4148762924.bin.zstd because of compaction
Removed changelog /var/lib/clickhouse/coordination/log/changelog_4148562925_4148662924.bin.zstd because of compaction.
Removed changelog /var/lib/clickhouse/coordination/log/changelog_4148662925_4148762924.bin.zstd because of compaction.
Compaction up to 4149762924 finished new min index 4149762925, new max index 4149762924
successfully compact the log store, will now ask the statemachine to apply the snapshot
########################################
(version 22.7.1.2484 (official build), build id: BB14295F0BE31ECF) (from thread 66) (no query) Received signal Segmentation fault (11)
Address: NULL pointer. Access: read. Address not mapped to object.
Stack trace: 0xa5860d
0. ? @ 0xa5860d in /usr/bin/clickhouse-keeper
Integrity check of the executable skipped because the reference checksum could not be read. (calculated checksum: E4590F1FEA25C5B140060D818924BBD1)

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 18 (18 by maintainers)

Most upvoted comments

After a more detailed look, we concluded that the problem described in this issue is solved by https://github.com/ClickHouse/ClickHouse/pull/40627

The remaining problem we had was related to the performance of Keeper itself. Even with some performance improvements, the follower still could not catch up with the cluster (the cluster processes requests at a much faster pace than the follower can sync), which requires a closer look.

I wonder: could it be that this new instance is trying to “catch up” but fails to do so quickly enough for some reason, and then a different mechanism kicks in (“Seems like this node recovers from leaders snapshot”), which leads to the segfault in this case?

@gyfis It seems you got it right. The node is trying to commit local logs on startup, but while doing that the leader decides that it needs to apply a snapshot. Applying the snapshot triggers compaction, which deletes logs. Committing is done in a background thread, so the background commit thread tries to fetch a log entry that has already been deleted and gets a nullptr.
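
For intuition, here is a minimal, self-contained C++ sketch of that interaction (not ClickHouse or NuRaft code, just the pattern it describes): a background commit thread fetches log entries by index while snapshot installation compacts them away. Without the nullptr check, the commit thread would dereference a deleted entry, which is the crash reported above.

    // Minimal sketch of the race described above: a background "commit" thread
    // reads log entries by index while snapshot installation compacts (deletes)
    // the same entries. Names and structures here are illustrative only.
    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <memory>
    #include <mutex>
    #include <thread>

    struct LogEntry { int term = 0; };

    struct LogStore {
        std::mutex mtx;
        std::map<uint64_t, std::shared_ptr<LogEntry>> entries;

        // Returns nullptr if the entry was already removed by compaction.
        std::shared_ptr<LogEntry> entry_at(uint64_t idx) {
            std::lock_guard<std::mutex> lock(mtx);
            auto it = entries.find(idx);
            return it == entries.end() ? nullptr : it->second;
        }

        // Snapshot installation compacts everything up to and including `up_to`.
        void compact(uint64_t up_to) {
            std::lock_guard<std::mutex> lock(mtx);
            entries.erase(entries.begin(), entries.upper_bound(up_to));
        }
    };

    int main() {
        LogStore store;
        for (uint64_t i = 1; i <= 1000; ++i)
            store.entries[i] = std::make_shared<LogEntry>();

        // Background commit thread: replays local logs on startup.
        std::thread committer([&] {
            for (uint64_t i = 1; i <= 1000; ++i) {
                auto entry = store.entry_at(i);
                if (!entry) {
                    // Without this check, `entry->term` below would dereference
                    // a nullptr once the entry has been compacted away: the
                    // segfault from the report.
                    std::printf("entry %llu already compacted, stopping\n",
                                (unsigned long long) i);
                    return;
                }
                (void) entry->term;
            }
        });

        // Meanwhile, a leader-driven snapshot install compacts the log.
        store.compact(1000);
        committer.join();
        return 0;
    }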

I still can’t reproduce it locally, so this is a really good catch. Thanks for reporting! We need to figure out the best way of handling it.

@gyfis can you tar/gzip the Keeper data files (snapshots + logs) and share them (or send them) to support@clickhouse.com?

Yep, just sent it over. Thanks a lot!
