ClickHouse: clickhouse-keeper cluster not working

Describe what’s wrong: I set up a ClickHouse cluster on three nodes and set up clickhouse-keeper on the same nodes, following the official documentation, on three virtual machines. The version is:

ClickHouse server version 23.10.1.1976 (official build)

What does not work as it is supposed to: the result of SELECT * FROM system.zookeeper WHERE path IN ('/', '/clickhouse'); is:

2023.11.06 15:54:29.902052 [ 4478 ] {} <Error> virtual bool DB::DDLWorker::initializeMainThread(): Code: 999. Coordination::Exception: All connection tries failed while connecting to ZooKeeper. nodes: 1.1.110.1:9181, 1.1.110.3:9181, 1.1.110.2:9181 Poco::Exception. Code: 1000, e.code() = 111, Connection refused (version 23.10.1.1976 (official build)), 1.1.110.1:9181 Poco::Exception. Code: 1000, e.code() = 111, Connection refused (version 23.10.1.1976 (official build)), 1.1.110.3:9181 Poco::Exception. Code: 1000, e.code() = 111, Connection refused (version 23.10.1.1976 (official build)), 1.1.110.2:9181 Poco::Exception. Code: 1000, e.code() = 111, Connection refused (version 23.10.1.1976 (official build)), 1.1.110.1:9181 Poco::Exception. Code: 1000, e.code() = 111, Connection refused (version 23.10.1.1976 (official build)), 1.1.110.3:9181 Poco::Exception. Code: 1000, e.code() = 111, Connection refused (version 23.10.1.1976 (official build)), 1.1.110.2:9181 Poco::Exception. Code: 1000, e.code() = 111, Connection refused (version 23.10.1.1976 (official build)), 1.1.110.1:9181 Poco::Exception. Code: 1000, e.code() = 111, Connection refused (version 23.10.1.1976 (official build)), 1.1.110.3:9181 Poco::Exception. Code: 1000, e.code() = 111, Connection refused (version 23.10.1.1976 (official build)), 1.1.110.2:9181 . (KEEPER_EXCEPTION), Stack trace (when copying this message, always include the lines below):

  1. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000cdd11b7 in /usr/bin/clickhouse
  2. DB::Exception::Exception<String&>(int, FormatStringHelperImpl<std::type_identity<String&>::type>, String&) @ 0x00000000079a030d in /usr/bin/clickhouse
  3. Coordination::Exception::Exception<String&>(Coordination::Error, FormatStringHelperImpl<std::type_identity<String&>::type>, String&) @ 0x0000000013018e4a in /usr/bin/clickhouse
  4. Coordination::ZooKeeper::ZooKeeper(std::vector<Coordination::ZooKeeper::Node, std::allocator<Coordination::ZooKeeper::Node>> const&, zkutil::ZooKeeperArgs const&, std::shared_ptr<DB::ZooKeeperLog>) @ 0x00000000141aeda9 in /usr/bin/clickhouse
  5. zkutil::ZooKeeper::init(zkutil::ZooKeeperArgs) @ 0x000000001415f6ab in /usr/bin/clickhouse
  6. zkutil::ZooKeeper::ZooKeeper(Poco::Util::AbstractConfiguration const&, String const&, std::shared_ptr<DB::ZooKeeperLog>) @ 0x0000000014162b33 in /usr/bin/clickhouse
  7. DB::Context::getZooKeeper() const @ 0x0000000011ede78b in /usr/bin/clickhouse
  8. DB::DDLWorker::getAndSetZooKeeper() @ 0x0000000011f60eed in /usr/bin/clickhouse
  9. DB::DDLWorker::initializeMainThread() @ 0x0000000011f7151b in /usr/bin/clickhouse
  10. DB::DDLWorker::runMainThread() @ 0x0000000011f5be20 in /usr/bin/clickhouse
  11. void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true>::ThreadFromGlobalPoolImpl<void (DB::DDLWorker::*)(), DB::DDLWorker*>(void (DB::DDLWorker::*&&)(), DB::DDLWorker*&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0x0000000011f72a8c in /usr/bin/clickhouse
  12. void* std::__thread_proxy[abi:v15000]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void ThreadPoolImpl<std::thread>::scheduleImpl<void>(std::function<void ()>, Priority, std::optional<unsigned long>, bool)::'lambda0'()>>(void*) @ 0x000000000cebc6a7 in /usr/bin/clickhouse
  13. start_thread @ 0x00000000000076db in /lib/x86_64-linux-gnu/libpthread-2.27.so
  14. ? @ 0x000000000012161f in /lib/x86_64-linux-gnu/libc-2.27.so (version 23.10.1.1976 (official build))

Does it reproduce on recent release? Yes, I used the latest version.

Enable crash reporting: there is no crash, but the Keepers are connected to each other. When I run the ss command I get:

    tcp LISTEN 0 64 127.0.0.1:9181 0.0.0.0:* users:(("clickhouse-keep",pid=8749,fd=37))
    tcp LISTEN 0 64 [::1]:9181 [::]:* users:(("clickhouse-keep",pid=8749,fd=36))
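
Note that both sockets are bound only to the loopback addresses (127.0.0.1 and ::1), which matches the Connection refused errors for the 1.1.110.x addresses in the log above.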

How to reproduce: I followed the official documentation.

  • Which ClickHouse server version to use: ClickHouse Keeper version 23.10.1.1976 (official build), ClickHouse server version 23.10.1.1976 (official build)

  • Which interface to use, if it matters: I’m on Ubuntu 18.04

  • Non-default settings, if any

  • CREATE TABLE statements for all tables involved

  • Sample data for all these tables, use clickhouse-obfuscator if necessary

  • Queries to run that lead to unexpected result

Expected behavior


Error message and/or stacktrace


Additional context

My issue is nearly the same as this one, but I don’t use Docker. In addition: there is no network or firewall issue.

The clickhouse-server config is:

    <remote_servers>
        <default>
            <shard>
                <replica>
                    <host>localhost</host>
                    <port>9000</port>
                </replica>
            </shard>
        </default>
        <cluster_2>
            <shard>
                <replica>
                    <host>clickhouse1</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>clickhouse2</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>clickhouse3</host>
                    <port>9000</port>
                </replica>
            </shard>
        </cluster_2>
    </remote_servers>
    <zookeeper>
        <node>
            <host>keeper1</host>
            <port>9181</port>
        </node>
        <node>
            <host>keeper2</host>
            <port>9181</port>
        </node>
        <node>
            <host>keeper3</host>
            <port>9181</port>
        </node>
    </zookeeper>
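
With this zookeeper section in place, the server tries each of the listed Keeper nodes on client port 9181 in turn, which is exactly the loop visible in the All connection tries failed error above.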

and the Keeper config is:

    <keeper_server>
        <listen_host>0.0.0.0</listen_host>
        <tcp_port>9181</tcp_port>
        <server_id>1</server_id>
        <log_storage_path>/var/lib/clickhouse/coordination/logs</log_storage_path>
        <snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>
        <coordination_settings>
            <operation_timeout_ms>10000</operation_timeout_ms>
            <min_session_timeout_ms>10000</min_session_timeout_ms>
            <session_timeout_ms>100000</session_timeout_ms>
            <raft_logs_level>trace</raft_logs_level>
        </coordination_settings>
        <!-- <hostname_checks_enabled>true</hostname_checks_enabled> -->
        <raft_configuration>
            <server>
                <id>1</id>
                <hostname>keeper1</hostname>
                <port>9234</port>
            </server>
            <server>
                <id>2</id>
                <hostname>keeper2</hostname>
                <port>9234</port>
            </server>
            <server>
                <id>3</id>
                <hostname>keeper3</hostname>
                <port>9234</port>
            </server>
        </raft_configuration>
    </keeper_server>
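
Note the placement of listen_host inside keeper_server here: judging by the ss output above and by the fix accepted at the end of this thread, the setting is only picked up at the top level of the config, so the Keeper client port stays bound to the loopback addresses despite the 0.0.0.0 value.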

and /etc/hosts on all nodes is:

    1.1.110.1 clickhouse1 keeper1
    1.1.110.2 clickhouse1 keeper2
    1.1.110.3 clickhouse1 keeper3

Many thanks for helping.

About this issue

  • Original URL
  • State: closed
  • Created 8 months ago
  • Comments: 24 (11 by maintainers)

Most upvoted comments

I will close the issue because the problem is solved.

What does ‘but you would need to separate a bit the storage for Keeper and ClickHouse and make sure different configs are used’ mean?

CH and Keeper use the same config, so if you run the server and a standalone Keeper on the same machine while the same config is accessible to both of them, you risk the server starting its embedded Keeper with the same configuration as the standalone Keeper. You should simply make sure that the config for the standalone Keeper is not in the same folder as the config.xml for the server.
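
As an illustration of that advice, here is a minimal sketch of the separation; the file paths are only examples and are not taken from this thread:

    <!-- e.g. /etc/clickhouse-server/config.xml: the server's config, with no keeper_server section -->
    <clickhouse>
        <!-- remote_servers and zookeeper sections as shown in the question -->
    </clickhouse>

    <!-- e.g. /etc/clickhouse-keeper/keeper_config.xml: the standalone Keeper's config -->
    <clickhouse>
        <listen_host>::</listen_host>
        <!-- keeper_server and raft_configuration as shown in the question -->
    </clickhouse>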

That’s correct. The result of SELECT * FROM system.zookeeper WHERE path IN ('/', '/clickhouse'); is OK now. Many, many thanks for your quick support. I would appreciate it if you added this to the documentation.

You mean I must use 6 machines?

Of course not, but you would need to separate the storage for Keeper and ClickHouse a bit and make sure different configs are used. If you are already running on the same machine, a viable option for testing is to use the embedded Keeper, which will run as part of the CH process.
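
For context (this is general ClickHouse behaviour rather than something spelled out in this thread): the embedded Keeper is enabled by putting the keeper_server section directly into the clickhouse-server config, so the server process itself serves the Keeper protocol and no separate clickhouse-keeper process is needed.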

But nc clickhouse1 9181, nc clickhouse2 9181, … are not working. I’m sure there isn’t any network or firewall issue.

Try setting this in your Keeper config:

    <clickhouse>
        <listen_host>::</listen_host>
    </clickhouse>
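
Read together with the Keeper config from the question, this amounts to placing listen_host at the top level, next to (not inside) keeper_server. Roughly, as a sketch:

    <clickhouse>
        <listen_host>::</listen_host>
        <keeper_server>
            <tcp_port>9181</tcp_port>
            <server_id>1</server_id>
            <!-- log_storage_path, snapshot_storage_path, coordination_settings
                 and raft_configuration as in the question -->
        </keeper_server>
    </clickhouse>

After restarting Keeper on each node, ss should show port 9181 bound to [::] rather than only 127.0.0.1 and ::1, and the SELECT * FROM system.zookeeper WHERE path IN ('/', '/clickhouse'); query from the report should return rows, as the reporter confirmed above.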