ClickHouse: Kafka consumer stops - cppkafka::HandleException Local: Timed out

Describe the bug ClickHouse Kafka consumer stops consuming from Kafka, and restarts after some time (from 0 up to 2 hours)… Restart of ClickHouse docker container helps imediately. Problem started on 19.9.2.4, works ok on 19.8.3.8. Configuration (.conf and tables) are same for both versions.

How to reproduce problem is here - official docker image: yandex/clickhouse-server:19.9.2.4 and this works ok: official docker image: yandex/clickhouse-server:19.8.3.8

Running on single server (no sharding/replication). Total 20 kafka topic->mv->table definitions. Pulling from 2 kafka servers with max 2 partitions per topic. Kafka engine with num_consumers: 1

ENGINE = Kafka SETTINGS kafka_broker_list = '', kafka_topic_list = 'eventdata', kafka_group_name = '', kafka_format = 'CSV', kafka_num_consumers = 1,kafka_row_delimiter = '\n',kafka_skip_broken_messages = 1; kafka broker list defined in kafka.conf (see below)

Expected behavior Same behavior after CH version change.

Error message and/or stacktrace clickhouse-server.err.log

019.07.03 07:11:12.581555 [ 1 ] {} <Error> Application: Listen [::]:8123 failed: 0: DNS error: EAI: -9  If it is an IPv6 or IPv4 address and your host has disabled IPv6 or IPv4, then consider to specify not disabled IPv4 or IPv6 address to listen in <listen_host> element of configuration file. Example for disabled IPv6: <listen_host>0.0.0.0</listen_host> . Example for disabled IPv4: <listen_host>::</listen_host>
2019.07.03 07:26:32.664575 [ 39 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.9.2.4 (official build)
2019.07.03 07:27:23.068116 [ 43 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.9.2.4 (official build)
2019.07.03 07:33:12.821698 [ 39 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.9.2.4 (official build)

Additional context Add any other context about the problem here.

clickhouse-server.log


2019.07.03 07:26:32.309536 [ 40 ] {} <Trace> StorageKafka (imps_raw_queueraw): Polled batch of 165 messages
2019.07.03 07:26:32.318413 [ 43 ] {} <Trace> StorageKafka (dsprtbbids_raw_queueraw): Polled batch of 92 messages
2019.07.03 07:26:32.318422 [ 41 ] {} <Trace> StorageKafka (prebiddetaileventdata_queueraw): Polled batch of 4821 messages
2019.07.03 07:26:32.407451 [ 32 ] {} <Debug> StorageKafka (dspimpsvisibility_raw_queueraw): Started streaming to 1 attached views
2019.07.03 07:26:32.407450 [ 34 ] {} <Debug> StorageKafka (dspsegments_raw_queueraw): Started streaming to 1 attached views
2019.07.03 07:26:32.408022 [ 35 ] {} <Trace> StorageKafka (dspconversions_attributed_raw_queueraw): Polled batch of 0 messages
2019.07.03 07:26:32.408072 [ 35 ] {} <Trace> BlockInputStreamFromRowInputStream: Skipped 0 rows with errors while reading the input stream
2019.07.03 07:26:32.448213 [ 31 ] {} <Trace> StorageKafka (ssprtbbids_raw_queueraw): Polled batch of 198 messages
2019.07.03 07:26:32.469842 [ 35 ] {} <Trace> StorageKafka (dspconversions_attributed_raw_queueraw): Committed offset -1001 (topic: dspconversions_attributed_raw, partition: 0)
2019.07.03 07:26:32.469886 [ 35 ] {} <Trace> StorageKafka (dspconversions_attributed_raw_queueraw): Committed offset -1001 (topic: dspconversions_attributed_raw, partition: 1)
2019.07.03 07:26:32.470060 [ 35 ] {} <Trace> StorageKafka (dspconversions_attributed_raw_queueraw): Execution took 562 ms.
2019.07.03 07:26:32.470127 [ 35 ] {} <Debug> StorageKafka (dsprtbcampaigneligibility_raw_queueraw): Started streaming to 1 attached views
2019.07.03 07:26:32.483302 [ 33 ] {} <Trace> StorageKafka (ssprtbrequests_raw_queueraw): Polled batch of 469 messages
2019.07.03 07:26:32.518467 [ 38 ] {} <Trace> StorageKafka (imps_ml_raw_queueraw): Polled batch of 392 messages
2019.07.03 07:26:32.564322 [ 30 ] {} <Trace> StorageKafka (sspimps_raw_queueraw): Polled batch of 267 messages
2019.07.03 07:26:32.620490 [ 42 ] {} <Trace> StorageKafka (ssprtbadirequests_raw_queueraw): Polled batch of 126 messages
2019.07.03 07:26:32.664157 [ 39 ] {} <Trace> StorageKafka (clks_raw_queueraw): Re-joining claimed consumer after failure
2019.07.03 07:26:32.664575 [ 39 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.9.2.4 (official build)
2019.07.03 07:26:32.664677 [ 39 ] {} <Trace> StorageKafka (clks_raw_queueraw): Execution took 6011 ms.
2019.07.03 07:26:32.734845 [ 36 ] {} <Trace> StorageKafka (dspshop_events_raw_queueraw): Committed offset 4563474446 (topic: dspshop_events_raw, partition: 0)
2019.07.03 07:26:32.734883 [ 36 ] {} <Trace> StorageKafka (dspshop_events_raw_queueraw): Committed offset 4479299920 (topic: dspshop_events_raw, partition: 1)
2019.07.03 07:26:32.735971 [ 36 ] {} <Trace> InterpreterSelectQuery: FetchColumns -> Complete
2019.07.03 07:26:32.742137 [ 36 ] {} <Trace> test5dsp.dspshop_events_raw_raw: Renaming temporary part tmp_insert_20190703_285_285_0 to 20190703_382482_382482_0.
2019.07.03 07:26:32.743435 [ 36 ] {} <Trace> InterpreterSelectQuery: FetchColumns -> Complete
2019.07.03 07:26:32.743566 [ 36 ] {} <Trace> Aggregator: Aggregating
2019.07.03 07:26:32.743632 [ 36 ] {} <Trace> Aggregator: Aggregation method: serialized
2019.07.03 07:26:32.743669 [ 36 ] {} <Trace> Aggregator: Aggregated. 10 to 7 rows (from 0.001 MiB) in 0.000 sec. (129720.194 rows/sec., 18.705 MiB/sec.)

docker log with Kafka debug all:

{"log":"%7|1562138792.662|RECV|ClickHouse 19.9.2.4#consumer-6| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Received FetchResponse (v4, 73 bytes, CorrId 8515, rtt 202.16ms)\n","stream":"stderr","t
ime":"2019-07-03T07:26:32.663047993Z"}
{"log":"%7|1562138792.662|FETCH|ClickHouse 19.9.2.4#consumer-6| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Topic dsprtbcampaigneligibility_raw [1] MessageSet size 0, error \"Success\", MaxOffset
 21257146, Ver 6/6\n","stream":"stderr","time":"2019-07-03T07:26:32.663054736Z"}
{"log":"%7|1562138792.662|FETCH|ClickHouse 19.9.2.4#consumer-6| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Fetch topic dsprtbcampaigneligibility_raw [1] at offset 21257146 (v6)\n","stream":"stde
rr","time":"2019-07-03T07:26:32.663061956Z"}
{"log":"%7|1562138792.662|FETCH|ClickHouse 19.9.2.4#consumer-6| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Fetch 1/1/1 toppar(s)\n","stream":"stderr","time":"2019-07-03T07:26:32.663069107Z"}
{"log":"%7|1562138792.663|SEND|ClickHouse 19.9.2.4#consumer-6| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Sent FetchRequest (v4, 105 bytes @ 0, CorrId 8516)\n","stream":"stderr","time":"2019-07-
03T07:26:32.663103313Z"}
{"log":"%7|1562138792.663|RECV|ClickHouse 19.9.2.4#consumer-14| [thrd:GroupCoordinator]: GroupCoordinator/2: Received HeartbeatResponse (v0, 2 bytes, CorrId 539, rtt 100.45ms)\n","stream":"stderr","time":
"2019-07-03T07:26:32.663254983Z"}
{"log":"%7|1562138792.663|RECV|ClickHouse 19.9.2.4#consumer-18| [thrd:GroupCoordinator]: GroupCoordinator/2: Received HeartbeatResponse (v0, 2 bytes, CorrId 535, rtt 92.35ms)\n","stream":"stderr","time":"
2019-07-03T07:26:32.663423782Z"}
{"log":"%7|1562138792.664|CGRPOP|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Group \"chraw\" received op SUBSCRIBE (v0) in state up (join state started, v9 vs 0)\n","stream":"stderr","time":"2019-07-03T0
7:26:32.664361626Z"}
{"log":"%7|1562138792.664|SUBSCRIBE|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Group \"chraw\": subscribe to new unset subscription of 0 topics (join state started)\n","stream":"stderr","time":"2019-07-
03T07:26:32.664374526Z"}
{"log":"%7|1562138792.664|UNSUBSCRIBE|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Group \"chraw\": unsubscribe from current subscription of 1 topics (leave group=yes, join state started, v9)\n","stream":
"stderr","time":"2019-07-03T07:26:32.664378946Z"}
{"log":"%7|1562138792.664|SUBSCRIPTION|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Group \"chraw\": clearing subscribed topics list (1)\n","stream":"stderr","time":"2019-07-03T07:26:32.66438313Z"}
{"log":"%7|1562138792.664|SUBSCRIPTION|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Group \"chraw\": effective subscription list changed from 1 to 0 topic(s):\n","stream":"stderr","time":"2019-07-03T07:26
:32.664387171Z"}
{"log":"%7|1562138792.664|GRPLEADER|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Group \"chraw\": resetting group leader info: unsubscribe\n","stream":"stderr","time":"2019-07-03T07:26:32.664390882Z"}
{"log":"%7|1562138792.664|REBALANCE|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Group \"chraw\" is rebalancing in state up (join-state started) with assignment: unsubscribe\n","stream":"stderr","time":"2
019-07-03T07:26:32.664394693Z"}
{"log":"%7|1562138792.664|PAUSE|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Library pausing 2 partition(s)\n","stream":"stderr","time":"2019-07-03T07:26:32.664398486Z"}
{"log":"%7|1562138792.664|BARRIER|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: clks_raw [0]: rd_kafka_toppar_op_pause_resume:2311: new version barrier v7\n","stream":"stderr","time":"2019-07-03T07:26:32.6
64401915Z"}
{"log":"%7|1562138792.664|PAUSE|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Pause clks_raw [0] (v7)\n","stream":"stderr","time":"2019-07-03T07:26:32.664405503Z"}
{"log":"%7|1562138792.664|BARRIER|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: clks_raw [1]: rd_kafka_toppar_op_pause_resume:2311: new version barrier v7\n","stream":"stderr","time":"2019-07-03T07:26:32.6
64409703Z"}
{"log":"%7|1562138792.664|PAUSE|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Pause clks_raw [1] (v7)\n","stream":"stderr","time":"2019-07-03T07:26:32.664418313Z"}
{"log":"%7|1562138792.664|ASSIGN|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Group \"chraw\": delegating revoke of 2 partition(s) to application rebalance callback on queue rd_kafka_cgrp_new: unsubscribe
\n","stream":"stderr","time":"2019-07-03T07:26:32.664429672Z"}
{"log":"%7|1562138792.664|CGRPJOINSTATE|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Group \"chraw\" changed join state started -\u003e wait-revoke-rebalance_cb (v9, state up)\n","stream":"stderr","time":
"2019-07-03T07:26:32.664436841Z"}
{"log":"%7|1562138792.664|OP|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: clks_raw [0] received op PAUSE (v7) in fetch-state active (opv6)\n","stream":"stderr","time":"2019-07-03T07:26:32.664446775Z"}
{"log":"%7|1562138792.664|PAUSE|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Pause clks_raw [0]: at offset 32000421 (state active, v7)\n","stream":"stderr","time":"2019-07-03T07:26:32.664481843Z"}
{"log":"%7|1562138792.664|OP|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: clks_raw [1] received op PAUSE (v7) in fetch-state active (opv6)\n","stream":"stderr","time":"2019-07-03T07:26:32.664492452Z"}
{"log":"%7|1562138792.664|PAUSE|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Pause clks_raw [1]: at offset 31809479 (state active, v7)\n","stream":"stderr","time":"2019-07-03T07:26:32.66449908Z"}
{"log":"%7|1562138792.666|RECV|ClickHouse 19.9.2.4#consumer-8| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Received FetchResponse (v4, 3160 bytes, CorrId 22694, rtt 146.84ms)\n","stream":"stderr"
,"time":"2019-07-03T07:26:32.666724972Z"}
{"log":"%7|1562138792.666|FETCH|ClickHouse 19.9.2.4#consumer-8| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Topic dsprtbrequests_raw [1] MessageSet size 3098, error \"Success\", MaxOffset 1188248
3842, Ver 6/6\n","stream":"stderr","time":"2019-07-03T07:26:32.666747395Z"}
{"log":"%7|1562138792.666|CONSUME|ClickHouse 19.9.2.4#consumer-8| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Enqueue 114 message(s) (18571 bytes, 114 ops) on dsprtbrequests_raw [1] fetch queue (
qlen 114, v6, last_offset 11882483841, 0 ctrl msgs)\n","stream":"stderr","time":"2019-07-03T07:26:32.666837521Z"}
{"log":"%7|1562138792.666|FETCH|ClickHouse 19.9.2.4#consumer-8| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Fetch topic dsprtbrequests_raw [1] at offset 11882483842 (v6)\n","stream":"stderr","tim
e":"2019-07-03T07:26:32.666870301Z"}
{"log":"%7|1562138792.666|FETCH|ClickHouse 19.9.2.4#consumer-8| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Fetch 1/1/1 toppar(s)\n","stream":"stderr","time":"2019-07-03T07:26:32.666878934Z"}
{"log":"%7|1562138792.666|SEND|ClickHouse 19.9.2.4#consumer-8| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Sent FetchRequest (v4, 94 bytes @ 0, CorrId 22695)\n","stream":"stderr","time":"2019-07-
03T07:26:32.666887044Z"}
{"log":"%7|1562138792.666|RECV|ClickHouse 19.9.2.4#consumer-7| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Received FetchResponse (v4, 10871 bytes, CorrId 11928, rtt 197.33ms)\n","stream":"stderr
","time":"2019-07-03T07:26:32.667089128Z"}
{"log":"%7|1562138792.667|FETCH|ClickHouse 19.9.2.4#consumer-7| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Topic dsprtbnotify_raw [0] MessageSet size 10811, error \"Success\", MaxOffset 59520521
68, Ver 6/6\n","stream":"stderr","time":"2019-07-03T07:26:32.667107892Z"}
{"log":"%7|1562138792.667|CONSUME|ClickHouse 19.9.2.4#consumer-7| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Enqueue 294 message(s) (36662 bytes, 294 ops) on dsprtbnotify_raw [0] fetch queue (ql
en 294, v6, last_offset 5952052167, 0 ctrl msgs)\n","stream":"stderr","time":"2019-07-03T07:26:32.667421414Z"}

current kafka.xml

<?xml version="1.0"?>
<yandex>
    <kafka>
        <auto_offset_reset>earliest</auto_offset_reset>
        <bootstrap_servers>ct1.static.xxx.xx:9092,ct2.static.xxxxx.xx:9092</bootstrap_servers>
        <group_id>chraw</group_id>
        <debug>all</debug>
    </kafka>
</yandex>

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 18 (7 by maintainers)

Most upvoted comments

Looks very similar to #4631 and should be fixed in v19.11+ - on v19.11 I can’t reproduce the issue locally. Also found and fixed in v19.11 the scenario when Kafka queue becomes empty for some time, but afterwards the CH doesn’t read new messages.