strimzi-kafka-operator: Scale up and scale down are not working with KRaft mode
Describe the bug
Scaling Kafka up from 3 pods to 4 does not work and fails with the following error:
STRIMZI_BROKER_ID=0
Preparing truststore for replication listener
Adding /opt/kafka/cluster-ca-certs/ca.crt to truststore /tmp/kafka/cluster.truststore.p12 with alias ca
Certificate was added to keystore
Preparing truststore for replication listener is complete
Looking for the right CA
Found the right CA: /opt/kafka/cluster-ca-certs/ca.crt
Preparing keystore for replication and clienttls listener
Preparing keystore for replication and clienttls listener is complete
Preparing truststore for client authentication
Adding /opt/kafka/client-ca-certs/ca.crt to truststore /tmp/kafka/clients.truststore.p12 with alias ca
Certificate was added to keystore
Preparing truststore for client authentication is complete
Starting Kafka with configuration:
##############################
##############################
# This file is automatically generated by the Strimzi Cluster Operator
# Any changes to this file will be ignored and overwritten!
##############################
##############################
##########
# Broker ID
##########
broker.id=0
node.id=0
##########
# KRaft configuration
##########
process.roles=broker,controller
controller.listener.names=CONTROLPLANE-9090
controller.quorum.voters=0@my-cluster-5261ed90-kafka-0.my-cluster-5261ed90-kafka-brokers.namespace-0.svc.cluster.local:9090,1@my-cluster-5261ed90-kafka-1.my-cluster-5261ed90-kafka-brokers.namespace-0.svc.cluster.local:9090,2@my-cluster-5261ed90-kafka-2.my-cluster-5261ed90-kafka-brokers.namespace-0.svc.cluster.local:9090,3@my-cluster-5261ed90-kafka-3.my-cluster-5261ed90-kafka-brokers.namespace-0.svc.cluster.local:9090
##########
# Kafka message logs configuration
##########
log.dirs=/var/lib/kafka/data/kafka-log0
##########
# Control Plane listener
##########
listener.name.controlplane-9090.ssl.keystore.location=/tmp/kafka/cluster.keystore.p12
listener.name.controlplane-9090.ssl.keystore.password=[hidden]
listener.name.controlplane-9090.ssl.keystore.type=PKCS12
listener.name.controlplane-9090.ssl.truststore.location=/tmp/kafka/cluster.truststore.p12
listener.name.controlplane-9090.ssl.truststore.password=[hidden]
listener.name.controlplane-9090.ssl.truststore.type=PKCS12
listener.name.controlplane-9090.ssl.client.auth=required
##########
# Replication listener
##########
listener.name.replication-9091.ssl.keystore.location=/tmp/kafka/cluster.keystore.p12
listener.name.replication-9091.ssl.keystore.password=[hidden]
listener.name.replication-9091.ssl.keystore.type=PKCS12
listener.name.replication-9091.ssl.truststore.location=/tmp/kafka/cluster.truststore.p12
listener.name.replication-9091.ssl.truststore.password=[hidden]
listener.name.replication-9091.ssl.truststore.type=PKCS12
listener.name.replication-9091.ssl.client.auth=required
##########
# Listener configuration: PLAIN-9092
##########
##########
# Listener configuration: TLS-9093
##########
listener.name.tls-9093.ssl.keystore.location=/tmp/kafka/cluster.keystore.p12
listener.name.tls-9093.ssl.keystore.password=[hidden]
listener.name.tls-9093.ssl.keystore.type=PKCS12
##########
# Common listener configuration
##########
listeners=CONTROLPLANE-9090://0.0.0.0:9090,REPLICATION-9091://0.0.0.0:9091,PLAIN-9092://0.0.0.0:9092,TLS-9093://0.0.0.0:9093
advertised.listeners=REPLICATION-9091://my-cluster-5261ed90-kafka-0.my-cluster-5261ed90-kafka-brokers.namespace-0.svc:9091,PLAIN-9092://my-cluster-5261ed90-kafka-0.my-cluster-5261ed90-kafka-brokers.namespace-0.svc:9092,TLS-9093://my-cluster-5261ed90-kafka-0.my-cluster-5261ed90-kafka-brokers.namespace-0.svc:9093
listener.security.protocol.map=CONTROLPLANE-9090:SSL,REPLICATION-9091:SSL,PLAIN-9092:PLAINTEXT,TLS-9093:SSL
inter.broker.listener.name=REPLICATION-9091
sasl.enabled.mechanisms=
ssl.secure.random.implementation=SHA1PRNG
ssl.endpoint.identification.algorithm=HTTPS
##########
# User provided configuration
##########
default.replication.factor=3
inter.broker.protocol.version=3.2
log.message.format.version=3.2
min.insync.replicas=2
offsets.topic.replication.factor=3
transaction.state.log.min.isr=2
transaction.state.log.replication.factor=3
Kraft storage is already formatted
+ exec /usr/bin/tini -w -e 143 -- /opt/kafka/bin/kafka-server-start.sh /tmp/strimzi.properties
2022-05-25 08:24:32,210 INFO Registered kafka:type=kafka.Log4jController MBean (kafka.utils.Log4jControllerRegistration$) [main]
2022-05-25 08:24:32,623 INFO Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation (org.apache.zookeeper.common.X509Util) [main]
2022-05-25 08:24:32,845 INFO [LogLoader partition=__cluster_metadata-0, dir=/var/lib/kafka/data/kafka-log0] Recovering unflushed segment 0 (kafka.log.LogLoader) [main]
2022-05-25 08:24:32,847 INFO [LogLoader partition=__cluster_metadata-0, dir=/var/lib/kafka/data/kafka-log0] Loading producer state till offset 0 with message format version 2 (kafka.log.UnifiedLog$) [main]
2022-05-25 08:24:32,847 INFO [LogLoader partition=__cluster_metadata-0, dir=/var/lib/kafka/data/kafka-log0] Reloading from producer snapshot and rebuilding producer state from offset 0 (kafka.log.UnifiedLog$) [main]
2022-05-25 08:24:32,849 INFO Deleted producer state snapshot /var/lib/kafka/data/kafka-log0/__cluster_metadata-0/00000000000000000009.snapshot (kafka.log.SnapshotFile) [main]
2022-05-25 08:24:32,851 INFO [LogLoader partition=__cluster_metadata-0, dir=/var/lib/kafka/data/kafka-log0] Producer state recovery took 3ms for snapshot load and 0ms for segment recovery from offset 0 (kafka.log.UnifiedLog$) [main]
2022-05-25 08:24:32,882 INFO [ProducerStateManager partition=__cluster_metadata-0] Wrote producer snapshot at offset 9 with 0 producer ids in 11 ms. (kafka.log.ProducerStateManager) [main]
2022-05-25 08:24:32,916 INFO [LogLoader partition=__cluster_metadata-0, dir=/var/lib/kafka/data/kafka-log0] Loading producer state till offset 9 with message format version 2 (kafka.log.UnifiedLog$) [main]
2022-05-25 08:24:32,916 INFO [LogLoader partition=__cluster_metadata-0, dir=/var/lib/kafka/data/kafka-log0] Reloading from producer snapshot and rebuilding producer state from offset 9 (kafka.log.UnifiedLog$) [main]
2022-05-25 08:24:32,917 INFO [ProducerStateManager partition=__cluster_metadata-0] Loading producer state from snapshot file 'SnapshotFile(/var/lib/kafka/data/kafka-log0/__cluster_metadata-0/00000000000000000009.snapshot,9)' (kafka.log.ProducerStateManager) [main]
2022-05-25 08:24:32,919 INFO [LogLoader partition=__cluster_metadata-0, dir=/var/lib/kafka/data/kafka-log0] Producer state recovery took 3ms for snapshot load and 0ms for segment recovery from offset 9 (kafka.log.UnifiedLog$) [main]
2022-05-25 08:24:33,319 INFO [raft-expiration-reaper]: Starting (kafka.raft.TimingWheelExpirationService$ExpiredOperationReaper) [raft-expiration-reaper]
2022-05-25 08:24:33,519 ERROR Exiting Kafka due to fatal exception (kafka.Kafka$) [main]
java.lang.IllegalStateException: Configured voter set: [0, 1, 2, 3] is different from the voter set read from the state file: [0, 1, 2]. Check if the quorum configuration is up to date, or wipe out the local state file if necessary
at org.apache.kafka.raft.QuorumState.initialize(QuorumState.java:132)
at org.apache.kafka.raft.KafkaRaftClient.initialize(KafkaRaftClient.java:364)
at kafka.raft.KafkaRaftManager.buildRaftClient(RaftManager.scala:203)
at kafka.raft.KafkaRaftManager.<init>(RaftManager.scala:125)
at kafka.server.KafkaRaftServer.<init>(KafkaRaftServer.scala:76)
at kafka.Kafka$.buildServer(Kafka.scala:79)
at kafka.Kafka$.main(Kafka.scala:87)
at kafka.Kafka.main(Kafka.scala)
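The smaller voter set the broker complains about is the one persisted in the KRaft quorum-state file inside the metadata log directory, while the larger set comes from the regenerated controller.quorum.voters property above. As an illustrative sketch (the path follows the log.dirs value above, the container name is assumed to be kafka, and the JSON is a typical shape rather than output captured from this cluster), the mismatch can be inspected directly in the pod:

kubectl exec -n namespace-0 my-cluster-5261ed90-kafka-0 -c kafka -- \
  cat /var/lib/kafka/data/kafka-log0/__cluster_metadata-0/quorum-state
# Illustrative content: the on-disk state still lists only voters 0, 1 and 2,
# while controller.quorum.voters already contains broker 3:
# {"clusterId":"...","leaderId":0,"leaderEpoch":1,"votedId":-1,"appliedOffset":0,
#  "currentVoters":[{"voterId":0},{"voterId":1},{"voterId":2}],"data_version":0}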
To Reproduce
Steps to reproduce the behavior:
- Set up the Cluster Operator with KRaft enabled
- Create a Kafka CR with 3 replicas
- Scale the Kafka CR to 4 replicas (example commands below)
- See the error in the Kafka pod
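For reference, the scale-up and the failing pod can be reproduced with plain kubectl; the commands are illustrative and assume the resource name, namespace, and container name from the YAML attached below:

# Bump spec.kafka.replicas from 3 to 4 on the Kafka CR
kubectl patch kafka my-cluster-5261ed90 -n namespace-0 --type merge \
  -p '{"spec":{"kafka":{"replicas":4}}}'
# The restarted broker pods then fail with the IllegalStateException shown above
# (broker 0 in this report):
kubectl logs -n namespace-0 my-cluster-5261ed90-kafka-0 -c kafka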
Expected behavior
The cluster scales from 3 to 4 brokers: the new broker starts, joins the KRaft quorum, and all Kafka pods become ready.
Environment:
- Strimzi version: main
- Installation method: YAML
- Kubernetes cluster: OpenShift 4.10
- Infrastructure: OpenStack
YAML files and logs
Kafka with 3 replicas:
apiVersion: v1
items:
- apiVersion: kafka.strimzi.io/v1beta2
  kind: Kafka
  metadata:
    annotations:
      strimzi.io/pause-reconciliation: "false"
    labels:
      test.case: testPauseReconciliationInKafkaAndKafkaConnectWithConnector
    name: my-cluster-5261ed90
    namespace: namespace-0
  spec:
    kafka:
      config:
        default.replication.factor: 3
        inter.broker.protocol.version: "3.2"
        log.message.format.version: "3.2"
        min.insync.replicas: 2
        offsets.topic.replication.factor: 3
        transaction.state.log.min.isr: 2
        transaction.state.log.replication.factor: 3
      listeners:
      - name: plain
        port: 9092
        tls: false
        type: internal
      - name: tls
        port: 9093
        tls: true
        type: internal
      logging:
        loggers:
          kafka.root.logger.level: DEBUG
        type: inline
      replicas: 3
      storage:
        deleteClaim: true
        size: 1Gi
        type: persistent-claim
      version: 3.2.0
    zookeeper:
      logging:
        loggers:
          zookeeper.root.logger: DEBUG
        type: inline
      replicas: 3
      storage:
        deleteClaim: true
        size: 1Gi
        type: persistent-claim
Kafka with 4 replicas:
apiVersion: v1
items:
- apiVersion: kafka.strimzi.io/v1beta2
  kind: Kafka
  metadata:
    annotations:
      strimzi.io/pause-reconciliation: "false"
    creationTimestamp: "2022-05-25T08:16:08Z"
    generation: 2
    labels:
      test.case: testPauseReconciliationInKafkaAndKafkaConnectWithConnector
    name: my-cluster-5261ed90
    namespace: namespace-0
    resourceVersion: "14487706"
    uid: d7843a81-0409-4769-8858-7ad8d6943a2a
  spec:
    kafka:
      config:
        default.replication.factor: 3
        inter.broker.protocol.version: "3.2"
        log.message.format.version: "3.2"
        min.insync.replicas: 2
        offsets.topic.replication.factor: 3
        transaction.state.log.min.isr: 2
        transaction.state.log.replication.factor: 3
      listeners:
      - name: plain
        port: 9092
        tls: false
        type: internal
      - name: tls
        port: 9093
        tls: true
        type: internal
      logging:
        loggers:
          kafka.root.logger.level: DEBUG
        type: inline
      replicas: 4
      storage:
        deleteClaim: true
        size: 1Gi
        type: persistent-claim
      version: 3.2.0
    zookeeper:
      logging:
        loggers:
          zookeeper.root.logger: DEBUG
        type: inline
      replicas: 3
      storage:
        deleteClaim: true
        size: 1Gi
        type: persistent-claim
  status:
    conditions:
    - lastTransitionTime: "2022-05-25T08:20:30.678Z"
      message: Error while waiting for restarted pod my-cluster-5261ed90-kafka-0 to become ready
      reason: FatalProblem
      status: "True"
      type: NotReady
    observedGeneration: 2
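The NotReady condition above can also be read straight from the CR; a minimal sketch, assuming the same resource name and namespace:

kubectl get kafka my-cluster-5261ed90 -n namespace-0 \
  -o jsonpath='{.status.conditions[?(@.type=="NotReady")].message}'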
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 21 (18 by maintainers)
Running Strimzi 0.33.0, I deployed a Kafka cluster with KRaft mode enabled and Kafka version 3.3.2. I was able to scale up from 3 to 5 brokers with no errors and then scale down again.

On the community call on December 15th, we agreed that, given the amount of work we still need to do on KRaft, we could wait for the proper fix to land in the next Kafka version instead of applying a workaround right now. For that reason we are not raising the priority of this issue. If the Kafka fix does not happen soon, we can revisit that decision.
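For anyone re-testing this on newer versions, one way to confirm that the metadata quorum actually picked up new brokers after a scale-up is the kafka-metadata-quorum.sh tool shipped with Kafka 3.3+; the sketch below assumes the pod, namespace, container name, and plain listener from this report and is not the exact command used in the test above:

kubectl exec -n namespace-0 my-cluster-5261ed90-kafka-0 -c kafka -- \
  /opt/kafka/bin/kafka-metadata-quorum.sh \
  --bootstrap-server localhost:9092 describe --status

The output lists the current voters and observers of the quorum, so after a successful scale-up the new node IDs should appear there.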