franz-go: ILLEGAL_SASL_STATE: Request is not valid given the current SASL state

We are using franz-go to connect to an AWS MSK cluster using an IAM role. We are seeing our OnPartitionsLost and/or OnPartitionsRevoked methods called each time franz-go attempts to re-authenticate with the MSK cluster using SASL/IAM. This is resulting in the service getting stuck in a re-authentication loop for 2-3 seconds each time before it re-creates the connection to Kafka.

Logs:

2022-11-10T00:23:54.933-07:00 [INFO] sasl lifetime minus 2.5s lower bound latency results in immediate reauthentication, sleeping 100ms to avoid spin-loop; broker: 3, session_lifetime: 1.068s, latency_lower_bound: 2.5s
2022-11-10T00:23:55.035-07:00 [INFO] metadata update triggered; why: opportunistic load during source backoff
2022-11-10T00:23:55.127-07:00 [INFO] sasl lifetime minus 2.5s lower bound latency results in immediate reauthentication, sleeping 100ms to avoid spin-loop; broker: 1, session_lifetime: 873ms, latency_lower_bound: 2.5s
2022-11-10T00:23:55.241-07:00 [INFO] sasl lifetime minus 2.5s lower bound latency results in immediate reauthentication, sleeping 100ms to avoid spin-loop; broker: 1, session_lifetime: 760ms, latency_lower_bound: 2.5s
2022-11-10T00:23:55.299-07:00 [INFO] sasl lifetime minus 2.5s lower bound latency results in immediate reauthentication, sleeping 100ms to avoid spin-loop; broker: 3, session_lifetime: 702ms, latency_lower_bound: 2.5s
2022-11-10T00:23:55.413-07:00 [INFO] sasl lifetime minus 2.5s lower bound latency results in immediate reauthentication, sleeping 100ms to avoid spin-loop; broker: 3, session_lifetime: 588ms, latency_lower_bound: 2.5s
2022-11-10T00:23:55.563-07:00 [INFO] metadata update triggered; why: re-updating metadata due to err: ILLEGAL_SASL_STATE: Request is not valid given the current SASL state.
2022-11-10T00:23:55.618-07:00 [INFO] sasl lifetime minus 2.5s lower bound latency results in immediate reauthentication, sleeping 100ms to avoid spin-loop; broker: 2, session_lifetime: 383ms, latency_lower_bound: 2.5s
2022-11-10T00:23:56.122-07:00 [ERROR] unable to initialize sasl; broker: 3, err: [***REDACTED***]: Session too short: SASL_AUTHENTICATION_FAILED: SASL Authentication failed.
2022-11-10T00:23:56.131-07:00 [INFO] metadata update triggered; why: re-updating metadata due to err: ILLEGAL_SASL_STATE: Request is not valid given the current SASL state.
2022-11-10T00:23:57.609-07:00 [INFO] metadata update triggered; why: re-updating metadata due to err: Cannot change principals during re-authentication from ***REDACTED***: ***REDACTED***: SASL_AUTHENTICATION_FAILED: SASL Authentication failed.
2022-11-10T00:35:07.692-07:00 [INFO] sasl lifetime minus 2.5s lower bound latency results in immediate reauthentication, sleeping 100ms to avoid spin-loop; broker: 1, session_lifetime: 2.308s, latency_lower_bound: 2.5s
2022-11-10T00:35:07.794-07:00 [INFO] metadata update triggered; why: opportunistic load during source backoff
2022-11-10T00:35:07.855-07:00 [INFO] sasl lifetime minus 2.5s lower bound latency results in immediate reauthentication, sleeping 100ms to avoid spin-loop; broker: 1, session_lifetime: 2.146s, latency_lower_bound: 2.5s
2022-11-10T00:35:07.904-07:00 [INFO] sasl lifetime minus 2.5s lower bound latency results in immediate reauthentication, sleeping 100ms to avoid spin-loop; broker: 3, session_lifetime: 2.096s, latency_lower_bound: 2.5s
2022-11-10T00:35:07.969-07:00 [INFO] sasl lifetime minus 2.5s lower bound latency results in immediate reauthentication, sleeping 100ms to avoid spin-loop; broker: 1, session_lifetime: 2.032s, latency_lower_bound: 2.5s
2022-11-10T00:35:08.090-07:00 [INFO] sasl lifetime minus 2.5s lower bound latency results in immediate reauthentication, sleeping 100ms to avoid spin-loop; broker: 1, session_lifetime: 1.91s, latency_lower_bound: 2.5s
2022-11-10T00:35:08.195-07:00 [INFO] sasl lifetime minus 2.5s lower bound latency results in immediate reauthentication, sleeping 100ms to avoid spin-loop; broker: 2, session_lifetime: 1.805s, latency_lower_bound: 2.5s
2022-11-10T00:35:08.205-07:00 [INFO] sasl lifetime minus 2.5s lower bound latency results in immediate reauthentication, sleeping 100ms to avoid spin-loop; broker: 1, session_lifetime: 1.796s, latency_lower_bound: 2.5s
2022-11-10T00:35:08.330-07:00 [INFO] metadata update triggered; why: opportunistic load during source backoff
2022-11-10T00:35:08.359-07:00 [INFO] sasl lifetime minus 2.5s lower bound latency results in immediate reauthentication, sleeping 100ms to avoid spin-loop; broker: 2, session_lifetime: 1.642s, latency_lower_bound: 2.5s
2022-11-10T00:35:08.365-07:00 [INFO] sasl lifetime minus 2.5s lower bound latency results in immediate reauthentication, sleeping 100ms to avoid spin-loop; broker: 3, session_lifetime: 1.636s, latency_lower_bound: 2.5s
2022-11-10T00:35:08.478-07:00 [INFO] sasl lifetime minus 2.5s lower bound latency results in immediate reauthentication, sleeping 100ms to avoid spin-loop; broker: 3, session_lifetime: 1.522s, latency_lower_bound: 2.5s
2022-11-10T00:35:08.561-07:00 [INFO] sasl lifetime minus 2.5s lower bound latency results in immediate reauthentication, sleeping 100ms to avoid spin-loop; broker: 2, session_lifetime: 1.44s, latency_lower_bound: 2.5s
2022-11-10T00:35:08.693-07:00 [INFO] sasl lifetime minus 2.5s lower bound latency results in immediate reauthentication, sleeping 100ms to avoid spin-loop; broker: 2, session_lifetime: 1.308s, latency_lower_bound: 2.5s
2022-11-10T00:35:08.795-07:00 [INFO] heartbeat errored; group: ***REDACTED***, err: ILLEGAL_SASL_STATE: Request is not valid given the current SASL state.
2022-11-10T00:35:08.795-07:00 [INFO] assigning partitions; why: clearing assignment at end of group management session, how: unassigning everything, input:
2022-11-10T00:35:08.796-07:00 [ERROR] join and sync loop errored; group: ***REDACTED***, err: ILLEGAL_SASL_STATE: Request is not valid given the current SASL state., consecutive_errors: 5, backoff: 2.5s
2022-11-10T00:35:09.023-07:00 [INFO] metadata update triggered; why: re-updating metadata due to err: ILLEGAL_SASL_STATE: Request is not valid given the current SASL state.
2022-11-10T00:35:09.087-07:00 [INFO] sasl lifetime minus 2.5s lower bound latency results in immediate reauthentication, sleeping 100ms to avoid spin-loop; broker: 3, session_lifetime: 914ms, latency_lower_bound: 2.5s
2022-11-10T00:35:09.206-07:00 [INFO] sasl lifetime minus 2.5s lower bound latency results in immediate reauthentication, sleeping 100ms to avoid spin-loop; broker: 3, session_lifetime: 795ms, latency_lower_bound: 2.5s
2022-11-10T00:35:10.252-07:00 [INFO] metadata update triggered; why: re-updating metadata due to err: ILLEGAL_SASL_STATE: Request is not valid given the current SASL state.
2022-11-10T00:35:11.296-07:00 [INFO] joining group; group: ***REDACTED***
2022-11-10T00:35:11.354-07:00 [INFO] joined, balancing group; group: ***REDACTED***, member_id: ***REDACTED***, instance_id: <nil>, generation: 58, balance_protocol: cooperative-sticky, leader: true
2022-11-10T00:35:11.354-07:00 [INFO] balancing group as leader
2022-11-10T00:35:11.354-07:00 [INFO] balance group member; id: ***REDACTED***, interests: interested topics: [***REDACTED*** ***REDACTED*** ***REDACTED***], previously owned:
2022-11-10T00:35:11.354-07:00 [INFO] balanced; plan: ***REDACTED***{***REDACTED***[0 1 2 3], ***REDACTED***[0 1 2 3], ***REDACTED***[0 1 2 3]}
2022-11-10T00:35:11.354-07:00 [INFO] syncing; group: ***REDACTED***, protocol_type: consumer, protocol: cooperative-sticky
2022-11-10T00:35:11.361-07:00 [INFO] synced; group: ***REDACTED***, assigned: ***REDACTED***[0 1 2 3], ***REDACTED***[0 1 2 3], ***REDACTED***[0 1 2 3]
2022-11-10T00:35:11.361-07:00 [INFO] new group session begun; group: ***REDACTED***, added: ***REDACTED***[0 1 2 3], ***REDACTED***[0 1 2 3], ***REDACTED***[0 1 2 3], lost:
2022-11-10T00:35:11.361-07:00 [INFO] beginning heartbeat loop; group: ***REDACTED***
2022-11-10T00:35:11.406-07:00 [INFO] assigning partitions; why: newly fetched offsets for group ***REDACTED***, how: assigning everything new, keeping current assignment, input: ***REDACTED***[1{-2.-1 0} 2{-2.-1 0} 0{-2.-1 0} 3{-2.-1 0}], ***REDACTED***[2{24031.14 0} 1{23435.14 0} 0{27120.16 0} 3{26493.16 0}], ***REDACTED***[1{7953.16 0} 0{9211.13 0} 3{8975.14 0} 2{8149.12 0}]
2022-11-10T00:37:02.642-07:00 [INFO] sasl lifetime minus 2.5s lower bound latency results in immediate reauthentication, sleeping 100ms to avoid spin-loop; broker: 1, session_lifetime: 2.359s, latency_lower_bound: 2.5s
2022-11-10T00:37:02.743-07:00 [INFO] metadata update triggered; why: opportunistic load during source backoff
2022-11-10T00:37:02.775-07:00 [INFO] sasl lifetime minus 2.5s lower bound latency results in immediate reauthentication, sleeping 100ms to avoid spin-loop; broker: 2, session_lifetime: 2.226s, latency_lower_bound: 2.5s

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 33 (15 by maintainers)

Commits related to this issue

Most upvoted comments

fwiw, a 100ms check might not be 100% resilient: the check likely happens before the session is created, so if there’s a hang between the check and the session, then the problem will be created internally.

The franz-go client is now behaving exactly the same as the Java client by reauthenticating far far earlier than the session expires. What issues like this have shown is that AWS is returning the same session, and eventually it returns a session with a <2s lifetime. I’m not sure what difference there could be in this client any more, and I’m not sure what could change on the client side at this point.