kiam: kiam server HA. Context deadline exceeded while rotating servers.
In my test AWS lab I run 3 masters with KIAM servers (quay.io/uswitch/kiam:v3.0) and 6 nodes with KIAM agents (same image). I'm getting errors like this
{"addr":"100.112.154.80:56850","level":"error","method":"GET","msg":"error processing request: rpc error: code = DeadlineExceeded desc = context deadline exceeded","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2019-02-18T16:48:17Z"}
every time I rotate the masters. The error starts to appear on all 6 agents once I remove at least one master. To stop it I need to restart all agents; once restarted, they work normally again. This does not happen if I rotate the kiam server itself (delete the pod), only when I rotate the whole master instance.
I'm using this chart https://github.com/helm/charts/tree/master/stable/kiam to run kiam. No special tuning, except that I'm setting `gatewayTimeoutCreation: 500ms` on both agent and server.
Am I missing something important in the docs, some flag? Please advise.
Same issue here, KIAM v3.2.
I am still getting the same error even after upgrading to v3.5. Whenever a new node comes up, the agent restarts with the following error:
"msg":"error creating server gateway: error dialing grpc server: context deadline exceeded"Anyone else also facing this issue?@mechpen - Great spot. I have been doing some testing today and inserted the Keepalive in to a local branch of Kiam - https://github.com/mehstg/kiam/tree/grpc-keepalive-testing
I have just spun it up on one of my test clusters, and cannot get it to fail now, whether I ungracefully terminate the instance, or drain it correctly.
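For reference, here is a minimal sketch of the kind of client-side gRPC keepalive that branch experiments with, using grpc-go's `keepalive` package. The values, the plaintext dial option, and the function name are illustrative assumptions, not what kiam actually ships:

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// dialWithKeepalive is a hypothetical helper showing the client-side keepalive
// options an agent-style client could pass when dialing the server.
func dialWithKeepalive(addr string) (*grpc.ClientConn, error) {
	ka := keepalive.ClientParameters{
		Time:                10 * time.Second, // ping the server when the connection is idle this long
		Timeout:             5 * time.Second,  // drop the connection if a ping is not acked in this window
		PermitWithoutStream: true,             // keep pinging even with no active RPCs
	}
	return grpc.Dial(addr,
		grpc.WithInsecure(), // kiam actually uses mutual TLS; plaintext keeps the sketch short
		grpc.WithKeepaliveParams(ka),
	)
}
```

With keepalive pings enabled, a connection to a server whose instance was powered off should fail its ping and be dropped from the balancer's pool, rather than lingering until a request hits its deadline.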
I should report that since moving `kiam-server` to a `Deployment` instead of a `DaemonSet` we've not experienced any issues on our 6 clusters, including a migration from 1.11 -> 1.12 -> 1.13 -> 1.14 for all of them. I think this is down to each server component being gracefully terminated when the `Deployment` drains correctly, due to `kubectl drain` correctly handling the `Deployment` and sending proper `SIGTERM` events to the pods.

While this doesn't solve the issue of random server failures causing the same symptoms, it does mean we can roll our control plane without fear of breaking our workloads' IAM credentials.
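To illustrate the graceful-termination point: a gRPC server that traps `SIGTERM` and calls `GracefulStop()` tells connected clients (via GOAWAY) that it is going away and lets in-flight RPCs finish, which is what a pod gets the chance to do during a normal drain. A minimal sketch under those assumptions, not kiam's actual server code (the port is made up):

```go
package main

import (
	"log"
	"net"
	"os"
	"os/signal"
	"syscall"

	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", ":9999") // hypothetical port for illustration
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	// ... register services on srv here ...

	go func() {
		sigs := make(chan os.Signal, 1)
		signal.Notify(sigs, syscall.SIGTERM, os.Interrupt)
		<-sigs
		// Stop accepting new RPCs, notify connected clients, and wait for
		// in-flight RPCs to complete before exiting.
		srv.GracefulStop()
	}()

	if err := srv.Serve(lis); err != nil {
		log.Fatal(err)
	}
}
```

An ungraceful instance termination never reaches this shutdown path, which is consistent with the `DaemonSet` power-off behaviour described below.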
Here's the agent logs when running as a `DaemonSet` while using `kops rolling-update cluster` to recycle our master nodes:

At `10:12:34Z` you can see the agent returns `context deadline exceeded` to the client, which is what is causing issues. Our client apps receive this as the response from the meta-data API and this causes further problems (confirmed by running a pod `curl`ing the meta-data endpoint every second; at least one request came back with `context deadline exceeded` as the response).

I think this is down to the `kiam-server` being available right up until the power is switched off (the EC2 `TerminateInstance` API call). I think there's a TCP wait/timeout thing going on, where `kiam-agent` was either halfway through a request to the now-dead server, or just thinks the server is still there because it was never notified it was going away.

I'm off to test this when running the `kiam-server` pods as part of a `Deployment`. My hypothesis is that they will receive graceful shutdowns and notify any gRPC clients they're talking to that they are going away. We'll see.

Ok. So I have been doing some more testing of this issue.
From inside a pod, when I run a cURL to request the credentials from the metadata service, it works fine before killing a KIAM server, as expected:
`curl http://169.254.169.254/latest/meta-data/iam/security-credentials/dev-eu-west-1-dev01-pod-api`

{"Code":"Success","Type":"AWS-HMAC","AccessKeyId":"ASIAYXSSZXR7IBB2RQMP","SecretAccessKey":"O3h+vJPsu8vhWtyWOCkiV9rwh4F/3HMF+jHpMpCP","Token":"FQoGZXIvYXdzEDkaDIL6Kpd6o56IKCaqdyLtAXPxNnhIs6lvNrJwANlEupVSSwGqUzOxOhry10nANFU5VB28YPNc8BYnSy6B8+tMkw10d9ubPwfFj63sdGpuHAS+4wtrpPRbb5p2ma/qvyUu8p33t9+NGZ6BNGYUga43E9+erGSHjDuvD9I4Lwjcz831D3lAYidCAjb99Q+iOWMyScklGfH3YELKGpEtzg55KYPTH16xc+QtOUtdmcUZZsbGnvw04zb6gxhctUCneknWY9+ZfiIGRU2pXvUlavGPCtekyrTp4DnCN2IbDQnmNCfwsZfG0LZZ1yQw41FuZ3igPzzb6TM6a6LlU9gM+Cj4/oTqBQ==","Expiration":"2019-07-31T07:27:56Z","LastUpdated":"2019-07-31T07:12:56Z"}

As soon as I kill a KIAM server, the same curl command fails immediately:

`curl http://169.254.169.254/latest/meta-data/iam/security-credentials/dev-eu-west-1-dev01-pod-api`

error fetching credentials: rpc error: code = DeadlineExceeded desc = context deadline exceeded

I can then repeatedly curl the same request and sometimes it returns OK, sometimes it fails.

The matching log line from the KIAM agent shows similar information:

{"addr":"10.1.113.124:49850","level":"error","method":"HEAD","msg":"error processing request: error fetching credentials: rpc error: code = DeadlineExceeded desc = context deadline exceeded","path":"/latest/meta-data/iam/security-credentials/dev-eu-west-1-dev01-pod-api","status":500,"time":"2019-07-31T07:22:00Z"}
I ran an nslookup from inside a pod at the point of killing the KIAM server too. The record for the killed KIAM server disappears almost instantaneously. (For reference, we are using CoreDNS.)
It feels very much like something is being cached on the KIAM agents in relation to the IP addresses available to gRPC load balance with. DNS is very quickly removing the retired IP from the pool, but it feels like the KIAM agent is still trying to route requests to it.
The access key/secret/token above are nonsensical; they are just examples.
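On the DNS and load-balancing point above, here is a minimal sketch of how a grpc-go client can resolve its backends through DNS and round-robin across them. The Service name, port, and options are assumptions for illustration, not kiam's actual dial code:

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

func main() {
	// "dns:///..." uses grpc-go's DNS resolver to look up every A record for
	// the name; round_robin then spreads RPCs across all resolved backends.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	conn, err := grpc.DialContext(ctx,
		"dns:///kiam-server:443", // hypothetical headless Service name and port
		grpc.WithInsecure(),      // kiam itself uses mutual TLS
		grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
		grpc.WithBlock(),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// ... issue RPCs on conn; each call goes to one of the resolved backends.
}
```

The DNS resolver only re-queries on its own schedule or when a sub-connection errors, so even though CoreDNS drops the dead record almost immediately, the client can keep a stale address in its pool until the corresponding connection actually fails, which may be the caching behaviour observed above.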
Kiam v3.5 (https://github.com/uswitch/kiam/releases/tag/v3.5) has a potential fix for this issue. Has anyone been able to reproduce it on 3.5?
Ok. So I have just been burned by this again on our prod system. Cert-manager replaced the TLS certificates used by KIAM and the kiam-servers restarted. This caused all of the kiam-agents to immediately start logging `error processing request: rpc error: code = DeadlineExceeded desc = context deadline exceeded` and we lost all pod IAM access.

OK, testing with `kiam-server` as a `Deployment` completed and it looks a lot healthier: I rebooted 5 nodes and none of my `kiam-agent`s returned `context deadline exceeded` in this period. I think this is because the pods in a `Deployment` are correctly terminated in a graceful manner and gRPC can cope with that better than the power cord being unplugged like it is with a `DaemonSet`.

I'd love for someone to double-check these findings though. The PR for `kiam-server` as a `Deployment` is here.

I don't think this solves the problem entirely though, as the same "pull the plug" event could happen at any time and agents might still serve `context deadline exceeded` responses to clients.

I am struggling to reproduce the issue myself - when killing Kiam servers in various different ways, the Kiam agent is correctly load balancing between the remaining Kiam servers.
In order to observe the agent’s behaviour, I set the following env vars in the agent pod to increase the log verbosity of grpc:
GRPC_GO_LOG_SEVERITY_LEVEL=info
GRPC_GO_LOG_VERBOSITY_LEVEL=8

The following events occur on one agent that load balances between 2 different servers:
The agent starts up and performs some service discovery to find the servers:
The agent connects to the servers successfully (below). Note that each server is associated with a hex value; in this instance we have the following 2 servers:
- 10.244.1.3 - 0xc4203c1d40
- 10.244.2.2 - 0xc4203c1d60
^ here, one of the grpc connections becomes ready before the other and the ‘roundrobinPicker’ lists this as the only server available in its pool to load balance with.
^ the other connection becomes ready and the load balancing pool is updated to include the second server.
At this point, I kill one of the Kiam servers (10.244.1.3 - 0xc4203c1d40):
^ the agent attempts to reconnect to the server I killed. The agent will continue to try and re-connect with the server that has gone away over time (I have omitted these logs)
^ immediately after the connection error, the pool of servers to load balance with is updated to just include the 1 remaining server.
Throughout this, my curls from another pod to the meta-data API have been successful - there hasn’t been any sign of the issue reported (these curls will also show in the agent logs but have been omitted above for clarity).
Would somebody be able to set the grpc log level env vars mentioned above in their agent deployments and provide agent logs? Including examples of agent startup and when a server is killed, so we can compare. Thanks.
Same issue for us on kiam v3.0
Similar issue, kiam v3.3.