kiam: kiam server HA. Context deadline exceeded while rotating servers.
In my test AWS lab I run 3 masters with KIAM servers (quay.io/uswitch/kiam:v3.0) and 6 nodes with KIAM agents (same image). I'm getting errors like this
{"addr":"100.112.154.80:56850","level":"error","method":"GET","msg":"error processing request: rpc error: code = DeadlineExceeded desc = context deadline exceeded","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2019-02-18T16:48:17Z"}
every time I rotate the masters. The error starts to appear on all 6 agents once I remove at least one master. To stop it I need to restart all agents; once restarted, they work normally again. This does not happen if I rotate the kiam server itself (delete the pod), only when I rotate the whole master instance.
I'm using this chart https://github.com/helm/charts/tree/master/stable/kiam to run kiam. No special tuning, except that I'm setting `gatewayTimeoutCreation: 500ms` on both agent and server.
Am I missing something important in the docs, some flag? Please advise.
Same issue here, KIAM v3.2.
I am still getting the same error even after upgrading to v3.5. Whenever a new node comes up, the agent restarts with the following error:
"msg":"error creating server gateway: error dialing grpc server: context deadline exceeded"Anyone else also facing this issue?@mechpen - Great spot. I have been doing some testing today and inserted the Keepalive in to a local branch of Kiam - https://github.com/mehstg/kiam/tree/grpc-keepalive-testing
I have just spun it up on one of my test clusters, and cannot get it to fail now, whether I ungracefully terminate the instance, or drain it correctly.
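For reference, here is a minimal sketch of the kind of client-side gRPC keepalive that branch experiments with, using grpc-go's `keepalive` package. The values, the plaintext dial option, and the function name are illustrative assumptions, not what kiam actually ships:

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// dialWithKeepalive is a hypothetical helper showing the client-side keepalive
// options an agent-style client could pass when dialing the server.
func dialWithKeepalive(addr string) (*grpc.ClientConn, error) {
	ka := keepalive.ClientParameters{
		Time:                10 * time.Second, // ping the server when the connection is idle this long
		Timeout:             5 * time.Second,  // drop the connection if a ping is not acked in this window
		PermitWithoutStream: true,             // keep pinging even with no active RPCs
	}
	return grpc.Dial(addr,
		grpc.WithInsecure(), // kiam actually uses mutual TLS; plaintext keeps the sketch short
		grpc.WithKeepaliveParams(ka),
	)
}
```

With keepalive pings enabled, a connection to a server whose instance was powered off should fail its ping and be dropped from the balancer's pool, rather than lingering until a request hits its deadline.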
I should report that since moving `kiam-server` to a `Deployment` instead of a `DaemonSet` we've not experienced any issues on our 6 clusters, including a migration from 1.11 -> 1.12 -> 1.13 -> 1.14 for all of them. I think this is down to each server component being gracefully terminated when the `Deployment` drains correctly, due to `kubectl drain` correctly handling the `Deployment` and sending proper `SIGTERM` events to the pods.

While this doesn't solve the issue of random server failures causing the same symptoms, it does mean we can roll our control plane without fear of breaking our workloads' IAM credentials.
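To illustrate the graceful-termination point: a gRPC server that traps `SIGTERM` and calls `GracefulStop()` tells connected clients (via GOAWAY) that it is going away and lets in-flight RPCs finish, which is what a pod gets the chance to do during a normal drain. A minimal sketch under those assumptions, not kiam's actual server code (the port is made up):

```go
package main

import (
	"log"
	"net"
	"os"
	"os/signal"
	"syscall"

	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", ":9999") // hypothetical port for illustration
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	// ... register services on srv here ...

	go func() {
		sigs := make(chan os.Signal, 1)
		signal.Notify(sigs, syscall.SIGTERM, os.Interrupt)
		<-sigs
		// Stop accepting new RPCs, notify connected clients, and wait for
		// in-flight RPCs to complete before exiting.
		srv.GracefulStop()
	}()

	if err := srv.Serve(lis); err != nil {
		log.Fatal(err)
	}
}
```

An ungraceful instance termination never reaches this shutdown path, which is consistent with the `DaemonSet` power-off behaviour described below.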
Here's the agent logs when running as a `DaemonSet` while using `kops rolling-update cluster` to recycle our master nodes:

At `10:12:34Z` you can see the agent returns `context deadline exceeded` to the client, which is what is causing issues. Our client apps receive this as the response from the meta-data API and this causes further problems (confirmed by running a pod `curl`ing the meta-data endpoint every second; at least one request came back with `context deadline exceeded` as the response).

I think this is down to the `kiam-server` being available right up until the power is switched off (the EC2 `TerminateInstance` API call). I think there's a TCP wait/timeout thing going on, where `kiam-agent` was either halfway through a request to the now-dead server, or just thinks the server is still there because it was never notified it was going away.

I'm off to test this when running the `kiam-server` pods as part of a `Deployment`. My hypothesis is that they will receive graceful shutdowns and notify any gRPC clients they're talking to that they are going away. We'll see.

Ok. So I have been doing some more testing of this issue.
From inside a pod, when I run a cURL to request the credentials from the metadata service, it works fine before killing a KIAM server, as expected:
`curl http://169.254.169.254/latest/meta-data/iam/security-credentials/dev-eu-west-1-dev01-pod-api`

{"Code":"Success","Type":"AWS-HMAC","AccessKeyId":"ASIAYXSSZXR7IBB2RQMP","SecretAccessKey":"O3h+vJPsu8vhWtyWOCkiV9rwh4F/3HMF+jHpMpCP","Token":"FQoGZXIvYXdzEDkaDIL6Kpd6o56IKCaqdyLtAXPxNnhIs6lvNrJwANlEupVSSwGqUzOxOhry10nANFU5VB28YPNc8BYnSy6B8+tMkw10d9ubPwfFj63sdGpuHAS+4wtrpPRbb5p2ma/qvyUu8p33t9+NGZ6BNGYUga43E9+erGSHjDuvD9I4Lwjcz831D3lAYidCAjb99Q+iOWMyScklGfH3YELKGpEtzg55KYPTH16xc+QtOUtdmcUZZsbGnvw04zb6gxhctUCneknWY9+ZfiIGRU2pXvUlavGPCtekyrTp4DnCN2IbDQnmNCfwsZfG0LZZ1yQw41FuZ3igPzzb6TM6a6LlU9gM+Cj4/oTqBQ==","Expiration":"2019-07-31T07:27:56Z","LastUpdated":"2019-07-31T07:12:56Z"}

As soon as I kill a KIAM server, the same curl command fails immediately:

`curl http://169.254.169.254/latest/meta-data/iam/security-credentials/dev-eu-west-1-dev01-pod-api`

error fetching credentials: rpc error: code = DeadlineExceeded desc = context deadline exceeded

I can then repeatedly curl the same request and sometimes it returns OK, sometimes it fails.

The matching log line from the KIAM agent shows similar information:

{"addr":"10.1.113.124:49850","level":"error","method":"HEAD","msg":"error processing request: error fetching credentials: rpc error: code = DeadlineExceeded desc = context deadline exceeded","path":"/latest/meta-data/iam/security-credentials/dev-eu-west-1-dev01-pod-api","status":500,"time":"2019-07-31T07:22:00Z"}
I ran an nslookup from inside a pod at the point of killing the KIAM server too. The record for the killed KIAM server disappears almost instantaneously. (For reference, we are using CoreDNS.)
It feels very much like something is being cached on the KIAM agents in relation to the IP addresses available to gRPC load balance with. DNS is very quickly removing the retired IP from the pool, but it feels like the KIAM agent is still trying to route requests to it.
The access key/secret/token above are nonsensical; they are just examples.
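On the DNS and load-balancing point above, here is a minimal sketch of how a grpc-go client can resolve its backends through DNS and round-robin across them. The Service name, port, and options are assumptions for illustration, not kiam's actual dial code:

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

func main() {
	// "dns:///..." uses grpc-go's DNS resolver to look up every A record for
	// the name; round_robin then spreads RPCs across all resolved backends.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	conn, err := grpc.DialContext(ctx,
		"dns:///kiam-server:443", // hypothetical headless Service name and port
		grpc.WithInsecure(),      // kiam itself uses mutual TLS
		grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
		grpc.WithBlock(),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// ... issue RPCs on conn; each call goes to one of the resolved backends.
}
```

The DNS resolver only re-queries on its own schedule or when a sub-connection errors, so even though CoreDNS drops the dead record almost immediately, the client can keep a stale address in its pool until the corresponding connection actually fails, which may be the caching behaviour observed above.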
Kiam v3.5 (https://github.com/uswitch/kiam/releases/tag/v3.5) has a potential fix for this issue. Has anyone been able to reproduce it on 3.5?
Ok. So I have just been burned by this again on our prod system. Cert-manager replaced the TLS certificates used by KIAM and the kiam-servers restarted. This caused all of the kiam-agents to immediately start logging `error processing request: rpc error: code = DeadlineExceeded desc = context deadline exceeded` and we lost all pod IAM access.

OK, testing with `kiam-server` as a `Deployment` completed and it looks a lot healthier: I rebooted 5 nodes and none of my `kiam-agent`s returned `context deadline exceeded` in this period. I think this is because the pods in a `Deployment` are correctly terminated in a graceful manner and gRPC can cope with that better than the power cord being unplugged like it is with a `DaemonSet`.

I'd love for someone to double-check these findings though. The PR for `kiam-server` as a `Deployment` is here.

I don't think this solves the problem entirely though, as the same "pull the plug" event could happen at any time and agents might still serve `context deadline exceeded` responses to clients.

I am struggling to reproduce the issue myself - when killing Kiam servers in various different ways, the Kiam agent is correctly load balancing between the remaining Kiam servers.
In order to observe the agent’s behaviour, I set the following env vars in the agent pod to increase the log verbosity of grpc:
GRPC_GO_LOG_SEVERITY_LEVEL=info
GRPC_GO_LOG_VERBOSITY_LEVEL=8

The following events occur on one agent that load balances between 2 different servers:
The agent starts up and performs some service discovery to find the servers:
The agent connects to the servers successfully (below). Note that each server is associated with a hex value; in this instance we have the following 2 servers:
- 10.244.1.3 - 0xc4203c1d40
- 10.244.2.2 - 0xc4203c1d60
^ here, one of the grpc connections becomes ready before the other and the ‘roundrobinPicker’ lists this as the only server available in its pool to load balance with.
^ the other connection becomes ready and the load balancing pool is updated to include the second server.
At this point, I kill one of the Kiam servers (10.244.1.3 - 0xc4203c1d40):
^ the agent attempts to reconnect to the server I killed. The agent will continue to try and re-connect with the server that has gone away over time (I have omitted these logs)
^ immediately after the connection error, the pool of servers to load balance with is updated to just include the 1 remaining server.
Throughout this, my curls from another pod to the meta-data API have been successful - there hasn’t been any sign of the issue reported (these curls will also show in the agent logs but have been omitted above for clarity).
Would somebody be able to set the grpc log level env vars mentioned above in their agent deployments and provide agent logs? Including examples of agent startup and when a server is killed, so we can compare. Thanks.
Same issue for us on kiam v3.0
Similar issue, kiam v3.3.