kube2iam: Interface conversion fails, causing pods not to get credentials.
Sometimes, pods in my cluster fail with errors like “fatal error: Unable to locate credentials”. Checking the logs of the kube2iam instance on the same node, I see entries like this:
time="2018-11-20T18:40:26Z" level=error msg="PANIC error processing request: interface conversion: interface {} is nil, not *v1.Pod" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=172.19.97.48 res.status=500
This occurs on an EKS cluster running Kubernetes 1.10.3.
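For context on the panic message itself: a plain Go type assertion on a nil interface value panics with exactly this kind of message, whereas the comma-ok form does not. A minimal stand-alone reproduction, using a placeholder type instead of *v1.Pod so it has no external dependencies:

```go
package main

import "fmt"

// fakePod stands in for *v1.Pod so this example has no external dependencies.
type fakePod struct{}

func main() {
	// A nil entry, as might come back from a pod index lookup.
	var obj interface{}

	// The comma-ok form of a type assertion does not panic; it simply
	// reports that the value is not a *fakePod.
	if pod, ok := obj.(*fakePod); ok {
		fmt.Println("got pod:", pod)
	} else {
		fmt.Println("no pod for this IP; return an error instead of panicking")
	}

	// A plain type assertion on the same nil value panics with
	// "interface conversion: interface {} is nil, not *main.fakePod",
	// matching the message in the log above.
	_ = obj.(*fakePod)
}
```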
About this issue
- State: open
- Created 6 years ago
- Reactions: 40
- Comments: 49 (4 by maintainers)
Commits related to this issue
- Add error logs to try and detect #178. — committed to jrnt30/kube2iam by deleted user 6 years ago
- Add error logs to try and detect #178. — committed to jtblin/kube2iam by deleted user 6 years ago
- Do not check the pod's deletionTimestamp. This was causing nil entries in the pod index, which resulted in credentials failing to be issued. See #178 for more information. — committed to rifelpet/kube2iam by rifelpet 5 years ago
- Do not check the pod's deletionTimestamp (#203). This was causing nil entries in the pod index, which resulted in credentials failing to be issued. See #178 for more information. — committed to jtblin/kube2iam by rifelpet 5 years ago
- Do not check the pod's deletionTimestamp (#203). This was causing nil entries in the pod index, which resulted in credentials failing to be issued. See #178 for more information. — committed to jessestuart/kube2iam by rifelpet 5 years ago
Rolled back to 0.10.4 and the error disappeared. “I won’t be using latest. I won’t be using latest.” (repeat 10 times 😃)
getting this on EKS 1.11.5
If someone still has the error:
I set the environment variable AWS_METADATA_SERVICE_TIMEOUT=3 in the pod and it solved the problem. What sometimes happens with kube2iam is that the metadata call takes more than one second to answer, so the default one-second timeout for the metadata service is reached, and since there is no other authentication method available, you get this error.
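For teams that build their pod specs in Go with the k8s.io/api types, a minimal sketch of adding that variable is below. The container name and image are placeholders, and whether an SDK honours AWS_METADATA_SERVICE_TIMEOUT depends on the SDK in use; in most setups you would simply add the same env entry to your pod manifest.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// addMetadataTimeout raises the instance-metadata timeout for AWS SDKs that
// honour AWS_METADATA_SERVICE_TIMEOUT, so a slow answer from the
// kube2iam-proxied metadata endpoint does not immediately translate into
// "Unable to locate credentials".
func addMetadataTimeout(c *corev1.Container) {
	c.Env = append(c.Env, corev1.EnvVar{
		Name:  "AWS_METADATA_SERVICE_TIMEOUT",
		Value: "3", // seconds; the default mentioned above is 1
	})
}

func main() {
	// Placeholder container; in practice this would be part of your pod spec.
	app := corev1.Container{Name: "app", Image: "example.org/app:1.0"}
	addMetadataTimeout(&app)
	fmt.Printf("%+v\n", app.Env)
}
```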
Is this really fixed? I still have the issue with version 0.10.7. I had no issues with 0.10.4…
Quick update: we have been running a fork of the code with this for isPodActive:
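(The fork's exact patch isn't reproduced in this thread; the sketch below is only illustrative of the direction described by the related commits, namely not checking the pod's deletionTimestamp. The package name and function placement are assumptions.)

```go
package kube2iam // illustrative package name

import (
	v1 "k8s.io/api/core/v1"
)

// isPodActive reports whether a pod should be kept in the pod-IP index.
// The p.DeletionTimestamp == nil check is deliberately absent: a
// terminating pod still owns its IP, and dropping it from the index
// early leaves nil entries behind (the panic in this issue).
func isPodActive(p *v1.Pod) bool {
	return p.Status.Phase != v1.PodSucceeded &&
		p.Status.Phase != v1.PodFailed
}
```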
We haven’t had any issue so far.
We still need to test for PodSucceeded and PodFailed, because pods in those phases keep their IP in etcd but the IP can be reallocated by the runtime/CNI once the kubelet has deleted the sandbox. In that case we catch an update with the phase change, which means the IP from the previous status will be deleted from the index (https://github.com/kubernetes/client-go/blob/master/tools/cache/thread_safe_store.go#L253-L255), and since indexFunc returns nil, nothing is added back.
Similarly, we've seen some updates to the pod status where the IP is removed before the deletion event. Since indexFunc also returns nil for podIP == "", such an update will also remove the IP from the index. As an extra precaution, we could add some logic in PodByIP to check that the pod is not nil (which could happen if the index ends up in an inconsistent state, which should not happen except maybe in some edge cases such as out-of-order events) and that it is not Succeeded or Failed. We would have to do that by filtering over all pods in the set (in case a former Succeeded pod and the Running one are indexed at the same IP, for instance); a sketch of this appears below.
Can also confirm in EKS seeing this issue; going from tag:latest to tag:0.10.4 100% fixed this for us.
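To make the indexFunc and PodByIP precautions described above concrete, here is a minimal sketch assuming a client-go cache.Indexer keyed by pod IP; the index name, function signatures, and error text are illustrative rather than kube2iam's actual code.

```go
package kube2iam // illustrative package name

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

const podIPIndexName = "byPodIP" // illustrative index name

// podIPIndexFunc indexes pods by their IP, skipping pods that have no IP
// (or whose IP was cleared in a status update) and pods in a terminal
// phase, whose IP may already have been reallocated by the CNI.
func podIPIndexFunc(obj interface{}) ([]string, error) {
	pod, ok := obj.(*v1.Pod)
	if !ok || pod == nil {
		return nil, nil
	}
	if pod.Status.PodIP == "" ||
		pod.Status.Phase == v1.PodSucceeded ||
		pod.Status.Phase == v1.PodFailed {
		return nil, nil
	}
	return []string{pod.Status.PodIP}, nil
}

// PodByIP looks up the active pod for an IP, defensively filtering out nil
// entries and terminal pods in case the index ever ends up in an
// inconsistent state (e.g. out-of-order events).
func PodByIP(indexer cache.Indexer, ip string) (*v1.Pod, error) {
	objs, err := indexer.ByIndex(podIPIndexName, ip)
	if err != nil {
		return nil, err
	}
	for _, obj := range objs {
		pod, ok := obj.(*v1.Pod)
		if !ok || pod == nil {
			continue
		}
		if pod.Status.Phase == v1.PodSucceeded || pod.Status.Phase == v1.PodFailed {
			continue
		}
		return pod, nil
	}
	return nil, fmt.Errorf("no active pod found for IP %s", ip)
}
```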
@rifelpet I have an idea for how the issue could happen before #173: I think it could occur on force deletes (I haven't tested it, though).
@rifelpet Thank you for the PR. I can't easily think of a scenario where this would happen before #173: it would mean the pod has been deleted from the cache while the index is stale. I wonder if there could be a race condition in the deletion event; I'll need to dig into client-go to see whether that can happen.
A few ideas to work around the issue:
I found this issue because we encountered the problem solved by #173 (no credentials for pods in the terminating phase). Since the original issue has been hard on us, I think I'll try to cherry-pick #173 onto 0.10.4. I'll let you know whether it works for us (at least we would get confirmation that the root cause is #173).