kube2iam: Interface conversion fails, causing pods not to get credentials.
Sometimes, pods in my cluster fail with errors like “fatal error: Unable to locate credentials”. Checking the logs of the kube2iam instance on the same node, I see entries like this:
time="2018-11-20T18:40:26Z" level=error msg="PANIC error processing request: interface conversion: interface {} is nil, not *v1.Pod" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=172.19.97.48 res.status=500
This occurs on an EKS cluster running Kubernetes 1.10.3.
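For context on the panic message itself: a plain Go type assertion on a nil interface value panics with exactly this kind of message, whereas the comma-ok form does not. A minimal stand-alone reproduction, using a placeholder type instead of *v1.Pod so it has no external dependencies:

```go
package main

import "fmt"

// fakePod stands in for *v1.Pod so this example has no external dependencies.
type fakePod struct{}

func main() {
	// A nil entry, as might come back from a pod index lookup.
	var obj interface{}

	// The comma-ok form of a type assertion does not panic; it simply
	// reports that the value is not a *fakePod.
	if pod, ok := obj.(*fakePod); ok {
		fmt.Println("got pod:", pod)
	} else {
		fmt.Println("no pod for this IP; return an error instead of panicking")
	}

	// A plain type assertion on the same nil value panics with
	// "interface conversion: interface {} is nil, not *main.fakePod",
	// matching the message in the log above.
	_ = obj.(*fakePod)
}
```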
About this issue
- State: open
- Created 6 years ago
- Reactions: 40
- Comments: 49 (4 by maintainers)
Commits related to this issue
- Add error logs to try and detect #178. — committed to jrnt30/kube2iam by deleted user 6 years ago
- Add error logs to try and detect #178. — committed to jtblin/kube2iam by deleted user 6 years ago
- Do not check the pod's deletionTimestamp. This was causing nil entries in the pod index, which resulted in credentials failing to be issued. See #178 for more information. — committed to rifelpet/kube2iam by rifelpet 5 years ago
- Do not check the pod's deletionTimestamp (#203). This was causing nil entries in the pod index, which resulted in credentials failing to be issued. See #178 for more information. — committed to jtblin/kube2iam by rifelpet 5 years ago
- Do not check the pod's deletionTimestamp (#203). This was causing nil entries in the pod index, which resulted in credentials failing to be issued. See #178 for more information. — committed to jessestuart/kube2iam by rifelpet 5 years ago
Rolled back to 0.10.4 and the error disappeared. “I won’t be using latest. I won’t be using latest.” (repeat 10 times 😃)
getting this on EKS 1.11.5
If someone still has the error:
I set the environment variable AWS_METADATA_SERVICE_TIMEOUT=3 in the pod and it solved the problem. What sometimes happens with kube2iam is that the metadata call takes more than one second to answer, so the default one-second timeout for the metadata service is reached, and since there is no other authentication method available, you get this error.
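For teams that build their pod specs in Go with the k8s.io/api types, a minimal sketch of adding that variable is below. The container name and image are placeholders, and whether an SDK honours AWS_METADATA_SERVICE_TIMEOUT depends on the SDK in use; in most setups you would simply add the same env entry to your pod manifest.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// addMetadataTimeout raises the instance-metadata timeout for AWS SDKs that
// honour AWS_METADATA_SERVICE_TIMEOUT, so a slow answer from the
// kube2iam-proxied metadata endpoint does not immediately translate into
// "Unable to locate credentials".
func addMetadataTimeout(c *corev1.Container) {
	c.Env = append(c.Env, corev1.EnvVar{
		Name:  "AWS_METADATA_SERVICE_TIMEOUT",
		Value: "3", // seconds; the default mentioned above is 1
	})
}

func main() {
	// Placeholder container; in practice this would be part of your pod spec.
	app := corev1.Container{Name: "app", Image: "example.org/app:1.0"}
	addMetadataTimeout(&app)
	fmt.Printf("%+v\n", app.Env)
}
```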
Is this really fixed? I still have the issue with version 0.10.7. I had no issues with 0.10.4…
Quick update: we have been running a fork of the code with this for isPodActive:
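(The fork's exact patch isn't reproduced in this thread; the sketch below is only illustrative of the direction described by the related commits, namely not checking the pod's deletionTimestamp. The package name and function placement are assumptions.)

```go
package kube2iam // illustrative package name

import (
	v1 "k8s.io/api/core/v1"
)

// isPodActive reports whether a pod should be kept in the pod-IP index.
// The p.DeletionTimestamp == nil check is deliberately absent: a
// terminating pod still owns its IP, and dropping it from the index
// early leaves nil entries behind (the panic in this issue).
func isPodActive(p *v1.Pod) bool {
	return p.Status.Phase != v1.PodSucceeded &&
		p.Status.Phase != v1.PodFailed
}
```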
We haven’t had any issue so far.
We still need to test for PodSucceeded and PodFailed, because pods in those phases keep their IP in etcd but the IP can be reallocated by the runtime/CNI once the kubelet has deleted the sandbox. In that case we catch an update with the phase change, which means the IP from the previous status will be deleted from the index (https://github.com/kubernetes/client-go/blob/master/tools/cache/thread_safe_store.go#L253-L255), and since indexFunc returns nil, nothing is added back.
Similarly, we've seen some updates to the pod status where the IP is removed before the deletion event. Since indexFunc also returns nil for podIP == "", such an update will also remove the IP from the index. As an extra precaution, we could add some logic in PodByIP to check that the pod is not nil (which could happen if the index ends up in an inconsistent state, which should not happen except maybe in some edge cases such as out-of-order events) and that it is not Succeeded or Failed. We would have to do that by filtering over all pods in the set (in case a former Succeeded pod and the Running one are indexed at the same IP, for instance); a sketch of this appears below.
Can also confirm in EKS seeing this issue; going from tag:latest to tag:0.10.4 100% fixed this for us.
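To make the indexFunc and PodByIP precautions described above concrete, here is a minimal sketch assuming a client-go cache.Indexer keyed by pod IP; the index name, function signatures, and error text are illustrative rather than kube2iam's actual code.

```go
package kube2iam // illustrative package name

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

const podIPIndexName = "byPodIP" // illustrative index name

// podIPIndexFunc indexes pods by their IP, skipping pods that have no IP
// (or whose IP was cleared in a status update) and pods in a terminal
// phase, whose IP may already have been reallocated by the CNI.
func podIPIndexFunc(obj interface{}) ([]string, error) {
	pod, ok := obj.(*v1.Pod)
	if !ok || pod == nil {
		return nil, nil
	}
	if pod.Status.PodIP == "" ||
		pod.Status.Phase == v1.PodSucceeded ||
		pod.Status.Phase == v1.PodFailed {
		return nil, nil
	}
	return []string{pod.Status.PodIP}, nil
}

// PodByIP looks up the active pod for an IP, defensively filtering out nil
// entries and terminal pods in case the index ever ends up in an
// inconsistent state (e.g. out-of-order events).
func PodByIP(indexer cache.Indexer, ip string) (*v1.Pod, error) {
	objs, err := indexer.ByIndex(podIPIndexName, ip)
	if err != nil {
		return nil, err
	}
	for _, obj := range objs {
		pod, ok := obj.(*v1.Pod)
		if !ok || pod == nil {
			continue
		}
		if pod.Status.Phase == v1.PodSucceeded || pod.Status.Phase == v1.PodFailed {
			continue
		}
		return pod, nil
	}
	return nil, fmt.Errorf("no active pod found for IP %s", ip)
}
```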
@rifelpet I have an idea for how the issue could happen before #173: I think it could occur on force deletes (I haven't tested it, though).
@rifelpet Thank you for the PR. I can't easily think of a scenario where this would happen before #173: it would mean the pod has been deleted from the cache while the index is stale. I wonder if there could be a race condition in the deletion event; I'll need to dig into client-go to see whether that can happen.
A few ideas to work around the issue:
I found this issue because we encountered the problem solved by #173 (no credentials for pods in the terminating phase). Since the original issue has been hard on us, I think I'll try to cherry-pick #173 onto 0.10.4. I'll let you know whether it works for us (at least we would get confirmation that the root cause is #173).