kube2iam: Interface conversion fails, causing pods not to get credentials.

Sometimes, pods in my cluster fail with errors like “fatal error: Unable to locate credentials”. Checking the logs of the kube2iam instance running on the same node, I see entries like this:

time="2018-11-20T18:40:26Z" level=error msg="PANIC error processing request: interface conversion: interface {} is nil, not *v1.Pod" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=172.19.97.48 res.status=500

This occurs on an EKS cluster running Kubernetes 1.10.3.

About this issue

  • State: open
  • Created 6 years ago
  • Reactions: 40
  • Comments: 49 (4 by maintainers)

Most upvoted comments

Rolled back to 0.10.4 and the error disappeared. I won’t be using “latest”. I won’t be using “latest”. (Repeat 10 times 😃)

Getting this on EKS 1.11.5.

If someone still has the error:

  File "/usr/local/lib/python3.7/site-packages/botocore/signers.py", line 157, in sign
    auth.add_auth(request)
  File "/usr/local/lib/python3.7/site-packages/botocore/auth.py", line 357, in add_auth
    raise NoCredentialsError
botocore.exceptions.NoCredentialsError: Unable to locate credentials

I set the environment variable AWS_METADATA_SERVICE_TIMEOUT=3 in the pod and it solved the problem. What sometimes happens with kube2iam is that the call to the metadata endpoint takes more than one second to answer, so the default metadata service timeout is reached, and since there is no other authentication method available, we get this error.
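
For anyone wondering where that setting goes, a minimal pod-spec fragment along these lines should do it (the pod, container, and image names below are placeholders, not anything from this thread):

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-app               # placeholder
    spec:
      containers:
        - name: app                   # placeholder
          image: example/app:1.0      # placeholder
          env:
            # Give the SDK more time to reach the kube2iam-proxied metadata
            # endpoint before it gives up and raises NoCredentialsError.
            - name: AWS_METADATA_SERVICE_TIMEOUT
              value: "3"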

Is this really fixed? I still have the issue with version 0.10.7. I had no issues with 0.10.4…

Quick update: we have been running a fork of the code with this for isPodActive:

    // signature assumed for context
    func isPodActive(p *v1.Pod) bool {
        return p.Status.PodIP != "" &&
            v1.PodSucceeded != p.Status.Phase &&
            v1.PodFailed != p.Status.Phase
    }

We haven’t had any issues so far.

We still need to test for PodSucceeded and PodFailed because pods in those phases keep their IP in etcd, but the IP can be reallocated by the runtime/CNI since the kubelet has deleted the sandbox. In that case we catch an update with the phase change, which means the IP from the previous state is deleted from the index (https://github.com/kubernetes/client-go/blob/master/tools/cache/thread_safe_store.go#L253-L255), and since indexFunc returns nil, nothing is added back.
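
As a rough illustration of that behaviour (illustrative names, not kube2iam’s actual identifiers): an IP index function along these lines returns no keys for IP-less or terminal pods, so the update removes the old IP from the index and adds nothing back.

    package sketch

    import (
        "fmt"

        v1 "k8s.io/api/core/v1"
    )

    // podIPIndexFunc is a sketch of a client-go cache.IndexFunc that maps a
    // pod to its IP while skipping pods whose IP should no longer resolve.
    func podIPIndexFunc(obj interface{}) ([]string, error) {
        pod, ok := obj.(*v1.Pod)
        if !ok {
            return nil, fmt.Errorf("expected *v1.Pod, got %T", obj)
        }
        if pod.Status.PodIP == "" ||
            pod.Status.Phase == v1.PodSucceeded ||
            pod.Status.Phase == v1.PodFailed {
            // No keys: on this update the pod's previous IP is removed from
            // the index and nothing is added back.
            return nil, nil
        }
        return []string{pod.Status.PodIP}, nil
    }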

Similarly, we’ve seen some updates to the pod status where the IP is removed before the deletion event. Since indexFunc also returns nil for podIP == "", such an update will also remove the IP from the index.

As an extra precaution, we could add some logic in PodByIP to check that the pod is not nil (which could happen if the index ends up in an inconsistent state, which should not happen except maybe in some edge cases such as out-of-order events) and that it is not Succeeded or Failed. We would have to do that by filtering over all pods in the set (in case we have a former Succeeded pod and the Running one indexed at the same IP, for instance). A sketch of what that could look like is below.
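
Something along these lines, continuing the illustrative setup above and additionally assuming client-go’s cache package is imported (the store type and the “byIP” index name are placeholders, not the real kube2iam implementation):

    // store wraps a client-go indexer that registered podIPIndexFunc under
    // the (placeholder) index name "byIP".
    type store struct {
        indexer cache.Indexer
    }

    // PodByIP resolves an IP to a pod while skipping anything we should not
    // hand credentials to, instead of blindly type-asserting the first entry.
    func (s *store) PodByIP(ip string) (*v1.Pod, error) {
        objs, err := s.indexer.ByIndex("byIP", ip)
        if err != nil {
            return nil, err
        }
        for _, obj := range objs {
            pod, ok := obj.(*v1.Pod) // guard the assertion instead of panicking on a nil entry
            if !ok || pod == nil {
                continue
            }
            if pod.Status.Phase == v1.PodSucceeded || pod.Status.Phase == v1.PodFailed {
                continue // a former pod can still be indexed at a reused IP
            }
            return pod, nil
        }
        return nil, fmt.Errorf("no active pod found for IP %s", ip)
    }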

Can also confirm we are seeing this issue on EKS; going from tag:latest to tag:0.10.4 fixed this 100% for us.

@rifelpet I have an idea for how the issue could occur before #173: I think it could happen on force deletes (I haven’t tested it, though).

@rifelpet Thank you for the PR. I can’t easily think of a scenario where this would happen before #173: it would mean the pod has been deleted from the cache while the index is stale. I wonder if there could be a race condition in the deletion event:

  • indexer deletion is triggered
  • pod is deleted from the cache
  • we receive a get credential request
  • indexer deletion finishes

I’ll need to dig into client-go to see if this can happen.

A few ideas to work around the issue:

  • update client-go to the latest version (there are a few fixes/improvements to the indexer in more recent versions)
  • verify that the pod returned by the index is not nil. This could help with race conditions when we try to get a pod during its deletion (the scenario I described above) but would not solve the problem if the index is out of sync
  • since there is no way to force an index resync (once it holds invalid information for an IP it will never be cleaned up), another option would be to change the recover strategy in the HTTP handler (simplest option: don’t recover at all; if kube2iam panics because of an out-of-sync index, it will crash and restart with a clean one). A rough sketch of this last idea is below.
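
For that last option, a minimal sketch of the idea (not kube2iam’s actual handler): recover just long enough to log the request that triggered the panic, then exit, which has the same effect as not recovering at all because the process restarts and rebuilds its index from scratch.

    package sketch

    import (
        "log"
        "net/http"
        "os"
    )

    // panicExitMiddleware logs a panic raised while serving a request and
    // then exits instead of answering 500s from a corrupted index; the
    // kubelet restarts the pod, which rebuilds the cache cleanly.
    func panicExitMiddleware(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            defer func() {
                if err := recover(); err != nil {
                    log.Printf("panic while serving %s: %v; exiting so the index is rebuilt", r.URL.Path, err)
                    os.Exit(1)
                }
            }()
            next.ServeHTTP(w, r)
        })
    }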

I found this issue because we encountered the problem solved by #173 (no credentials for pods in the Terminating phase). Since the original issue has been hard on us, I think I’ll try to cherry-pick #173 onto 0.10.4. I’ll let you know whether it works for us (at least we could get a confirmation that the root cause is #173).