kiam: Unable to assign role due to "pod not found" error
This issue resembles #46, and might be essentially the same one.
We switched from kube2iam to kiam a couple of weeks ago. We run most of our Kubernetes nodes on spot instances, so they are replaced pretty frequently. When a node is drained before termination, its pods are rescheduled on other nodes, and sometimes the application they run starts throwing the following error:
unable to sign request without credentials set
On the agent side I see the following errors:
{
"addr": "100.96.143.201:37862",
"level": "error",
"method": "GET",
"msg": "error processing request: rpc error: code = Unknown desc = pod not found",
"path": "/latest/meta-data/iam/security-credentials/",
"status": 500,
"time": "2018-03-15T18:59:02Z"
}
And these are the errors on the server side:
{
"level": "error",
"msg": "error finding pod: pod not found",
"pod.ip": "100.96.143.201",
"time": "2018-03-15T18:59:01Z"
}
It looks like there is a delay between the time a pod is scheduled and the time information about it becomes available to kiam. If I delay the startup of the app by a couple of seconds, the problem goes away. Deleting the problematic pod and letting Kubernetes reschedule it does the trick as well.
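For reference, this is roughly one way to implement the delayed-startup workaround: instead of a fixed sleep, an initContainer polls the credentials path from the agent logs above until kiam can answer for the pod. This is only a sketch; the pod name, images, and role annotation are placeholders.

```yaml
# Hypothetical pod spec: the initContainer blocks the app until the kiam agent
# (which proxies 169.254.169.254 on the node) can serve credentials for this pod.
# Pod name, images, and role annotation are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
  annotations:
    iam.amazonaws.com/role: example-role
spec:
  initContainers:
    - name: wait-for-credentials
      image: busybox
      command:
        - sh
        - -c
        - |
          # Retry until the credentials listing succeeds, i.e. kiam has seen the pod.
          until wget -q -O /dev/null http://169.254.169.254/latest/meta-data/iam/security-credentials/; do
            echo "waiting for kiam to serve credentials..."
            sleep 2
          done
  containers:
    - name: app
      image: example/app:latest
```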
After looking into your code, it seems that increasing the prefetch-buffer-size value might help, since the issue mostly happens when many pods are scheduled at the same time. But maybe I'm missing something.
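In case it helps to be concrete, this is roughly how we would try raising it, assuming the setting is exposed as a `--prefetch-buffer-size` command-line flag on the server; the image path and the value are only examples, not a maintainer recommendation.

```yaml
# Illustrative fragment only (not a complete DaemonSet), assuming the setting is
# exposed as a --prefetch-buffer-size flag; 2000 is an arbitrary example value.
containers:
  - name: kiam-server
    image: quay.io/uswitch/kiam:v2.5
    args:
      - --prefetch-buffer-size=2000   # larger buffer for bursts of newly scheduled pods
```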
Any advice would be greatly appreciated.
P.S.: We’re using kiam:v2.5 and kubernetes v1.8.6.
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 20 (10 by maintainers)
To get around errors caused by the pod cache not being primed, I inject the following ENV variables into ANY container that's going to access AWS from my k8s cluster:
https://boto3.readthedocs.io/en/latest/guide/configuration.html
boto3 and the AWS CLI obey these parameters, and basically these settings give the app many more retries to get what it needs while the cache is being primed.
Before I added these env variables, many of my pods would always fail on first start, since many of them begin by fetching a file or two from S3.
With this it was smooth sailing: pods come up on the first attempt without errors or pod restarts.
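The comment does not list the variables by name, but given the linked boto3 configuration guide they are presumably the instance-metadata timeout/retry settings. A minimal sketch of setting them on a container, with placeholder names and example values:

```yaml
# Presumed variables: the instance-metadata timeout/retry settings documented in
# the linked boto3 configuration guide. Container name, image, and values are
# placeholders, not taken from the original comment.
containers:
  - name: app
    image: example/app:latest
    env:
      - name: AWS_METADATA_SERVICE_TIMEOUT        # seconds to wait for each metadata request
        value: "5"
      - name: AWS_METADATA_SERVICE_NUM_ATTEMPTS   # retries before the SDK gives up on credentials
        value: "20"
```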