kops: Enabling IRSA on a self-hosted k8s cluster with kops runs into a problem with cilium pods
/kind bug
1. What kops version are you running? The command kops version, will display
this information.
Version 1.23.2
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
Version 1.21.5
3. What cloud provider are you using?
Self Hosted K8S cluster on AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
Enabled serviceAccountIssuerDiscovery with enableAWSOIDCProvider: true and set an S3 bucket as the JWKS discovery store, then applied the change with kops update cluster followed by kops rolling-update.
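For reference, a minimal sketch of the command sequence used (the cluster name is a placeholder):
kops update cluster --name my.example.com --yes
kops rolling-update cluster --name my.example.com --yes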
5. What happened after the commands executed?
kops decided that all 3 master nodes (3 in total) needed to be updated. Once it started updating the first master node, the rolling update got stuck because the cilium pod running on that master kept logging the following errors:
level=info msg="Initializing daemon" subsys=daemon
level=info msg="Establishing connection to apiserver" host="https://internal.api.address:443" subsys=k8s
level=info msg="Establishing connection to apiserver" host="https://internal.api.address:443" subsys=k8s
level=info msg="Establishing connection to apiserver" host="https://internal.api.address:443" subsys=k8s
level=info msg="Establishing connection to apiserver" host="https://internal.api.address:443" subsys=k8s
level=info msg="Establishing connection to apiserver" host="https://internal.api.address:443" subsys=k8s
level=info msg="Establishing connection to apiserver" host="https://internal.api.address:443" subsys=k8s
level=info msg="Establishing connection to apiserver" host="https://internal.api.address:443" subsys=k8s
level=info msg="Establishing connection to apiserver" host="https://internal.api.address:443" subsys=k8s
level=error msg="Unable to contact k8s api-server" error=Unauthorized ipAddr="https://internal.api.address:443" subsys=k8s
level=fatal msg="Unable to initialize Kubernetes subsystem" error="unable to create k8s client: unable to create k8s client: Unauthorized" subsys=daemon
This eventually makes the other cilium pods go into CrashLoopBackOff with the following message:
level=fatal msg="Unable to initialize Kubernetes subsystem" error="the server has asked for the client to provide credentials" subsys=daemon
In the end the whole cluster becomes unusable, as all pods eventually stop working.
6. What did you expect to happen?
That the cluster would come up normally after kops rolling-update and start using the new OIDC serviceAccountIssuer, enabling IAM Roles for Service Accounts (IRSA).
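Once the rollout succeeds, the published issuer metadata should also be readable from the discovery store (a hedged check; the bucket URL below is hypothetical, since the real discoveryStore is redacted):
curl -s https://REDACTED.s3.amazonaws.com/.well-known/openid-configuration
# the "issuer" field should match the new serviceAccountIssuer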
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2021-10-11T16:04:09Z"
  generation: 17
  name: REDACTED
spec:
  api:
    loadBalancer:
      class: Network
      type: Internal
  authorization:
    rbac: {}
  certManager:
    enabled: true
    managed: false
  channel: stable
  cloudProvider: aws
  configBase: REDACTED
  containerRuntime: containerd
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-aws-region-1a
      name: a
    - encryptedVolume: true
      instanceGroup: master-aws-region-1b
      name: b
    - encryptedVolume: true
      instanceGroup: master-aws-region-1c
      name: c
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-aws-region-1a
      name: a
    - encryptedVolume: true
      instanceGroup: master-aws-region-1b
      name: b
    - encryptedVolume: true
      instanceGroup: master-aws-region-1c
      name: c
    memoryRequest: 100Mi
    name: events
  externalPolicies:
    master:
    - REDACTED
    - REDACTED
    node:
    - REDACTED
    - REDACTED
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeProxy:
    enabled: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.21.5
  masterInternalName: REDACTED
  masterPublicName: REDACTED
  metricsServer:
    enabled: true
    insecure: false
  networkCIDR: 172.20.0.0/16
  networking:
    cilium:
      enableNodePort: true
  nonMasqueradeCIDR: 100.64.0.0/10
  podIdentityWebhook:
    enabled: true
  serviceAccountIssuerDiscovery:
    discoveryStore: REDACTED
    enableAWSOIDCProvider: true
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.20.32.0/19
    name: aws-region-1a
    type: Private
    zone: aws-region-1a
  - cidr: 172.20.64.0/19
    name: aws-region-1b
    type: Private
    zone: aws-region-1b
  - cidr: 172.20.96.0/19
    name: aws-region-1c
    type: Private
    zone: aws-region-1c
  - cidr: 172.20.0.0/22
    name: utility-aws-region-1a
    type: Utility
    zone: aws-region-1a
  - cidr: 172.20.4.0/22
    name: utility-aws-region-1b
    type: Utility
    zone: aws-region-1b
  - cidr: 172.20.8.0/22
    name: utility-aws-region-1c
    type: Utility
    zone: aws-region-1c
  topology:
    bastion:
      bastionPublicName: REDACTED
    dns:
      type: Public
    masters: private
    nodes: private
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-10-11T16:04:10Z"
  labels:
    kops.k8s.io/cluster: REDACTED
  name: bastions
spec:
  associatePublicIp: false
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20211001
  machineType: t3.micro
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: bastions
  role: Bastion
  subnets:
  - aws-region-1a
  - aws-region-1b
  - aws-region-1c
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-10-11T16:04:09Z"
  labels:
    kops.k8s.io/cluster: REDACTED
  name: master-aws-region-1a
spec:
  associatePublicIp: false
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20211001
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-aws-region-1a
  role: Master
  subnets:
  - aws-region-1a
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-10-11T16:04:09Z"
  labels:
    kops.k8s.io/cluster: REDACTED
  name: master-aws-region-1b
spec:
  associatePublicIp: false
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20211001
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-aws-region-1b
  role: Master
  subnets:
  - aws-region-1b
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-10-11T16:04:10Z"
  labels:
    kops.k8s.io/cluster: REDACTED
  name: master-aws-region-1c
spec:
  associatePublicIp: false
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20211001
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-aws-region-1c
  role: Master
  subnets:
  - aws-region-1c
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-10-11T16:04:10Z"
  labels:
    kops.k8s.io/cluster: REDACTED
  name: nodes-aws-region-1a
spec:
  associatePublicIp: false
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20211001
  machineType: t3.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-aws-region-1a
  role: Node
  subnets:
  - aws-region-1a
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-10-11T16:04:10Z"
  labels:
    kops.k8s.io/cluster: REDACTED
  name: nodes-aws-region-1b
spec:
  associatePublicIp: false
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20211001
  machineType: t3.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-aws-region-1b
  role: Node
  subnets:
  - aws-region-1b
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-10-11T16:04:10Z"
  labels:
    kops.k8s.io/cluster: REDACTED
  name: nodes-aws-region-1c
spec:
  associatePublicIp: false
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20211001
  machineType: t3.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-aws-region-1c
  role: Node
  subnets:
  - aws-region-1c
8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.
The kubelet logs are as follows:
E0828 20:35:00.314637 6170 server.go:273] "Unable to authenticate the request due to an error" err="invalid bearer token"
E0828 20:35:01.617126 6170 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cilium-agent\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=cilium-agent pod=cilium-v8h2f_kube-system(bb885613-e4df-4bcb-9d83-7c22cb91d446)\"" pod="kube-system/cilium-v8h2f" podUID=bb885613-e4df-4bcb-9d83-7c22cb91d446
and also:
I0828 21:01:39.695536 6170 prober.go:116] "Probe failed" probeType="Startup" pod="kube-system/cilium-xl2nv" podUID=5b8ad1c3-a715-4670-82e0-550585372083 containerName="cilium-agent" probeResult=failure output="Get \"http://127.0.0.1:9876/healthz\": dial tcp 127.0.0.1:9876: connect: connection refused"
Other cilium pods (the ones that were not on the updated master node) eventually get the following error messages:
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/synced/crd.go:131: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Unauthorized" subsys=k8s
[the line above repeats many more times]
level=warning msg="Network status error received, restarting client connections" error=Unauthorized subsys=k8s
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/synced/crd.go:131: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Unauthorized" subsys=k8s
level=info msg="Exiting due to signal" signal=terminated subsys=daemon
level=fatal msg="Error while creating daemon" error="context canceled" subsys=daemon
level=info msg="Waiting for all endpoints' go routines to be stopped." subsys=daemon
level=info msg="All endpoints' goroutines stopped." subsys=daemon
9. Anything else do we need to know?
I was following https://dev.to/olemarkus/irsa-support-for-kops-1doe and https://dev.to/olemarkus/zero-configuration-irsa-on-kops-1po1 to enable IRSA for self-hosted k8s clusters.
One difference is that we had cert-manager installed prior to trying to enable this, which is why spec.certManager has the managed: false config.
Also, kube-apiserver on the updated master node never comes up in the meantime.
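One way to see why kube-apiserver never comes up is to inspect it directly on the affected master (a hedged sketch; /var/log/kube-apiserver.log is the usual kops log location for the static pod, and crictl matches the containerd runtime in the spec above):
sudo crictl ps -a | grep kube-apiserver
sudo tail -n 50 /var/log/kube-apiserver.log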
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 2
- Comments: 17 (6 by maintainers)
The problem we faced was that after deploying pod-identity-webhooks at the same time as the OIDC changes, cilium pods won't start, as pod-identity-webhooks are not yet available since the masters were not rolled. You can exclude "cilium" in the mutating webhook related to pod-identity-webhooks (see the sketch after this comment), but you still need to roll the masters. We have found a non-disruptive two-step process: after kops update cluster --yes you will shortly see pod-identity-webhooks starting up, but it takes ~10 mins for them to become fully operational, after which IRSA should be working. Hope it helps someone who stumbles across the same issue.
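For reference, a sketch of how the cilium exclusion mentioned above might look, assuming the MutatingWebhookConfiguration is named pod-identity-webhook and that cilium pods carry the k8s-app=cilium label (both names are assumptions here, not confirmed by the thread):
# add an objectSelector so pods labeled k8s-app=cilium skip the webhook
kubectl patch mutatingwebhookconfiguration pod-identity-webhook --type=json \
  -p='[{"op": "add", "path": "/webhooks/0/objectSelector", "value": {"matchExpressions": [{"key": "k8s-app", "operator": "NotIn", "values": ["cilium"]}]}}]'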
hi @olemarkus
Is it possible to add a feature to include multiple --service-account-issuer values as an option for kops? That could be helpful and reduce the disruption during the adoption of IRSA.
https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#serviceaccount-token-volume-projection
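For context, upstream kube-apiserver already accepts this flag more than once: per the Kubernetes documentation, the first --service-account-issuer value is used to sign new tokens, while all listed issuers are accepted during validation, which is what makes a zero-downtime issuer migration possible. A sketch of the flag usage (issuer URLs are hypothetical):
# first value signs new tokens; both values are accepted for validation
kube-apiserver \
  --service-account-issuer=https://new-issuer.example.com \
  --service-account-issuer=https://old-issuer.example.com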