cilium: Scale-down event of EKS kube-apiserver causes network outage
Is there an existing issue for this?
- I have searched the existing issues
What happened?
AWS EKS scales its kube-apiservers out and in. I notice that whenever a kube-apiserver is removed, the cilium-agent and cilium-operator continue to make API requests to the now decommissioned kube-apiserver. What makes this worse is that this kube-apiserver does not seem to close existing connections; instead, it returns 401 Unauthorized to all requests. This could also be considered an EKS bug. The result is that Cilium can no longer create endpoints or allocate ENIs and IPs, and the CiliumLocalRedirectPolicy and DNS proxy seem to stop functioning, potentially causing a cluster-wide network outage until Cilium connects to a working kube-apiserver.
I know that it is caused by a kube-apiserver being removed because I am monitoring the endpoints with kubectl -n default get endpoints kubernetes. Whenever an endpoint is removed, the problems start.
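For reference, a minimal client-go sketch of the same check (not from the issue; the kubeconfig path, namespace handling, and output format are my own assumptions). It watches the default/kubernetes Endpoints object, i.e. the same data as kubectl -n default get endpoints kubernetes:

package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig (assumes ~/.kube/config points at the EKS cluster).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Watch the "kubernetes" Endpoints object in the default namespace; its
	// addresses are the currently active kube-apiserver instances.
	w, err := cs.CoreV1().Endpoints("default").Watch(context.Background(), metav1.ListOptions{
		FieldSelector: "metadata.name=kubernetes",
	})
	if err != nil {
		log.Fatal(err)
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		ep, ok := ev.Object.(*corev1.Endpoints)
		if !ok {
			continue
		}
		var ips []string
		for _, subset := range ep.Subsets {
			for _, addr := range subset.Addresses {
				ips = append(ips, addr.IP)
			}
		}
		// A shrinking address list here corresponds to the scale-in events
		// that trigger the problems described above.
		fmt.Printf("%s: kube-apiserver endpoints: %v\n", ev.Type, ips)
	}
}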
Restarting the agent and the operator restores functionality. Without a restart it takes about 5 minutes before cilium connects to a working endpoint.
From the operator:
cilium-operator-78d4dc7dbb-lv8q2 cilium-operator error retrieving resource lock kube-system/cilium-operator-resource-lock: Unauthorized
cilium-operator-78d4dc7dbb-lv8q2 cilium-operator level=error msg="error retrieving resource lock kube-system/cilium-operator-resource-lock: Unauthorized" subsys=klo
I don’t know exactly how to reproduce this. What I do know is that new EKS clusters see the most scaling events of the kube-apiserver. I reproduce it myself by creating a new cluster and then scheduling a lot of pods. This often leads to scaling events.
Our cilium is installed with the following helm values:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/os
          operator: In
          values:
          - linux
        - key: kubernetes.io/arch
          operator: In
          values:
          - amd64
        - key: eks.amazonaws.com/compute-type
          operator: NotIn
          values:
          - fargate
bpf:
  hostRouting: false
  masquerade: true
  tproxy: true
cni:
  chainingMode: none
  configMap: cni-configuration
  customConf: true
devices: eth0
enableCiliumEndpointSlice: true
endpointHealthChecking:
  enabled: false
eni:
  awsEnablePrefixDelegation: true
  ec2APIEndpoint: ec2.eu-west-1.amazonaws.com
  enabled: true
  instanceTagsFilter: aws:eks:cluster-name=ci-eks-4qrg9g
  updateEC2AdapterLimitViaAPI: true
healthChecking: false
hubble:
  enabled: true
  eventBufferCapacity: "8191"
  metrics:
    enabled: null
    serviceMonitor:
      enabled: true
  relay:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: k8s-app
              operator: In
              values:
              - hubble-relay
          topologyKey: kubernetes.io/hostname
    enabled: true
    podDisruptionBudget:
      enabled: true
      maxUnavailable: 1
    replicas: 3
    rollOutPods: true
  ui:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: k8s-app
              operator: In
              values:
              - hubble-ui
          topologyKey: kubernetes.io/hostname
    enabled: true
    ingress:
      annotations: {}
      enabled: true
      hosts:
      - hubble.example.com
    podDisruptionBudget:
      enabled: true
      maxUnavailable: 1
    replicas: 3
    rollOutPods: true
ipam:
  mode: eni
k8sServiceHost: C0F975EE951307D359FFC679CA1FDD1F.sk1.eu-west-1.eks.amazonaws.com
k8sServicePort: 443
kubeProxyReplacement: strict
l7Proxy: true
loadBalancer:
  serviceTopology: true
localRedirectPolicy: true
operator:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: io.cilium/app
            operator: In
            values:
            - operator
          - key: name
            operator: In
            values:
            - cilium-operator
        topologyKey: kubernetes.io/hostname
  extraArgs:
  - --unmanaged-pod-watcher-interval=0
  podDisruptionBudget:
    enabled: true
    maxUnavailable: 1
  priorityClassName: system-cluster-critical
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true
  replicas: 3
  rollOutPods: true
  tolerations:
  - key: node.cilium.io/agent-not-ready
    operator: Exists
  - key: node.kubernetes.io/not-ready
    operator: Exists
priorityClassName: system-node-critical
prometheus:
  enabled: true
  serviceMonitor:
    enabled: true
svcSourceRangeCheck: "false"
tunnel: disabled
updateStrategy:
  type: OnDelete
Cilium Version
❯ cilium version
cilium-cli: 0.12.1 compiled with go1.18.5 on linux/amd64
cilium image (default): v1.12.0
cilium image (stable): v1.12.0
cilium image (running): v1.12.0
Kernel Version
Several:
5.15.54-25.126.amzn2.x86_64
5.4.204-113.362.amzn2.x86_64
Kubernetes Version
Client Version: v1.24.3
Kustomize Version: v4.5.4
Server Version: v1.22.11-eks-18ef993
Sysdump
No response
Relevant log output
No response
Anything else?
No response
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- State: closed
- Created 2 years ago
- Reactions: 22
- Comments: 38 (28 by maintainers)
Commits related to this issue
- k8s: don't consider 4xx a successful interaction While a 404 Not Found or a 409 Conflict can be considered successful interactions with the k8s API, a blanket accept for all 4xx codes is problematic.... — committed to bimmlerd/cilium by bimmlerd 2 years ago
- k8s: don't consider 4xx a successful interaction While a 404 Not Found or a 409 Conflict can be considered successful interactions with the k8s API, a blanket accept for all 4xx codes is problematic.... — committed to cilium/cilium by bimmlerd 2 years ago
- k8s: don't consider 4xx a successful interaction [ upstream commit ffef1a85efe7f472b4d8f210cfd35e292d98be4a ] While a 404 Not Found or a 409 Conflict can be considered successful interactions with t... — committed to bimmlerd/cilium by bimmlerd 2 years ago
- feat: deleted the pods that are not unmanaged by Cilium Set operator to remove the label of a pod that existed before the node taint 1. Delete the specified label pod according to the parameter --po... — committed to lvyanru8200/cilium by lvyanru8200 2 years ago
- k8s: don't consider 4xx a successful interaction [ upstream commit ffef1a85efe7f472b4d8f210cfd35e292d98be4a ] While a 404 Not Found or a 409 Conflict can be considered successful interactions with t... — committed to pippolo84/cilium by bimmlerd 2 years ago
- k8s: don't consider 4xx a successful interaction [ upstream commit ffef1a85efe7f472b4d8f210cfd35e292d98be4a ] While a 404 Not Found or a 409 Conflict can be considered successful interactions with t... — committed to gandro/cilium by bimmlerd 2 years ago
- k8s: don't consider 4xx a successful interaction [ upstream commit ffef1a85efe7f472b4d8f210cfd35e292d98be4a ] While a 404 Not Found or a 409 Conflict can be considered successful interactions with t... — committed to cilium/cilium by bimmlerd 2 years ago
- k8s: don't consider 4xx a successful interaction [ upstream commit ffef1a85efe7f472b4d8f210cfd35e292d98be4a ] While a 404 Not Found or a 409 Conflict can be considered successful interactions with t... — committed to cilium/cilium by bimmlerd 2 years ago
- k8s: don't consider 4xx a successful interaction [ upstream commit ffef1a85efe7f472b4d8f210cfd35e292d98be4a ] While a 404 Not Found or a 409 Conflict can be considered successful interactions with t... — committed to cilium/cilium by bimmlerd 2 years ago
- endpoint, proxy: Fix deadlock with ipcache and K8s watcher See Sebastian's explanation for context. [1] To fix, we move the fetch of the DNS rules, which includes an ipcache lookup outside of the en... — committed to christarazi/cilium by christarazi a year ago
- endpoint, proxy: Fix deadlock with ipcache and K8s watcher See Sebastian's explanation for context. [1] To fix, we move the fetch of the DNS rules, which includes an ipcache lookup outside of the en... — committed to cilium/cilium by christarazi a year ago
Thanks to @mhulscher providing us with a sysdump, we (@bimmlerd and I) finally have some understanding as to what seems to be happening here. Unfortunately, the fix provided in #23377 is not enough to avoid the issue.
TL;DR: There seems to be a deadlock between IPCache and an endpoint lock. It appears to occur when L7 policies are in place, some endpoint regeneration happens at an inopportune time, and the kube-apiserver endpoints change (which is why this can be triggered by an EKS scale-down event). Because multiple subsystems are affected by the deadlock, there is a wide range of symptoms: proxied DNS lookups might fail, endpoint regeneration gets stuck, and eventually the l7proxy causes the DaemonSet’s liveness probe to fail after 5 minutes, which restarts the agent. After the agent restart, the symptoms should go away.
Looking at the gops threaddump, there are two goroutines blocked on each other: the K8s endpoint watcher and an endpoint regeneration goroutine for the endpoint 0x4000f4b880. The deadlock happens as follows:
1. The K8s endpoint watcher goroutine observes a change in the kubernetes endpoints. This results in it updating the IPCache IP identity metadata labels for any IPs with the kube-apiserver label (handleKubeAPIServerServiceEPChanges). This calls into IPCache via IPCache.removeLabelsFromIPs, which in turn takes and holds the IPCache mutex. It then spawns a new child goroutine in EndpointManager.UpdatePolicyMaps. Notably, IPCache waits for this UpdatePolicyMaps child goroutine to return while still holding the IPCache mutex. The UpdatePolicyMaps goroutine iterates over all endpoints, attempting to lock them. Unfortunately, one of the endpoints in our stackdump, namely the one at address 0x4000f4b880, cannot be locked, since its lock seems to be in use by another goroutine.
2. That other goroutine is the endpoint regeneration goroutine for endpoint 0x4000f4b880. We don’t know yet why regeneration was triggered (potentially due to an unrelated endpoint deletion?), but we see that regeneration is in progress. The regeneration calls Endpoint.runPreCompilationSteps, which holds the endpoint 0x4000f4b880 mutex. It eventually calls into Proxy.CreateOrUpdateRedirect, takes the proxy lock (which is why the l7 probe eventually fails), and calls into DNSProxy.GetRules. The DNSProxy.GetRules function then needs to obtain the identities of certain IPs, which it does by calling into IPCache.LookupByIdentity. This is where the deadlock happens: IPCache.LookupByIdentity attempts to take the IPCache mutex, but that one is already held by the “K8s endpoint watcher” goroutine, which is waiting on us to release the lock for endpoint 0x4000f4b880. A classic deadlock.
Unfortunately, we don’t know yet how to fix this. The newer code here is IPCache.removeLabelsFromIPs, so it might be possible to fix it there somehow. We’ll continue to work on a solution.
Stacktrace for the first goroutine ("the K8s endpoint watcher"):
Stacktrace for the second goroutine ("the endpoint 0x4000f4b880 regeneration goroutine"):
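To make the lock ordering concrete, here is a minimal stand-alone sketch of the pattern described above. The ipcacheMu/endpointMu mutexes and the two functions are hypothetical stand-ins for the Cilium internals named in the analysis, not actual Cilium code:

package main

import "sync"

var (
	ipcacheMu  sync.Mutex // stand-in for the IPCache mutex
	endpointMu sync.Mutex // stand-in for the Endpoint 0x4000f4b880 mutex
)

// watcherPath models the K8s endpoint watcher goroutine:
// removeLabelsFromIPs holds the IPCache mutex and waits for an
// UpdatePolicyMaps child goroutine, which needs the endpoint lock.
func watcherPath() {
	ipcacheMu.Lock()
	defer ipcacheMu.Unlock()

	done := make(chan struct{})
	go func() {
		endpointMu.Lock() // blocks while the regeneration path holds the endpoint lock
		endpointMu.Unlock()
		close(done)
	}()
	<-done // waits here while still holding the IPCache mutex
}

// regenerationPath models the endpoint regeneration goroutine:
// runPreCompilationSteps holds the endpoint lock, and the DNS proxy
// rules lookup eventually needs the IPCache mutex.
func regenerationPath() {
	endpointMu.Lock()
	defer endpointMu.Unlock()

	ipcacheMu.Lock() // blocks while the watcher path holds the IPCache mutex
	ipcacheMu.Unlock()
}

func main() {
	go watcherPath()
	go regenerationPath()
	// When the two paths interleave (each grabs its first lock before the
	// other releases), neither can make progress; the Go runtime eventually
	// reports "all goroutines are asleep - deadlock!".
	select {}
}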
I am eagerly awaiting the new patch release. If I encounter the problem again I’ll try and make a sysdump. I’ll reach out on slack to see if I can share it privately.
@joestringer: this is not hard to reproduce! Turn on an EKS cluster, install Cilium v1.12, and then put sufficient load on the API server that EKS scales it up. The last part is the only part that is somewhat nondeterministic, but I have to imagine that creating a few thousand pods in a tight loop would do the trick every time.
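For illustration, a minimal client-go sketch of that kind of load generator, i.e. creating a few thousand pods in a tight loop (the pod count, namespace, name prefix, and image are arbitrary choices of mine, not from this thread):

package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Create pause pods in a tight loop to push kube-apiserver load high
	// enough that EKS scales the control plane out (and later back in).
	for i := 0; i < 3000; i++ {
		pod := &corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{GenerateName: "apiserver-load-"},
			Spec: corev1.PodSpec{
				Containers: []corev1.Container{{
					Name:  "pause",
					Image: "registry.k8s.io/pause:3.9",
				}},
			},
		}
		if _, err := cs.CoreV1().Pods("default").Create(context.Background(), pod, metav1.CreateOptions{}); err != nil {
			log.Printf("create %d: %v", i, err)
		}
		if i%100 == 0 {
			fmt.Printf("created %d pods\n", i)
		}
	}
}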
This was described by the original issue poster here:
Do you do production testing of Cilium on EKS?
While trying to reproduce this on 1.11.8 I found that cilium also received 401 Unauthorized from removed kube-apiservers. However, it doesn’t result in cilium removing identities/endpoints/etc and thus doesn’t cause a network outage.
For now we are going to rollback to 1.11.8. I will try and prepare a sysdump of 1.12 while the issue presents itself.
Just to re-confirm: I did not experience this issue w/ cilium 1.10 or 1.11.
@bimmlerd Of course I want to provide some more information. A tl;dr upfront though: it seems that we are not affected by this.
Longer version: We experienced communication problems for workloads as described by @mhulscher in the initial description. This was on EKS 1.23 clusters running 1.10.10 and 1.10.16. The error messages from the operator corresponded to the error messages from the initial issue description. We also saw issues similar to those mentioned in https://github.com/aws/containers-roadmap/issues/1810. Hence me adding myself to this thread before being able to look closer into it; I just didn’t want to lose track of it here.
As it now turns out, after looking into this more yesterday, we see that we get these problems only while enabling/associating an OIDC identity provider for the cluster. After the enabling process has finished, the error messages disappear and network connectivity is restored. At least, that is what a few initial tests yesterday suggest.
Sorry for causing confusion.
The issue is not present in at least 1.11.{7,8,9}
@recollir could you be a bit more specific in terms of what you saw on 1.10.16? So far we’ve been working with the assumption that problematic behaviour (i.e. cilium removing identities/endpoints/etc, as pointed out in https://github.com/cilium/cilium/issues/20915#issuecomment-1219363906) has been introduced in 1.12.
In general, it would be helpful if affected parties could tell us which symptoms they are experiencing. We are currently differentiating between the following:
As a note, 1. is mitigated by https://github.com/cilium/cilium/pull/22393.
@gandro perhaps you could consider creating an EKS 1.23 cluster w/ cilium 1.12, then upgrading EKS to 1.24. This will cause the kube-apiservers to be replaced.