cilium: Scale-down event of EKS kube-apiserver causes network outage

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

AWS EKS scales its kube-apiserver(s) out and in. I notice that whenever a kube-apiserver is removed, the cilium-agent and cilium-operator continue to make API requests to the now-decommissioned kube-apiserver. What’s annoying is that this kube-apiserver does not seem to close existing connections; instead, it returns 401 Unauthorized to all requests. This could also be considered an EKS bug. The result is that cilium can no longer create endpoints or allocate ENIs and IPs, and that the CiliumLocalRedirectPolicy and DNS proxy seem to stop functioning, potentially causing a cluster-wide network outage until cilium connects to a working kube-apiserver.

I know that it is caused by a kube-apiserver being removed because I am monitoring the endpoints with kubectl -n default get endpoints kubernetes. Whenever an endpoint is removed, the problems start.
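For reference, the same monitoring can be done programmatically. The following is a minimal client-go sketch that watches the default/kubernetes Endpoints object and logs every change to the set of kube-apiserver addresses (the original monitoring was done with kubectl; the in-cluster config and the log format here are my own assumptions):

// watch_apiserver_endpoints.go: log changes to the default/kubernetes Endpoints
// object so kube-apiserver scale events can be correlated with cilium errors.
// Assumes it runs in-cluster with RBAC to watch Endpoints in "default".
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The "kubernetes" Endpoints object in "default" lists the kube-apiserver
	// instances the cluster currently exposes.
	w, err := client.CoreV1().Endpoints("default").Watch(context.Background(),
		metav1.ListOptions{FieldSelector: "metadata.name=kubernetes"})
	if err != nil {
		log.Fatal(err)
	}
	for ev := range w.ResultChan() {
		ep, ok := ev.Object.(*corev1.Endpoints)
		if !ok {
			continue
		}
		var addrs []string
		for _, ss := range ep.Subsets {
			for _, a := range ss.Addresses {
				addrs = append(addrs, a.IP)
			}
		}
		log.Printf("%s: kube-apiserver endpoints: %v", ev.Type, addrs)
	}
}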

Restarting the agent and the operator restores functionality. Without a restart, it takes about 5 minutes before cilium connects to a working endpoint.

From the operator:

cilium-operator-78d4dc7dbb-lv8q2 cilium-operator error retrieving resource lock kube-system/cilium-operator-resource-lock: Unauthorized
cilium-operator-78d4dc7dbb-lv8q2 cilium-operator level=error msg="error retrieving resource lock kube-system/cilium-operator-resource-lock: Unauthorized" subsys=klo

I don’t know exactly how to reproduce this. What I do know is that new EKS clusters see the most scaling events of the kube-apiserver. I reproduce it myself by creating a new cluster and then scheduling a lot of pods. This often leads to scaling events.
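A rough sketch of such a pod-scheduling load generator follows (client-go; the pod count, namespace, kubeconfig path, and pause image are arbitrary assumptions). Anything that creates a few thousand pods quickly should have the same effect:

// repro_load.go: create many small pods in a tight loop to generate enough
// API-server load that EKS scales its control plane (and later scales it back).
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Use the local kubeconfig (~/.kube/config); this runs from a workstation.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	for i := 0; i < 2000; i++ { // pod count chosen arbitrarily
		pod := &corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("load-%04d", i)},
			Spec: corev1.PodSpec{
				Containers: []corev1.Container{{
					Name:  "pause",
					Image: "registry.k8s.io/pause:3.9",
				}},
			},
		}
		if _, err := client.CoreV1().Pods("default").Create(
			context.Background(), pod, metav1.CreateOptions{}); err != nil {
			log.Printf("pod %d: %v", i, err)
		}
	}
}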

Our cilium is installed with the following helm values:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/os
          operator: In
          values:
          - linux
        - key: kubernetes.io/arch
          operator: In
          values:
          - amd64
        - key: eks.amazonaws.com/compute-type
          operator: NotIn
          values:
          - fargate
bpf:
  hostRouting: false
  masquerade: true
  tproxy: true
cni:
  chainingMode: none
  configMap: cni-configuration
  customConf: true
devices: eth0
enableCiliumEndpointSlice: true
endpointHealthChecking:
  enabled: false
eni:
  awsEnablePrefixDelegation: true
  ec2APIEndpoint: ec2.eu-west-1.amazonaws.com
  enabled: true
  instanceTagsFilter: aws:eks:cluster-name=ci-eks-4qrg9g
  updateEC2AdapterLimitViaAPI: true
healthChecking: false
hubble:
  enabled: true
  eventBufferCapacity: "8191"
  metrics:
    enabled: null
    serviceMonitor:
      enabled: true
  relay:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: k8s-app
              operator: In
              values:
              - hubble-relay
          topologyKey: kubernetes.io/hostname
    enabled: true
    podDisruptionBudget:
      enabled: true
      maxUnavailable: 1
    replicas: 3
    rollOutPods: true
  ui:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: k8s-app
              operator: In
              values:
              - hubble-ui
          topologyKey: kubernetes.io/hostname
    enabled: true
    ingress:
      annotations: {}
      enabled: true
      hosts:
      - hubble.example.com
    podDisruptionBudget:
      enabled: true
      maxUnavailable: 1
    replicas: 3
    rollOutPods: true
ipam:
  mode: eni
k8sServiceHost: C0F975EE951307D359FFC679CA1FDD1F.sk1.eu-west-1.eks.amazonaws.com
k8sServicePort: 443
kubeProxyReplacement: strict
l7Proxy: true
loadBalancer:
  serviceTopology: true
localRedirectPolicy: true
operator:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: io.cilium/app
            operator: In
            values:
            - operator
          - key: name
            operator: In
            values:
            - cilium-operator
        topologyKey: kubernetes.io/hostname
  extraArgs:
  - --unmanaged-pod-watcher-interval=0
  podDisruptionBudget:
    enabled: true
    maxUnavailable: 1
  priorityClassName: system-cluster-critical
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true
  replicas: 3
  rollOutPods: true
  tolerations:
  - key: node.cilium.io/agent-not-ready
    operator: Exists
  - key: node.kubernetes.io/not-ready
    operator: Exists
priorityClassName: system-node-critical
prometheus:
  enabled: true
  serviceMonitor:
    enabled: true
svcSourceRangeCheck: "false"
tunnel: disabled
updateStrategy:
  type: OnDelete

Cilium Version

❯ cilium version
cilium-cli: 0.12.1 compiled with go1.18.5 on linux/amd64
cilium image (default): v1.12.0
cilium image (stable): v1.12.0
cilium image (running): v1.12.0

Kernel Version

Several:

5.15.54-25.126.amzn2.x86_64
5.4.204-113.362.amzn2.x86_64

Kubernetes Version

Client Version: v1.24.3
Kustomize Version: v4.5.4
Server Version: v1.22.11-eks-18ef993

Sysdump

No response

Relevant log output

No response

Anything else?

No response

Code of Conduct

  • I agree to follow this project’s Code of Conduct

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 22
  • Comments: 38 (28 by maintainers)

Commits related to this issue

Most upvoted comments

Thanks to @mhulscher for providing us with a sysdump, we (@bimmlerd and I) finally have some understanding of what seems to be happening here. Unfortunately, the fix provided in #23377 is not enough to avoid the issue.

TL;DR: There seems to be a deadlock between IPCache and an endpoint lock. It appears to occur if there are L7 policies in place, some endpoint regeneration happens at an inopportune time, and the kube-apiserver endpoints change (which is why this can be triggered by an EKS scale-down event). Because multiple subsystems are affected by the deadlock, there is a wide range of symptoms: proxied DNS lookups might fail, endpoint regeneration is stuck, and eventually the l7proxy causes the DaemonSet’s liveness probe to fail after 5 minutes, triggering an agent restart. After the agent restart, the symptoms should go away.

Looking at the gops threaddump, there are two goroutines blocked on each other: the K8s endpoint watcher and an endpoint regeneration goroutine for the endpoint at 0x4000f4b880.

The deadlock happens as follows:

  1. The K8s endpoint watcher goroutine observes a change in the Kubernetes endpoints. This results in it updating the IPCache IP identity metadata labels for any IPs carrying the kube-apiserver label (handleKubeAPIServerServiceEPChanges). This calls into IPCache via IPCache.removeLabelsFromIPs, which holds on to the IPCache mutex. It then spawns a new child goroutine in EndpointManager.UpdatePolicyMaps. Notably, IPCache waits for this UpdatePolicyMaps child goroutine to return while still holding the IPCache mutex. The UpdatePolicyMaps goroutine iterates over all endpoints, attempting to lock them. Unfortunately, one of the endpoints in our stackdump, namely the one at addr 0x4000f4b880, cannot be locked, since its lock is already held by another goroutine.

  2. That other goroutine is the endpoint regeneration goroutine for ep 0x4000f4b880. We don’t yet know why regeneration was triggered (potentially due to an unrelated endpoint deletion?), but we can see that regeneration is in progress. The regeneration calls Endpoint.runPreCompilationSteps, which holds on to the Endpoint 0x4000f4b880 mutex. It eventually calls into Proxy.CreateOrUpdateRedirect, holds on to the proxy lock (which is why the l7 probe eventually fails), and calls into DNSProxy.GetRules. DNSProxy.GetRules then needs to obtain the identities of certain IPs, which it does by calling IPCache.LookupByIdentity. This is where the deadlock happens: IPCache.LookupByIdentity attempts to take the IPCache mutex, but that one is already held by the “K8s endpoint watcher” goroutine, which in turn is waiting on us to release the lock for endpoint 0x4000f4b880. A classic deadlock; the lock ordering is sketched right below this list.
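The lock ordering reduces to the following standalone sketch (not Cilium code: plain sync primitives stand in for the IPCache and endpoint locks, and the sleeps only force the unlucky interleaving). Running it ends with Go’s "all goroutines are asleep - deadlock!" fatal error:

package main

import (
	"sync"
	"time"
)

var (
	ipcacheMu  sync.RWMutex // stand-in for the IPCache mutex
	endpointMu sync.Mutex   // stand-in for the Endpoint 0x4000f4b880 lock
)

// "K8s endpoint watcher": removeLabelsFromIPs -> UpdatePolicyMaps
func watcher() {
	ipcacheMu.Lock() // take the IPCache mutex...
	defer ipcacheMu.Unlock()

	time.Sleep(10 * time.Millisecond) // let regenerate() grab the endpoint lock

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // child goroutine spawned by UpdatePolicyMaps
		defer wg.Done()
		endpointMu.Lock() // blocks: held by regenerate()
		endpointMu.Unlock()
	}()
	wg.Wait() // ...and wait for the child while still holding the IPCache mutex
}

// "endpoint regeneration": runPreCompilationSteps -> GetRules -> LookupByIdentity
func regenerate() {
	time.Sleep(5 * time.Millisecond) // let watcher() grab the IPCache mutex first

	endpointMu.Lock() // take the endpoint lock...
	defer endpointMu.Unlock()

	ipcacheMu.RLock() // ...then block: the IPCache mutex is held by watcher()
	ipcacheMu.RUnlock()
}

func main() {
	go watcher()
	go regenerate()
	select {} // every goroutine is now blocked; the runtime reports the deadlock
}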

Unfortunately, we don’t yet know how to fix this. The newer code here is IPCache.removeLabelsFromIPs, so it might be possible to fix it there somehow. We’ll continue to work on a solution.

Stacktrace for the first goroutine ("the K8s endpoint watcher"):
goroutine 24422 [semacquire, 2 minutes]:
sync.runtime_SemacquireMutex(0x4000cd2d08?, 0xc0?, 0x4000cd2d28?)
	/usr/local/go/src/runtime/sema.go:71 +0x28
sync.(*Mutex).lockSlow(0x4000f4b8d0)
	/usr/local/go/src/sync/mutex.go:162 +0x180
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
sync.(*RWMutex).Lock(0x4000f4b8d0?)
	/usr/local/go/src/sync/rwmutex.go:139 +0x7c
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).lockAlive(0x4000f4b880)
	/go/src/github.com/cilium/cilium/pkg/endpoint/lock.go:14 +0x34
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).ApplyPolicyMapChanges(0x4000f4b880, 0x77d0c?)
	/go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:1165 +0x34
github.com/cilium/cilium/pkg/endpointmanager.(*EndpointManager).UpdatePolicyMaps.func2(0x4000965500?)
	/go/src/github.com/cilium/cilium/pkg/endpointmanager/manager.go:171 +0x50
created by github.com/cilium/cilium/pkg/endpointmanager.(*EndpointManager).UpdatePolicyMaps
	/go/src/github.com/cilium/cilium/pkg/endpointmanager/manager.go:168 +0x158
	
	|| vv connected via goroutine spawn ^^ ||

goroutine 249 [semacquire, 2 minutes]:
sync.runtime_Semacquire(0x4002c56130?)
	/usr/local/go/src/runtime/sema.go:56 +0x2c
sync.(*WaitGroup).Wait(0x4002c56130)
	/usr/local/go/src/sync/waitgroup.go:136 +0x88
github.com/cilium/cilium/pkg/ipcache.(*IPCache).UpdatePolicyMaps(0x40003a3e80, {0x2e9a3c0, 0x400018e008}, 0xb?, 0x40040c3170)
	/go/src/github.com/cilium/cilium/pkg/ipcache/metadata.go:235 +0xcc
github.com/cilium/cilium/pkg/ipcache.(*IPCache).removeLabelsFromIPs(0x40003a3e80, 0x4001c87758?, {0x295c64b, 0xf})
	/go/src/github.com/cilium/cilium/pkg/ipcache/metadata.go:414 +0x618
github.com/cilium/cilium/pkg/ipcache.(*IPCache).RemoveLabelsExcluded(0x40003a3e80, 0x40004a1c50, 0x4001c878b8?, {0x295c64b, 0xf})
	/go/src/github.com/cilium/cilium/pkg/ipcache/metadata.go:328 +0x194
github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).handleKubeAPIServerServiceEPChanges(0x40001a7400, 0x4001c87988?)
	/go/src/github.com/cilium/cilium/pkg/k8s/watchers/endpoint.go:135 +0x50
github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).addKubeAPIServerServiceEPSliceV1(0x4000b76300?, 0x40050b10a0)
	/go/src/github.com/cilium/cilium/pkg/k8s/watchers/endpoint_slice.go:205 +0x3c8
github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).updateK8sEndpointSliceV1(0x40001a7400, 0x40050b10a0?, 0x40050b10a0?)
	/go/src/github.com/cilium/cilium/pkg/k8s/watchers/endpoint_slice.go:178 +0x6c
github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).endpointSlicesInit.func2({0x28ee5c0?, 0x4000d43960?}, {0x28ee5c0, 0x40050b10a0})
	/go/src/github.com/cilium/cilium/pkg/k8s/watchers/endpoint_slice.go:71 +0xd8
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
	/go/src/github.com/cilium/cilium/vendor/k8s.io/client-go/tools/cache/controller.go:239
github.com/cilium/cilium/pkg/k8s/informer.NewInformerWithStore.func1({0x2475e40?, 0x4000a51050?})
	/go/src/github.com/cilium/cilium/pkg/k8s/informer/informer.go:118 +0x2b0
k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop(0x4000b265a0, 0x4000354700)
	/go/src/github.com/cilium/cilium/vendor/k8s.io/client-go/tools/cache/delta_fifo.go:554 +0x50c
k8s.io/client-go/tools/cache.(*controller).processLoop(0x40004e3680)
	/go/src/github.com/cilium/cilium/vendor/k8s.io/client-go/tools/cache/controller.go:184 +0x40
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x45b8?)
	/go/src/github.com/cilium/cilium/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x44
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x1?, {0x2e6fc60, 0x4000ad4630}, 0x1, 0x400012fd40)
	/go/src/github.com/cilium/cilium/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0x94
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x40004e36e8?, 0x3b9aca00, 0x0, 0xa0?, 0xffff89776e68?)
	/go/src/github.com/cilium/cilium/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x84
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/src/github.com/cilium/cilium/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*controller).Run(0x40004e3680, 0x400012fd40)
	/go/src/github.com/cilium/cilium/vendor/k8s.io/client-go/tools/cache/controller.go:155 +0x28c
created by github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).endpointSlicesInit
	/go/src/github.com/cilium/cilium/pkg/k8s/watchers/endpoint_slice.go:156 +0x63c
Stacktrace for the second goroutine ("the ep 0x4000f4b880 regeneration goroutine"):
goroutine 2529 [select, 2 minutes]:
golang.org/x/sync/semaphore.(*Weighted).Acquire(0x4000476190, {0x2e9a3c0, 0x400018e000}, 0x1)
	/go/src/github.com/cilium/cilium/vendor/golang.org/x/sync/semaphore/semaphore.go:60 +0x410
github.com/cilium/cilium/pkg/lock.(*SemaphoredMutex).RLock(...)
	/go/src/github.com/cilium/cilium/pkg/lock/semaphored_mutex.go:46
github.com/cilium/cilium/pkg/ipcache.(*IPCache).LookupByIdentity(0x40003a3e80, 0x1fcf910?)
	/go/src/github.com/cilium/cilium/pkg/ipcache/ipcache.go:639 +0x64
github.com/cilium/cilium/daemon/cmd.(*Daemon).lookupIPsBySecID(0x400133f7a8?, 0x1fcf7f0?)
	/go/src/github.com/cilium/cilium/daemon/cmd/fqdn.go:404 +0x24
github.com/cilium/cilium/pkg/fqdn/dnsproxy.(*DNSProxy).GetRules(0x4000779a40, 0x723)
	/go/src/github.com/cilium/cilium/pkg/fqdn/dnsproxy/proxy.go:200 +0x35c
github.com/cilium/cilium/pkg/proxy.(*dnsRedirect).setRules(0x4000125400, 0x0?, 0x400334af30)
	/go/src/github.com/cilium/cilium/pkg/proxy/dns.go:43 +0x1c4
github.com/cilium/cilium/pkg/proxy.(*dnsRedirect).UpdateRules(0x4000125400, 0x0?)
	/go/src/github.com/cilium/cilium/pkg/proxy/dns.go:58 +0x34
github.com/cilium/cilium/pkg/proxy.(*Proxy).CreateOrUpdateRedirect(0x40001a1d40, {0x2e9a388?, 0x4001549800}, {0x2e9b380, 0x40047fb580}, {0x4003897890, 0x12}, {0x2eaf9c8, 0x4000f4b880}, 0x40036e93c0)
	/go/src/github.com/cilium/cilium/pkg/proxy/proxy.go:504 +0x674
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).addNewRedirectsFromDesiredPolicy(0x4000f4b880, 0x0, 0x2935fbb?, 0x6?)
	/go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:253 +0x238
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).addNewRedirects(0x4000f4b880, 0x1ca98?)
	/go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:458 +0x18c
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).runPreCompilationSteps(0x4000f4b880, 0x4002ec4c00)
	/go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:828 +0x384
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerateBPF(0x4000f4b880, 0x4002ec4c00)
	/go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:579 +0x14c
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerate(0x4000f4b880, 0x4002ec4c00)
	/go/src/github.com/cilium/cilium/pkg/endpoint/policy.go:398 +0x5fc
github.com/cilium/cilium/pkg/endpoint.(*EndpointRegenerationEvent).Handle(0x4002ca7580, 0x15548?)
	/go/src/github.com/cilium/cilium/pkg/endpoint/events.go:53 +0x2b0
github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).run.func1()
	/go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:245 +0x10c
sync.(*Once).doSlow(0x2ea3e98?, 0x4000f09500?)
	/usr/local/go/src/sync/once.go:68 +0x10c
sync.(*Once).Do(...)
	/usr/local/go/src/sync/once.go:59
github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).run(0x2ea3e98?)
	/go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:233 +0x4c
created by github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).Run
	/go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:229 +0x80

I am eagerly awaiting the new patch release. If I encounter the problem again I’ll try and make a sysdump. I’ll reach out on slack to see if I can share it privately.

Currently without a reliable set of instructions to reproduce, it’s a bit hard to investigate or confirm what the issue is.

@joestringer: this is not hard to reproduce! Turn on an EKS cluster, install Cilium v1.12, and then put sufficient load on the API server that EKS scales it up. The last part is the only part that is somewhat nondeterministic, but I have to imagine that creating a few thousand pods in a tight loop would do the trick every time.

This was described by the original issue poster here:

I reproduce it myself by creating a new cluster and then scheduling a lot of pods. This often leads to scaling events.

Do you do production testing of Cilium on EKS?

While trying to reproduce this on 1.11.8, I found that cilium also received 401 Unauthorized from removed kube-apiservers. However, this did not result in cilium removing identities/endpoints/etc. and thus did not cause a network outage.

For now we are going to rollback to 1.11.8. I will try and prepare a sysdump of 1.12 while the issue presents itself.

Just to re-confirm: I did not experience this issue w/ cilium 1.10 or 1.11.

@bimmlerd Of course I want to provide some more information. A tl;dr upfront though: it seems that we are not affected by this.

Longer version: We experienced communication problems for workloads as described by @mhulscher in the initial description. This was on EKS 1.23 clusters running 1.10.10 and 1.10.16. The error messages from the operator corresponded to those in the initial issue description. We also saw issues similar to those mentioned in https://github.com/aws/containers-roadmap/issues/1810. Hence me adding myself to this thread before being able to look closer into it; I just didn’t want to lose track of it here.

As it now turns out, after looking into this more yesterday, we see that we get these problems only while enabling/associating an OIDC identity provider for the cluster. After the enabling process has finished, the error messages disappear and network connectivity is restored. At least based on a few initial tests yesterday.

Sorry for causing confusion.

The issue is not present in at least 1.11.{7,8,9}

@recollir could you be a bit more specific about what you saw on 1.10.16? So far we’ve been working under the assumption that the problematic behaviour (i.e. cilium removing identities/endpoints/etc, as pointed out in https://github.com/cilium/cilium/issues/20915#issuecomment-1219363906) was introduced in 1.12.


In general, it would be helpful if affected parties could tell us which symptoms they are experiencing; we are currently differentiating between the following:

  1. Scale events of the EKS control plane leading to prolonged (~5-10 min) error logging around 401 Unauthorized. This happens because EKS adds new endpoints for its kube-apiserver before it removes the old ones, but the cilium-operator keeps using its existing connections to the now-stale endpoints. Once the old endpoints are gone, those connections fail and the operator starts working again, using the new API servers.
  2. Removal of identities/endpoints/… (as reported by @mhulscher) as a result of being unable to communicate with the kube-apiserver. This is problematic behaviour and causes outages. So far we believe this was introduced in 1.12, as it has been reported that 1.11 does not exhibit it, and we are investigating why it occurs.

As a note, symptom 1 is mitigated by https://github.com/cilium/cilium/pull/22393.
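For illustration only (this is not the change in the linked PR; its details are not reproduced here): one generic way to recover from symptom 1 is to notice that an apiserver keeps answering 401 Unauthorized on an established connection and then force a reconnect, e.g. by exiting so the pod is restarted against a healthy endpoint. A rough sketch using a wrapped client transport (rest.Config.WrapTransport); the package name and threshold are arbitrary assumptions:

// Package k8sguard is a hypothetical helper, not part of Cilium.
package k8sguard

import (
	"net/http"
	"os"
	"sync/atomic"
)

// UnauthorizedGuard counts consecutive 401 responses from the apiserver and
// bails out once a threshold is reached, so a fresh process (and thus a fresh
// connection to a live kube-apiserver) takes over.
type UnauthorizedGuard struct {
	consecutive int64 // updated atomically
	Threshold   int64
	Next        http.RoundTripper
}

func (g *UnauthorizedGuard) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := g.Next.RoundTrip(req)
	if err == nil && resp.StatusCode == http.StatusUnauthorized {
		if atomic.AddInt64(&g.consecutive, 1) >= g.Threshold {
			os.Exit(1) // crude recovery: let the kubelet restart the container
		}
	} else if err == nil {
		atomic.StoreInt64(&g.consecutive, 0) // any non-401 resets the counter
	}
	return resp, err
}

// Wiring it up (assumption): restConfig.WrapTransport = func(rt http.RoundTripper) http.RoundTripper {
//     return &UnauthorizedGuard{Threshold: 10, Next: rt}
// }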

@gandro perhaps you could consider creating an EKS 1.23 cluster with cilium 1.12 and then upgrading EKS to 1.24. This will cause the kube-apiservers to be replaced.