cilium: Misbehaving L7Proxy makes Cilium API Unable to create new endpoint
Bug report
General Information
- Cilium version: Client: 1.10.0 952d9d3 2021-05-19T18:42:32+02:00 go version go1.16.4 linux/amd64 Daemon: 1.10.0 952d9d3 2021-05-19T18:42:32+02:00 go version go1.16.4 linux/amd64
- Kernel version (run uname -a): Linux itsedcll-bb4070 5.10.0-1026-oem #27-Ubuntu SMP Thu May 6 07:19:13 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- Orchestration system version in use (e.g. kubectl version, …): Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.6", GitCommit:"8a62859e515889f07e3e3be6a1080413f17cf2c3", GitTreeState:"clean", BuildDate:"2021-04-15T03:28:42Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"} Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.6", GitCommit:"8a62859e515889f07e3e3be6a1080413f17cf2c3", GitTreeState:"clean", BuildDate:"2021-04-15T03:19:55Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}
- Link to relevant artifacts (policies, deployments scripts, …):
helm upgrade -i cilium cilium/cilium \
--version 1.10.0 \
--namespace kube-system \
--set kubeProxyReplacement=strict \
--set k8sServiceHost=k8s-api.dcll.cnp.ikeadt.com \
--set k8sServicePort=6443 \
--set localRedirectPolicy=true \
--set prometheus.enabled=true \
--set operator.prometheus.enabled=true \
--set hubble.enabled=true \
--set hubble.metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,http}" \
--set hubble.relay.enabled=true \
--set hubble.ui.enabled=true \
--set containerRuntime.integration=containerd \
--set tunnel=disabled \
--set ipam.mode=kubernetes \
--set nativeRoutingCIDR=10.62.128.0/19 \
--set devices=client0 \
--set endpointRoutes.enabled=true
- Generate and upload a system zip: I have run it and tried to upload it to this issue but without any luck. Maybe it is too large…
How to reproduce the issue
- Install cilium with the above helm command
- run cilium connectivity test && kubectl delete ns cilium-test a few times; after a few runs the Cilium API becomes unavailable, and pods fail to be scheduled because of Cilium API timeouts:
Warning FailedCreatePodSandBox 46m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "2788700c30f493be1d6c729fc54d79700dbe48e7e1160a046710c9479b248a6a": Unable to create endpoint: Cilium API client timeout exceeded
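For reference, a minimal reproduction loop for the steps above could look like the following sketch (the cilium-test namespace name assumes the cilium-cli default, and the iteration count is arbitrary):
for i in $(seq 1 10); do
  cilium connectivity test
  kubectl delete ns cilium-test --wait=true
done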
At this point in time I am able to curl 2 of the 6 cilium-agent endpoints:
curl http://10.62.64.21:9095 -i
HTTP/1.1 404 Not Found
date: Mon, 31 May 2021 11:54:34 GMT
server: envoy
content-length: 0
On the working nodes I cannot curl the API:
curl http://10.62.64.23:9095 -i
curl: (7) Failed to connect to 10.62.64.23 port 9095: Connection refused
If I delete the pods on the broken nodes, they come back online for a little while and then they become healthy again.
The cilium status reports all healthy.
The same goes if I run cilium status in the individual pods: they also report healthy.
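To repeat that per-pod check in one go, something like the following sketch works, assuming the agent pods carry the default k8s-app=cilium label and the container is named cilium-agent:
for pod in $(kubectl -n kube-system get pods -l k8s-app=cilium -o name); do
  echo "== ${pod}"
  kubectl -n kube-system exec "${pod}" -c cilium-agent -- cilium status --brief
done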
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 51 (48 by maintainers)
Had another look at this. My current theory is that we are stuck in: https://github.com/cilium/cilium/blob/952d9d33740fc77e1a94d5a52b38d2a313e5c570/pkg/endpoint/bpf.go#L671
@knfoo one way to verify this would be to enable debugging output (there is a -D option in the agent, a debug.enabled value that can be set in helm: --set debug.enabled=true, and for cilium-cli a config map option called debug: true). With debugging set, we should be able to see the following messages: https://github.com/cilium/cilium/blob/952d9d33740fc77e1a94d5a52b38d2a313e5c570/pkg/endpoint/endpoint.go#L427-L432
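Once debugging is enabled through one of the options above, it can be confirmed per agent from inside the pod; a sketch, with <cilium-pod> as a placeholder for an agent pod name:
kubectl -n kube-system exec <cilium-pod> -c cilium-agent -- cilium config | grep -i debug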
In the sysdump, if we track the creation of endpoint 2736, it starts with:
And the final message for this endpoint is:
Generated from: https://github.com/cilium/cilium/blob/952d9d33740fc77e1a94d5a52b38d2a313e5c570/pkg/endpoint/bpf.go#L729
which means that realizeBPFState() will return and reach waitForProxyCompletions: https://github.com/cilium/cilium/blob/952d9d33740fc77e1a94d5a52b38d2a313e5c570/pkg/endpoint/bpf.go#L641-L671
Indeed, there are a couple of goroutines stuck there in the gops-stack logs:
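The same goroutine stacks can also be pulled from a running agent outside of a sysdump. This is a sketch that assumes the stock image (which ships the gops client) and that cilium-agent runs as PID 1 in its container; <cilium-pod> is a placeholder:
kubectl -n kube-system exec <cilium-pod> -c cilium-agent -- gops stack 1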
If the L7 proxy is causing the CNI ADD call to time out, that is indeed interesting. Thanks for the update, we'll continue to investigate that.
FWIW, you should be able to disable all to-world tests in the connectivity check as well:
Looking at the sysdumps, it seems that you are indeed not hitting the proxy issue anymore. Do you have particular connectivity tests that are failing? Do the issues persist in Cilium 1.10.2?
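Regarding disabling the to-world tests, a possible invocation is the following sketch; it assumes the '!' prefix excludes tests whose names match the given pattern, and the exact test names depend on the cilium-cli version:
cilium connectivity test --test '!.*world.*'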
The kubectl logs error often occurs if there are node connectivity issues (i.e. the apiserver not being able to reach the kubelet on the node). Did you observe issues between particular nodes?
It might be worth creating a new issue for these new problems, so we can keep track of them separately, as they seem unrelated to the proxy.
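A rough way to check apiserver-to-kubelet reachability for a suspect node is to probe the kubelet port from a control-plane host. A sketch, where <node-ip> is a placeholder and 10250 is the kubelet default port:
curl -sk -o /dev/null -w '%{http_code}\n' https://<node-ip>:10250/healthz
# any HTTP status code means the kubelet is reachable;
# a timeout or connection refused points at a node connectivity problem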
I have run it again with the settings in the cilium-config as asked.
I noticed this as well in one of my test runs. I have attached the sysdump and the output of the test.
cilium-test-06.md cilium-sysdump-20210701-141236.zip
@knfoo We’re trying to reproduce the issue in order to be able to debug it further, but currently with no success.
If you still have your test cluster, it would be incredibly helpful if we could gather more information with flow debug logging enabled: add debug: "true" and debug-verbose: "flow" to the cilium-config ConfigMap. Thanks!
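One way to apply that is sketched below; note that the agents only read cilium-config at startup, so they need a restart afterwards:
kubectl -n kube-system patch configmap cilium-config --type merge \
  -p '{"data":{"debug":"true","debug-verbose":"flow"}}'
kubectl -n kube-system rollout restart daemonset/cilium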
The logs seem to verify that https://github.com/cilium/cilium/blob/952d9d33740fc77e1a94d5a52b38d2a313e5c570/pkg/endpoint/bpf.go#L671
never returns.
For example, for 309ac26cc3d7eb1727f97cf96197cc9e9387512234176df950d1bfe8f5460b13, corresponding to endpoint 40, there is an entry in the logs, but no matching Wait time for proxy updates: message.
Not sure why, but adding --set debug.enabled=true did something to the predictability of the failing test, so it took a little while extra to figure out how to provoke the error state.
cilium-sysdump-20210623-193409.zip cilium-sysdump-20210623-193436.zip
Because going over the GH issue can be a bit confusing, I think it would be valuable to post a summary of our current understanding after syncing with @knfoo offline.
While performing connectivity tests, @knfoo discovered that in some cases certain nodes will reach a state where pod creation fails. This seems to happen when the envoy proxy starts on said nodes [1] (if L7 tests are disabled, the issue does not happen). That is, in some cases starting envoy seems to obstruct pod creation.
In the latest sysdump, in k8s-events-20210615-154339.json, there are the following failures:
All of these appear to be scheduled on the same agent:
Which is also the only agent that starts envoy:
The issue does not happen every time envoy starts, but it can be consistently reproduced after doing multiple retries (last example was 1 attempt out of 8 but this might vary).
The pod creation request seems to reach the agent, but it's not clear what happens next.
In the same sysdump, we see the following error:
Respective agent logs show that the above request reached the agent:
I reran the test with sleep 5m; that did not make a difference.
I can update the issue title, but I am not really sure what to update it to.
This is the case when the L7Proxy is misbehaving…
So I am a little confused how you got to the statement that:
Thank you for verifying this.
Next, I would like to understand the underlying error when the L7 proxy is enabled.
This previous comment, for example, mentions that the pods failed to reach a ready state after the third attempt. Am I correct that the above comment refers to the kubeproxy case? Are pods failing to reach a ready state in the kubeProxyReplacement=strict case as well?
I would also like to understand whether the issue relates to having multiple tests in succession.
For the case(s) where there are pod readiness issues on the connectivity test, does the issue happen if you wait (e.g., using a sleep 5m) after deleting the namespace?
If adding a sufficiently large delay does not cause pod readiness issues, can you also please run the tests in different namespaces, with something like the following, and check whether the issue appears again?
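A possible sketch of such a run is below; the --test-namespace flag is an assumption about the cilium-cli version in use, not necessarily the exact command originally suggested:
for i in $(seq 1 5); do
  cilium connectivity test --test-namespace "cilium-test-${i}"
done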
I reran the test: 50 delete-ns and cilium connectivity test cycles. They all passed, so I guess that was a glitch.
I ran the test with the kubeproxy replacement
I ran the test: 50 delete-ns and cilium connectivity test cycles, and no errors. I guess it confirms that it has to do with the L7 proxy.
We had a closer look at the logs with @gandro.
One thing we would like to note is that the 9095 port is not the cilium agent API port. It is a port that is used by envoy to serve Prometheus metrics. Envoy is started when a L7 policy is imported.
For example, in cilium-4dnrn-cilium-agent-20210615-154347.log, the following log entry:
is followed by Envoy starting to listen on this port:
In other agents, where there are no relevant L7 policies, the above message about envoy listening on 9095 does not exist.
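A quick way to see which nodes actually have the proxy listening on 9095 is to reuse the earlier curl check against each node IP, or to grep the agent logs for Envoy messages; a sketch, where <node-ip> is a placeholder and the exact log wording may differ between versions:
# a 404 with "server: envoy", as in the output above, means the proxy is up on that node
curl -si http://<node-ip>:9095 | grep -i '^server:'
# or search all agent logs for envoy-related messages
kubectl -n kube-system logs -l k8s-app=cilium --prefix --tail=-1 | grep -i envoy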
To help pin down the issue, would it be possible to install cilium using --set l7Proxy=false so that envoy is disabled, and then run non-L7 connectivity tests using cilium connectivity test --test '!pod-to-world-toFQDNs'?
We will continue investigating this, but it would be helpful if we could verify whether our theory that this relates to the L7 proxy is correct.
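A sketch of how that could be applied on top of the existing install, reusing the release name and namespace from the original helm command and keeping all other values unchanged:
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set l7Proxy=false
cilium connectivity test --test '!pod-to-world-toFQDNs'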
It seems to be the same problem. During the 3rd cilium connectivity test the pods failed to get to a ready state:
Warning Unhealthy 73s (x49 over 2m49s) kubelet Readiness probe failed: Get "http://10.62.129.125:8080/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
I canceled the test and deleted the namespace; then we are back to:
Warning FailedCreatePodSandBox 6s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "eaa76446cfab3bb4b8edb7cbdffdb330e0abdf86384004775d18d372be56af71": Unable to create endpoint: Cilium API client timeout exceeded
Sysdump: cilium-sysdump-20210615-154339.zip
Could you also remove those since you’re now using kube-proxy?
I installed cilium with this command.
Disabling the bandwidth manager did not change the problem. Output from the cilium connectivity test: the world errors are expected, we have not opened up to all of the world in our border firewalls. test-2.md
Sysdump from the broken nodes: cilium-sysdump-20210611-111109.zip cilium-sysdump-20210611-111359.zip
I will recreate the cluster without disabling kube-proxy.
Those should be fine AFAIK.
From dmesg, this is what I have that is recurring:
I will first try to run cilium with bandwidthManager=false and then disable the kubeProxyReplacement tomorrow.
The cluster and nodes are idle. I have only run cilium connectivity test. The other deployments are static: no restarts, scaling, etc.
cilium-sysdump-20210604-132435.zip cilium-sysdump-20210604-132759.zip
I have run the sysdump command on a working node as well, but even if I specify the --size-limit the file is too large to upload to GH; it is 36.3 MB.
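A generic workaround for the attachment size limit is to split the archive into chunks below GitHub's per-file limit and reassemble it on the other end (the file name here is a placeholder):
split -b 20M cilium-sysdump.zip cilium-sysdump.zip.part-
cat cilium-sysdump.zip.part-* > cilium-sysdump.zip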