kubernetes: kube-proxy health check doesn't fail when kube-proxy can't acquire iptables lock
What happened?
We upgraded iptables on one of our hosts, and kube-proxy began logging errors about being unable to acquire the xtables.lock
. However, we didn’t notice this, because kube-proxy’s health check (/healthz
) was still passing (and that’s generally what we look at as our sign that something is amiss.) As a result, kube-proxy ran for days without being able to update iptables rules, eventually causing an outage for one of our services.
What did you expect to happen?
We expected that kube-proxy’s health check would fail if it was not able to interface with iptables.
How can we reproduce it (as minimally and precisely as possible)?
Our system was running Debian stretch, running kube-proxy in iptables mode with systemd, with iptables version 1.6.0. Given that starting configuration, the repro case is:
- Upgrade iptables to 1.6.2.
- Observe that kube-proxy begins emitting logs saying that it cannot update iptables because another process has the
xtables.lock
. - Observe that kube-proxy’s health check endpoint (
/healthz
) does not return an error.
Anything else we need to know?
Example health check output:
$ curl -vvv localhost:10256/healthz
* Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 10256 (#0)
> GET /healthz HTTP/1.1
> Host: localhost:10256
> User-Agent: curl/7.52.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< X-Content-Type-Options: nosniff
< Date: Tue, 19 Oct 2021 22:21:06 GMT
< Content-Length: 149
<
* Curl_http_done: called premature == 0
* Connection #0 to host localhost left intact
{"lastUpdated": "2021-10-19 22:21:06.566660851 +0000 UTC m=+219.246054255","currentTime": "2021-10-19 22:21:06.566660851 +0000 UTC m=+219.246054255"}
Kubernetes version
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.11", GitCommit:"27522a29febbcc4badac257763044d0d90c11abd", GitTreeState:"clean", BuildDate:"2021-09-15T19:21:44Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.11", GitCommit:"27522a29febbcc4badac257763044d0d90c11abd", GitTreeState:"clean", BuildDate:"2021-09-15T19:16:25Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider
None
OS version
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
VERSION_CODENAME=stretch
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
$ uname -a
Linux myhost 5.4.38 #1~bt9+1 SMP Tue May 12 21:16:19 UTC 2020 x86_64 GNU/Linux
Install tools
The hard way
Container runtime (CRI) and and version (if applicable)
Docker version 19.03.8, build afacb8b7f0
Related plugins (CNI, CSI, …) and versions (if applicable)
Calico v3.18.4
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 20 (19 by maintainers)
Oh! That’s wrong. We don’t care about the time of the most-recently queued update, we care about the time of the oldest queued update that was after the last
Updated
. So rather than havinglastQueued
and updating it every timeQueuedUpdate
gets called, we should have a variable something likeoldestPendingUpdate
, andUpdated
can set it to 0, andQueuedUpdate
should only set it toNow()
if it was previously 0.In fact, we don’t even need
lastUpdated
. Kube-proxy would be healthy ifoldestPendingUpdate.IsZero()
orcurrentTime.Sub(oldestPendingUpdate) < h.hs.healthTimeout
.