kubernetes: kube-proxy currently incompatible with `iptables >= 1.8`

What happened:

When creating nodes on machines with iptables >= 1.8, kube-proxy is unable to initialize and route service traffic. The following is logged:

kube-proxy-22hmk kube-proxy E1120 07:08:50.135017       1 proxier.go:647] Failed to ensure that nat chain KUBE-SERVICES exists: error creating chain "KUBE-SERVICES": exit status 3: iptables v1.6.0: can't initialize iptables table `nat': Table does not exist (do you need to insmod?)
kube-proxy-22hmk kube-proxy Perhaps iptables or your kernel needs to be upgraded.

This is a compatibility issue in iptables, which I believe is called directly from kube-proxy. It is likely due to the module reorganization that came with iptables' move to nf_tables: https://marc.info/?l=netfilter&m=154028964211233&w=2

iptables 1.8 in a container remains backwards compatible with a host still using the iptables 1.6 (ip_tables) modules, but iptables 1.6 in a container cannot drive a host that has moved to iptables 1.8 (nf_tables):

root@vm77:~# iptables --version
iptables v1.6.1
root@vm77:~# docker run --cap-add=NET_ADMIN drags/iptables:1.6 iptables -t nat -Ln
iptables: No chain/target/match by that name.
root@vm77:~# docker run --cap-add=NET_ADMIN drags/iptables:1.8 iptables -t nat -Ln
iptables: No chain/target/match by that name.



root@vm83:~# iptables --version
iptables v1.8.1 (nf_tables)
root@vm83:~# docker run --cap-add=NET_ADMIN drags/iptables:1.6 iptables -t nat -Ln
iptables v1.6.0: can't initialize iptables table `nat': Table does not exist (do you need to insmod?)
Perhaps iptables or your kernel needs to be upgraded.
root@vm83:~# docker run --cap-add=NET_ADMIN drags/iptables:1.8 iptables -t nat -Ln
iptables: No chain/target/match by that name.

However, the kube-proxy image is based on debian:stretch, where iptables 1.8 may only arrive via stretch-backports.

How to reproduce it (as minimally and precisely as possible):

Install a node onto a host with iptables 1.8 installed (e.g. Debian Testing/Buster)

Anything else we need to know?:

I can keep these nodes in this config for a while, feel free to ask for any helpful output.

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-24T06:54:59Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.4", GitCommit:"bf9a868e8ea3d3a8fa53cbb22f566771b3f8068b", GitTreeState:"clean", BuildDate:"2018-10-25T19:06:30Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:

libvirt

  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Debian GNU/Linux buster/sid"
NAME="Debian GNU/Linux"
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel (e.g. uname -a):
Linux vm28 4.16.0-1-amd64 #1 SMP Debian 4.16.5-1 (2018-04-29) x86_64 GNU/Linux
  • Install tools:

kubeadm

  • Others:

/kind bug

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 82 (61 by maintainers)

Most upvoted comments

This works for me:

update-alternatives --set iptables /usr/sbin/iptables-legacy

There are 2 sets of modules for packet filtering in the kernel: ip_tables, and nf_tables. Until recently, you controlled the ip_tables ruleset with the iptables family of tools, and nf_tables with the nft tools.

In iptables 1.8, the maintainers have “deprecated” the classic ip_tables: the iptables tool now does userspace translation from the legacy UI/UX, and uses nf_tables under the hood. So, the commands look and feel the same, but they’re now programming a different kernel subsystem.

The problem arises when you mix and match invocations of iptables 1.6 (the previous stable) and 1.8 on the same machine, because although they look identical, they’re programming different kernel subsystems. In practice, at least Docker does some stuff with iptables on the host (uncontained), and so you end up with some rules in nf_tables and some rules (including those programmed by kube-proxy and most CNI addons) in legacy ip_tables.

Empirically, this causes weird and wonderful things to happen - things like if you trace a packet coming from a pod, you see it flowing through both ip_tables and nf_tables, but even if both accept the packet, it then vanishes entirely and never gets forwarded (this is the failure mode I reported to Calico and Weave - bug links upthread - after trying to run k8s on debian testing, which now has iptables 1.8 on the host).

Bottom line, the networking containers on a machine have to be using the same minor version of the iptables binary as exists on the host.

As a preface, one thing to note: iptables 1.8 ships two binaries, iptables and iptables-legacy. The latter always programs ip_tables. So, there’s fortunately no need to bundle two versions of iptables into a container, you can bundle just iptables 1.8 and be judicious about which binary you invoke… At least until the -legacy binary gets deleted, presumably in a future release.
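
For example, on a host with iptables 1.8 installed, a quick way to see which subsystem rules actually landed in is to dump both backends (a quick check, assuming the iptables-legacy-save and iptables-nft-save binaries that ship with iptables 1.8 are present):

# Count rules programmed through each backend; if both report non-zero
# counts, something on this host is mixing ip_tables and nf_tables.
echo "legacy (ip_tables): $(iptables-legacy-save 2>/dev/null | grep -c '^-A')"
echo "nft (nf_tables):    $(iptables-nft-save 2>/dev/null | grep -c '^-A')"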

Here are some requirements I think an ideal solution would have:

  • k8s networking must continue to function, obviously.
  • should be robust to the host iptables getting upgraded while the system is running (e.g. apt-get upgrade in the background).
  • should be robust to other k8s pods (e.g. CNI addons) using the “wrong” version of iptables.
  • should be invisible to cluster operators - k8s should just keep working throughout.
  • should not require a “flag day” on which everything must cut over simultaneously. There are too many things in k8s that touch iptables (docker, kube-proxy, CNI addons) to enforce that sanely, and k8s’s eventual consistency model doesn’t make a hard cutover without downtime possible anyway.
  • at the very least, the problem should be detected and surfaced as a fatal node misconfiguration, so that any automatic cluster healing can attempt to help.

So far I’ve only thought up crappy options for dealing with this. I’ll throw them out in the hopes that it leads to better ideas.

  • Mount chunks of the host filesystem (/usr/sbin, /lib, …) into kube-proxy’s VFS, and make it chroot() to that quasi-host-fs when executing iptables commands. That way it’s always using exactly the binary present on the host. Introduces obvious complexity, as well as a bunch of security risks if an attacker gets code execution in the kube-proxy container.
  • Using iptables 1.8 in the container, probe both iptables and iptables-legacy for the presence of rules installed by the host. Hopefully, there will be rules in only one of the two, and that can tell kube-proxy which one to use. This is subject to race conditions, and is fragile to host mutations that happen after kube-proxy startup (e.g. apt-get upgrade that upgrades iptables and restarts the docker daemon, shifting its rules over to nf_tables). Can solve it with periodic reconciling (i.e. “oops, host seems to have switched to nf_tables, wipe all ip_tables rules and reinstall them in nf_tables!”)
  • Punt the problem up to kubeadm and an entry in the KubeProxyConfiguration cluster object. IOW, just document that “it’s your responsibility to correctly tell kube-proxy which version of iptables you’re using, or things will break.” Relies on humans to get things right, which I predict will cause a rash of broken clusters. If we do this, we should absolutely also wire something into node-problem-detector that fires when both ip_tables and nf_tables have rules programmed.
  • Have a cutover release in which kube-proxy starts using nf_tables exclusively, through the nft tools, and mandate that host OSes for k8s must do everything in nf_tables, no ip_tables allowed. Likely intractable given the variety of addons and non-k8s software that does stuff to the firewall (same reason iptables has endured all these years even though nftables is measurably better in every way).
  • Find some kernel hackers and ask them if there’s any way to make ip_tables and nf_tables play nicer together, so that userspace can just continue tolerating mismatches indefinitely. I’m assuming this is ~impossible, otherwise they’d have done it already to facilitate the transition to nf_tables.
  • Create a new DaemonSet whose sole purpose is to be an RPC-to-iptables translator, and get all iptables-using pods in k8s to use it instead of talking direct to the kernel. Clunky, expensive, and doesn’t solve the problem of host software touching stuff.
  • Just document (via a Sonobuoy conformance test) that this is a big bag of knives, and kick the can over to cluster operators to figure out how to safely upgrade k8s in place given these constraints. I can at least speak on behalf of GKE and say that I sure hope it doesn’t come to that, because all our options are strictly worse. I can also speak as the author of MetalLB and say that the support load from people with broken on-prem installs will be completely unsustainable for me 😃

Of all of these, I think “probe with both binaries and try to conform to whatever is already there” is the most tractable if kube-proxy were the only problem pod… But given the ecosystem of CNI addons and other third-party things, I foresee never-ending duels of controllers flapping between ip_tables and nf_tables, all trying to vaguely converge on a single stack but never succeeding.
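
A rough sketch of what that probing could look like (illustrative only; it assumes the container ships iptables 1.8, which provides both iptables-legacy and iptables-nft along with their -save counterparts):

#!/bin/sh
# Hypothetical mode selection: defer to whichever backend the host (docker,
# kubelet, ...) has already programmed rules into; default to legacy if both
# are empty. Would need to be re-run periodically to catch host changes.
legacy=$(iptables-legacy-save 2>/dev/null | grep -c '^-A')
nft=$(iptables-nft-save 2>/dev/null | grep -c '^-A')
if [ "$nft" -gt "$legacy" ]; then
    mode=nft
else
    mode=legacy
fi
exec "iptables-$mode" "$@"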

I can confirm that updating the host to use iptables-legacy works on Raspbian 10 (arm) and Debian 10 (amd64) to resolve the iptables mismatch issue.

For completeness it may be beneficial to update all of the network tools to use the legacy versions to avoid issues. These commands may or may not be pertinent depending upon specific host configuration but will avoid mixing legacy and nft modes if invoked from outside docker/kubernetes.

update-alternatives --set iptables /usr/sbin/iptables-legacy
update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
update-alternatives --set arptables /usr/sbin/arptables-legacy
update-alternatives --set ebtables /usr/sbin/ebtables-legacy
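
The change can be sanity-checked afterwards; with iptables 1.8 the version string reports the active backend (exact output may vary by distribution):

# Should now report "(legacy)" rather than "(nf_tables)"
iptables --version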

@danwinship @thockin Looks like it would be good to document any workarounds for 1.16 release notes since folks are hitting this already? (see https://github.com/kubernetes/kubernetes/issues/82361 for example)

The official kubernetes packages, and in particular kubeadm-based installs, are fixed as of 1.17. Other distributions of kubernetes may have been fixed earlier or might not be fixed yet.

When using nf_tables mode, duplicate rules are added to the KUBE-FIREWALL chain indefinitely:

Chain KUBE-FIREWALL (2 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x8000/0x8000 /* kubernetes firewall for dropping marked packets */
    0     0 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x8000/0x8000 /* kubernetes firewall for dropping marked packets */
    0     0 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x8000/0x8000 /* kubernetes firewall for dropping marked packets */
    0     0 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x8000/0x8000 /* kubernetes firewall for dropping marked packets */
....

This happens in both proxy-mode ipvs and proxy-mode iptables.

Kind of a creepy idea, but you could use nsenter to run the iptables command on the host, in the host’s environment.
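
Roughly along these lines, assuming a privileged pod that shares the host PID namespace so that PID 1 is the host’s init:

# Execute the host's own iptables binary inside the host's mount and
# network namespaces (target PID 1 = host init, by assumption).
nsenter --target 1 --mount --net -- iptables -t nat -L KUBE-SERVICES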

kube-proxy itself seems compatible with iptables >= 1.8, so the title of this issue is somewhat misleading. I have made basic tests and see no problems when using the correct version of the user-space iptables (and, for IPv6, ip6tables) and the supporting libs. I don’t think this problem can be fixed by altering some code in kube-proxy.

Tested versions; iptables v1.8.2, linux 4.19.3

The problem seems to be that the iptables user-space program (and libs) is (and has always been) dependent on the kernel version on the host. When the iptables user-space program in a container is an old version, this problem is bound to happen sooner or later, and it will happen again.

The kernel/user-space dependency is one of the problems that nft is supposed to fix. A long-term solution may be to replace iptables with nft or bpf.

To get rid of that libvirt error, my permanent workaround on Debian 11 (as a host) with the libvirtd daemon is to block the loading of the iptables-related modules:

Create the file /etc/modprobe.d/nft-only.conf:


#  Source: https://www.gaelanlloyd.com/blog/migrating-debian-buster-from-iptables-to-nftables/
#
blacklist x_tables
blacklist iptable_nat
blacklist iptable_raw
blacklist iptable_mangle
blacklist iptable_filter
blacklist ip_tables
blacklist ipt_MASQUERADE
blacklist ip6table_nat
blacklist ip6table_raw
blacklist ip6table_mangle
blacklist ip6table_filter
blacklist ip6_tables

The libvirtd daemon now starts without any error.

Post-analysis: Apparently, I had the iptables modules loaded alongside many nft-related modules; once ip_tables was gone, the pesky error message went away.

We should be able to give containers a working set of iptables-legacy or iptables-nft binaries directly rather than needing a proxy. Just give them an entire chroot rather than just the binaries. (ie, build a Debian container image containing only the iptables package and the packages it depends on (eg, glibc), and then mount that somewhere in the pod). Then instead of overwriting their /usr/sbin/iptables with a proxy binary, you overwrite it with a shell script that does chroot /iptables-binary-volume-sadkjf -- iptables "$@", etc. Or that works with the hostBinaries volume idea too; the volume would just contain the chroot within it in addition to the wrapper scripts.
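
As a sketch, the wrapper could be as small as this, assuming the iptables-only chroot image is mounted at /iptables-chroot (the mount path is made up here):

#!/bin/sh
# Hypothetical replacement for the image's /usr/sbin/iptables: always run
# the iptables binary from the mounted chroot, with that chroot's libraries.
exec chroot /iptables-chroot /usr/sbin/iptables "$@"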

NB: We’d need every single container that uses iptables to participate in this…

Every single container that uses iptables in the root network namespace. It’s fine for, eg, istio, to use whatever iptables mode it wants in the pod namespace. (Though if you have multiple sidecars in a pod they all need to use the same mode…)

We’d also need some EOL plan - when can we stop doing this?

Probably as long as we care about people running Kubernetes on RHEL/CentOS 7. (People will probably be running RHEL 7 longer than people are running CentOS 7, but we might care about those users less. Either way, by the time we stop caring about that, everyone else should be using nft mode.)

I don’t think we need to handle non-privileged pods (or without the caps we care about) - they should not be able to use iptables anyway, but we should triple check that.

That is correct. Pods need to be hostNetwork and either privileged or CAP_NET_ADMIN for them to matter.

  • the host may not have iptables or it may be installed without optional packages (debian split iptables and iptables v6 IIRC and they probably put all the kernel modules in individual packages because they like to be fine-grained)

Your main point stands, but the Debian (and Ubuntu, and …) packages aren’t fine-grained: iptables contains all the (arp|eb|ip|ip6|x)tables tools, and the kernel package contains all the modules. There are package splits in the iptables source, but they split out library packages, not tool packages, and the iptables package depends on them all anyway.

@thockin Debian Buster is the main one (as Debian is used as the default distro by many of the k8s components), Ubuntu 19.04, RHEL 8 (and the upcoming Centos 8 by extension), Alpine 3.10, Fedora >= 29.

A built-in “figure it out” mode seems right. This is frankly ridiculous. This is what APIs are for, and like it or not exec iptables is an API. Forcing all parties to coordinate and use the same binaries is ridonculous and clearly not workable.

It sounds like openshift has implemented a “figure it out” mode on its own. But I know many customers who are not going to be happy hostPath mounting / into kube-proxy. Do we REALLY need the host’s binaries or can we install iptables 1.8+ and call our own iptables.sh which does the same detection?

What distros are known to have 1.8 available so I can do some playing?

The two modes are supposed to be equivalent in terms of behavior. (The advantage of using iptables in nft mode is that it lets other parts of the system use nft directly, and get nft’s advantages, and their rules will interoperate correctly with the iptables-nft rules. Whereas if you use iptables-legacy, the iptables rules and nft rules would conflict with each other in complicated ways.)

So anyway, the two modes are supposed to be equivalent, so we shouldn’t have to test against both modes, and if we did, and something in kubernetes didn’t work right in one mode, that would indicate an iptables or kernel bug, not a kubernetes bug. It’s possible we might end up wanting to add workarounds to kubernetes for a bug in one or the other mode at some point, if someone discovers such a bug, but I don’t think we need to be testing against both modes continuously.

(And in practice, OCP on RHEL 8 using nft mode works just fine, other than possibly one problem with a -j REJECT mysteriously not actually rejecting and behaving like it was -j DROP.)

Our approach in OpenShift is to have the relevant pods mount the entire host filesystem, and in the corresponding image we install wrapper scripts in /usr/sbin that chroot to the host filesystem and exec the copy of iptables there.

These images then work on any system regardless of whether it has old or new iptables, and in the latter case, whether that iptables is configured to use “legacy” or “nft” mode. (In particular, these images work on both RHEL 7, using legacy iptables, and RHEL 8, using new iptables in nft mode.)
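
A stripped-down sketch of such a wrapper, assuming the host filesystem is mounted at /host inside the pod (the mount point is an assumption, not necessarily the path OpenShift uses):

#!/bin/sh
# Hypothetical in-image /usr/sbin/iptables: chroot into the host filesystem
# and exec the host's own iptables, so the mode (legacy vs nft) always
# matches whatever the host is using.
exec chroot /host /usr/sbin/iptables "$@"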

But I’m thinking we should get iptables upstream to add a new “figure it out” mode to the client binaries, which would internally do something along the lines of what @dcbw suggested above to figure out if the system iptables was using nft mode or legacy mode, and then it would just use the same mode. Then we just tell everyone “make sure your containers are using the iptables-for-containers package from iptables version 1.8.whatever or later” and they don’t have to worry beyond that.

Just wanted to post another confirmation here that when installing / running a K8s cluster on Raspberry Pis with Raspbian 10 / Buster, I had to run:

update-alternatives --set iptables /usr/sbin/iptables-legacy

Otherwise I was getting lots of networking errors from various non-core pods (e.g. coredns and ingress); kube-proxy and kube-apiserver were seemingly fine, but flannel, metrics-server, and nfs-client-provisioner were crashlooping.

I ran the above command on each node and rebooted all nodes, and everything quickly switched to Running status.

@danwinship the symlink itself will actually tell you what the binary is. Following the ‘iptables’ symlink will either:

  1. be a direct symlink to iptables-legacy or iptables-nft
case $(readlink /sbin/iptables) in
*xtables-legacy-multi|*iptables-legacy)
      echo "legacy"
      ;;
*xtables-nft-multi|*iptables-nft)
      echo "nft"
      ;;
esac
  2. on systems that use ‘alternatives’ (because hey Linux is all about CHOICE! right???) we can call alternatives to tell us (but I’m not sure exactly what alternatives does underneath):
case $(alternatives --list | awk '/^iptables /{print $3}') in
*iptables-legacy|*xtables-legacy-multi)
      echo "legacy"
      ;;
*iptables-nft|*xtables-nft-multi)
      echo "nft"
      ;;
esac

Even that’s a bit more complicated. Possibly the best choice here is to simply accept that if a container wants to modify the host OS then it may need to run tools provided by the host OS and not blindly assume that stuff it ships internally can always be used. Mount the host bin/lib/etc into /host and then have your internal /usr/sbin/iptables be a chroot wrapper into those dirs so that when kube-proxy calls iptables it actually runs the wrapper and does the right thing.

I remain unconvinced that containers that wish to modify the host OS can just blindly go about whatever they want to do and assume the host OS doesn’t matter.

I experienced the same issue in #72370. As a workaround I found this in the Oracle docs, which allowed the pods to communicate with each other as well as with the outside world again.