gvisor: DNS fails on gVisor using netstack on EKS
Description
I’m deploying Pods on my EKS cluster using the gVisor runtime, but outbound network requests fail while inbound requests succeed. The issue is mitigated when using network=host in the runsc config options.
Steps to reproduce
- I created a 2-node EKS cluster, configured one node to use containerd as its container runtime (CRI), and configured the gVisor runtime with containerd (following this tutorial). I also labeled the node I selected for gVisor with app=gvisor (see the label command sketched below).
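For reference, the labeling step is a single kubectl command; the node name here is taken from the node listing below:
# Label the node that should run gVisor Pods
kubectl label node ip-192-168-31-136.us-west-2.compute.internal app=gvisor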
EKS Cluster Nodes (you can see the first node using containerd as its container runtime):
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-192-168-31-136.us-west-2.compute.internal Ready <none> 3d1h v1.16.12-eks-904af05 192.168.31.136 35.161.102.17 Amazon Linux 2 4.14.181-142.260.amzn2.x86_64 containerd://1.3.2
ip-192-168-60-139.us-west-2.compute.internal Ready <none> 3d1h v1.16.12-eks-904af05 192.168.60.139 44.230.198.56 Amazon Linux 2 4.14.181-142.260.amzn2.x86_64 docker://19.3.6
runsc config on the gVisor node:
[ec2-user@ip-192-168-31-136 ~]$ ls /etc/containerd/
config.toml runsc.toml
[ec2-user@ip-192-168-31-136 ~]$ cat /etc/containerd/config.toml
disabled_plugins = ["restart"]
[plugins.linux]
shim_debug = true
[plugins.cri.containerd.runtimes.runsc]
runtime_type = "io.containerd.runsc.v1"
[plugins.cri.containerd.runtimes.runsc.options]
TypeUrl = "io.containerd.runsc.v1.options"
ConfigPath = "/etc/containerd/runsc.toml"
[ec2-user@ip-192-168-31-136 ~]$ cat /etc/containerd/runsc.toml
[runsc_config]
debug="true"
strace="true"
log-packets="true"
debug-log="/tmp/runsc/%ID%/"
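After changing either file, containerd needs to be restarted for the runsc runtime and its options to take effect. A quick sanity check that the CRI plugin picked up the runtime (a sketch, assuming crictl is installed and pointed at the containerd socket):
# Restart containerd and confirm the runsc runtime shows up in the CRI config
sudo systemctl restart containerd
sudo crictl info | grep -A3 runsc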
- I applied a gVisor runtime class to my cluster:
cat << EOF | tee gvisor-runtime.yaml
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
EOF
kubectl apply -f gvisor-runtime.yaml
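A quick sanity check that the RuntimeClass was registered:
kubectl get runtimeclass gvisor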
- And ran a simple nginx Pod using the gvisor runtime:
cat << EOF | tee nginx-gvisor.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-gvisor
spec:
  containers:
  - name: my-nginx
    image: nginx
    ports:
    - containerPort: 80
  nodeSelector:
    app: gvisor
  runtimeClassName: gvisor
EOF
kubectl create -f nginx-gvisor.yaml
To verify the Pod is running with gVisor:
# Get the container ID
kubectl get pod nginx-gvisor -o jsonpath='{.status.containerStatuses[0].containerID}'
containerd://9f71a133fc27c3a305710552489c16977d5c48cd40f31810c2010dac393c5ba7%
# List containers running with runsc on the gVisor node
[ec2-user@ip-192-168-31-136 gvisor]$ sudo env "PATH=$PATH" runsc --root /run/containerd/runsc/k8s.io list -quiet
9411dfee3811da9dd45e8681f697bcf5326173d6510238ce70beb02ffe00f444
9f71a133fc27c3a305710552489c16977d5c48cd40f31810c2010dac393c5ba7
- To test the inbound network traffic of the Pod, I simply curled port 80 of the Pod and it succeeded. To test the outbound network traffic of the Pod, I did the following:
kubectl exec --stdin --tty nginx-gvisor -- /bin/bash
root@nginx-gvisor:/# apt-get update
Err:1 http://security.debian.org/debian-security buster/updates InRelease
Temporary failure resolving 'security.debian.org'
Err:2 http://deb.debian.org/debian buster InRelease
Temporary failure resolving 'deb.debian.org'
Err:3 http://deb.debian.org/debian buster-updates InRelease
Temporary failure resolving 'deb.debian.org'
Reading package lists... Done
W: Failed to fetch http://deb.debian.org/debian/dists/buster/InRelease Temporary failure resolving 'deb.debian.org'
W: Failed to fetch http://security.debian.org/debian-security/dists/buster/updates/InRelease Temporary failure resolving 'security.debian.org'
W: Failed to fetch http://deb.debian.org/debian/dists/buster-updates/InRelease Temporary failure resolving 'deb.debian.org'
W: Some index files failed to download. They have been ignored, or old ones used instead.
You can see that it fails. Other attempts, such as wget www.google.com, fail as well.
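A quick way to tell whether only DNS is broken or all outbound traffic is broken (a sketch; it assumes bash and getent are present in the nginx image, which they normally are) is to try a raw-IP connection and a name lookup separately:
# Raw TCP connect to a public IP, no DNS involved
kubectl exec nginx-gvisor -- bash -c 'timeout 3 bash -c "</dev/tcp/1.1.1.1/80" && echo raw-ip ok || echo raw-ip failed'
# Name lookup through the resolver configured in /etc/resolv.conf (10.100.0.10)
kubectl exec nginx-gvisor -- getent hosts deb.debian.org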
For debug purposes, these are the DNS and routing tables (without net-tools, since I couldn’t install them) in the Pod container:
root@nginx-gvisor:/# cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local us-west-2.compute.internal
nameserver 10.100.0.10
options ndots:5
root@nginx-gvisor:/# cat /proc/net/route
Iface Destination Gateway Flags RefCnt Use Metric Mask MTU Window IRTT
eth0 0101FEA9 00000000 0001 0 0 0 FFFFFFFF 0 0 0
eth0 00000000 0101FEA9 0003 0 0 0 00000000 0 0 0
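For readability: the Destination/Gateway fields in /proc/net/route are little-endian hex, so 0101FEA9 decodes to 169.254.1.1 (the link-local default gateway the AWS VPC CNI points Pods at), and 751FA8C0 in the host-network table further below decodes to 192.168.31.117. A one-liner to decode such an address:
# 0101FEA9, read as little-endian bytes A9.FE.01.01
printf '%d.%d.%d.%d\n' 0xA9 0xFE 0x01 0x01   # -> 169.254.1.1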
I also captured the tcpdump packets on the ENI network interface for the Pod allocated by EKS: eni567d651201a.nohost.tcpdump.tar.gz. Details about the network interface:
[ec2-user@ip-192-168-31-136 ~]$ ifconfig
eni567d651201a: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
inet6 fe80::4cfa:44ff:fe5d:9495 prefixlen 64 scopeid 0x20<link>
ether 4e:fa:44:5d:94:95 txqueuelen 0 (Ethernet)
RX packets 3 bytes 270 (270.0 B)
RX errors 0 dropped 2 overruns 0 frame 0
TX packets 5 bytes 446 (446.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
I also captured runsc debug information for the containers in the Pod:
9f71a133fc27c3a305710552489c16977d5c48cd40f31810c2010dac393c5ba7.tar.gz
9411dfee3811da9dd45e8681f697bcf5326173d6510238ce70beb02ffe00f444.tar.gz
- Now, to verify that it works when the Pod uses the host network, I added network="host" to the /etc/containerd/runsc.toml file (sketched below) and restarted containerd. I reran the same experiment above with the following results:
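The updated runsc.toml looked roughly like this (same debug options as above, with only the network line added):
[runsc_config]
debug="true"
strace="true"
log-packets="true"
debug-log="/tmp/runsc/%ID%/"
network="host"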
Verify running Pod:
# Get the container ID
kubectl get pod nginx-gvisor -o jsonpath='{.status.containerStatuses[0].containerID}'
containerd://e4ec52fdad3e889bf386b1eca03e231ad53e0452e4bc623282732eba0d2da720%
# List containers running with runsc on the gVisor node
[ec2-user@ip-192-168-31-136 gvisor]$ sudo env "PATH=$PATH" runsc --root /run/containerd/runsc/k8s.io list -quiet
96198907b56174067a1aa2b9c0fa3644670675b25fa28a7b44234fc232cccd5d
e4ec52fdad3e889bf386b1eca03e231ad53e0452e4bc623282732eba0d2da720
Successful inbound with curl, and successful outbound as follows:
kubectl exec --stdin --tty nginx-gvisor -- /bin/bash
root@nginx-gvisor:/# apt-get update
Get:1 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Get:2 http://security.debian.org/debian-security buster/updates/main amd64 Packages [213 kB]
Get:3 http://deb.debian.org/debian buster InRelease [121 kB]
Get:4 http://deb.debian.org/debian buster-updates InRelease [51.9 kB]
Get:5 http://deb.debian.org/debian buster/main amd64 Packages [7905 kB]
Get:6 http://deb.debian.org/debian buster-updates/main amd64 Packages [7868 B]
Fetched 8364 kB in 6s (1462 kB/s)
Reading package lists... Done
DNS and routing table (with net-tools this time) on Pod:
root@nginx-gvisor:/# cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local us-west-2.compute.internal
nameserver 10.100.0.10
options ndots:5
root@nginx-gvisor:/# cat /proc/net/route
Iface Destination Gateway Flags RefCnt Use Metric Mask MTU Window IRTT
eth0 00000000 0101FEA9 0003 0 0 0 00000000 0 0 0
eth0 0101FEA9 00000000 0001 0 0 0 FFFFFFFF 0 0 0
eth0 751FA8C0 00000000 0001 0 0 0 FFFFFFFF 0 0 0
root@nginx-gvisor:/# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default 169.254.1.1 0.0.0.0 UG 0 0 0 eth0
169.254.1.1 0.0.0.0 255.255.255.255 U 0 0 0 eth0
192.168.31.117 0.0.0.0 255.255.255.255 U 0 0 0 eth0
TCPDump file: eni567d651201a.host.tcpdump.tar.gz. Details about the network interface:
[ec2-user@ip-192-168-31-136 ~]$ ifconfig
eni567d651201a: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
inet6 fe80::58a9:b5ff:feda:27e5 prefixlen 64 scopeid 0x20<link>
ether 5a:a9:b5:da:27:e5 txqueuelen 0 (Ethernet)
RX packets 10 bytes 796 (796.0 B)
RX errors 0 dropped 2 overruns 0 frame 0
TX packets 5 bytes 446 (446.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
runsc debug files:
96198907b56174067a1aa2b9c0fa3644670675b25fa28a7b44234fc232cccd5d.tar.gz
e4ec52fdad3e889bf386b1eca03e231ad53e0452e4bc623282732eba0d2da720.tar.gz
Environment
Please include the following details of your environment:
runsc -version
[ec2-user@ip-192-168-31-136 ~]$ runsc -version
runsc version release-20200622.1-171-gc66991ad7de6
spec: 1.0.1-dev
kubectl version and kubectl get nodes -o wide
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.6-beta.0", GitCommit:"e7f962ba86f4ce7033828210ca3556393c377bcc", GitTreeState:"clean", BuildDate:"2020-01-15T08:26:26Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.8-eks-fd1ea7", GitCommit:"fd1ea7c64d0e3ccbf04b124431c659f65330562a", GitTreeState:"clean", BuildDate:"2020-05-28T19:06:00Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-192-168-31-136.us-west-2.compute.internal Ready <none> 3d3h v1.16.12-eks-904af05 192.168.31.136 35.161.102.17 Amazon Linux 2 4.14.181-142.260.amzn2.x86_64 containerd://1.3.2
ip-192-168-60-139.us-west-2.compute.internal Ready <none> 3d3h v1.16.12-eks-904af05 192.168.60.139 44.230.198.56 Amazon Linux 2 4.14.181-142.260.amzn2.x86_64 docker://19.3.6
uname -a
$ uname -a
Darwin moehajj-C02CJ1ARML7M 19.6.0 Darwin Kernel Version 19.6.0: Sun Jul 5 00:43:10 PDT 2020; root:xnu-6153.141.1~9/RELEASE_X86_64 x86_64
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 48 (34 by maintainers)
Commits related to this issue
- copy PERM ARP entries from namespace on boot copy and setup PERMANENT (static) ARP entries from CNI namespace to the sandbox Fixes #3301 — committed to pkit/gvisor by deleted user 3 years ago
- copy PERM ARP entries from namespace on boot copy and setup PERMANENT (static) ARP entries from CNI namespace to the sandbox Fixes #3301 FUTURE_COPYBARA_INTEGRATE_REVIEW=https://github.com/google/g... — committed to google/gvisor by pkit 3 years ago
See #6803. Checked it on an actual amazon-vpc-cni-k8s setup and it indeed fixes the problem described here.
fwiw, I’ve written a guide on setting up an EKS cluster with gVisor, and a custom runsc version of your choice, as the container runtime. I hope it serves as a helpful starting point 😄
Netstack uses route order to determine priority. Linux uses a more complicated algorithm. We have talked about implementing it in runsc and having runsc generate the netstack routing table.
Cool. We have already started working on it anyway; I hope to submit a PR soon.
What happens is this: EKS relies on static ARP entries for 169.254.1.1 being present. A vanilla namespace for the containerd CNI looks like this:
For gVisor the ARP table is empty, because nothing regarding ARP is copied from the namespace here. More than that, gVisor’s ARP neighboring, described here, is used only in tests. Bottom line: gVisor does not really expose any static ARP handling API to either the CNI or the container itself. A fast fix would probably be to use that “testing” code to copy static entries at runsc boot and be done with it. Will try to do a PoC on that.
That assessment was correct. But the “running a command inside the container” part is pretty funny, as that’s the first thing I tried. Namely:
Oops.
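For anyone reproducing this, a sketch of how to check the static neighbor entry in the CNI-created Pod namespace on the node (the namespace name is illustrative; containerd typically mounts them under /var/run/netns):
# List namespaces created by the CNI and dump the neighbor table of the Pod's namespace
sudo ip netns list
sudo ip netns exec <cni-netns-name> ip neigh show
# expected (per the comment above): 169.254.1.1 dev eth0 lladdr <MAC> PERMANENT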
This issue dragging on for 2 years is pretty interesting, as it means nobody ever tried to use gVisor on EKS, even though a lot of CNI implementations rely on either static ARP entries or an ARP proxy (neither of which is supported). I wonder if Google uses the same gVisor in GKE Sandbox…
Hey Mohammed, that’s a great write-up! Just one small point – the write-up uses the unmaintained containerd shim from https://github.com/google/gvisor-containerd-shim.git (see the warning at the top of the repository, and the fact that it is an “archived” repository). Since about a year ago (3bb5f7164eefc6d5f33e7c08f87e2ae6df1c4bff), the shim has been built and shipped with the core repository and is included in releases as well. You can actually install it directly from the bucket, like runsc itself, e.g. wget https://storage.googleapis.com/gvisor/releases/release/latest/containerd-shim-runsc-v1. This also saves you from needing the Go toolchain for the installation.
Re: kubectl port-forward, it doesn’t work with runsc because containerd makes assumptions about the container’s network that are not true for sandboxes. There are more details here: https://github.com/kubernetes/enhancements/issues/1846
I haven’t yet gotten around to setting up my own EKS pod. It will take me some time, as I am not very familiar with EKS or AWS in general. That said, --network=host does not forward all ioctls, and that’s probably why you see some failures. Netstack implements some of the ioctls that are needed for ifconfig, and that’s why it works.
All netstack interfaces do support multicast/broadcast, but I think we don’t set the flags appropriately, or don’t return them correctly, for ifconfig to show them.
runsc does a few other things at startup as well: it steals the routes from the host for the interface being handed to runsc and passes them to runsc instead. So if you inspect the routes in the namespace in which runsc is running, you may not see all the rules, as some of them have been stolen and handed to runsc at startup. (runsc also removes the IP address from the host; otherwise the host would respond to TCP SYNs etc. with RSTs, since it isn’t aware of any listening sockets in Netstack.)
I will see if I can figure out how to set up EKS and will post if I find something. But mostly it looks like we may need to scrape any ARP entries from the namespace and pass them to runsc at startup. From what I can see, the static ARP entry is being installed in the namespace itself, rather than by running a command inside the container via a docker exec. In such a case the ARP cache on the host is updated, but it is invisible to runsc.
Offhand, looking at the tcpdump, it looks like the runsc routing table/lookup is somehow incorrect: Netstack is trying to resolve 169.254.1.1 by sending an ARP query and not getting anything back. I will have to set up a cluster to really see what might be going on.
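For reference, the unanswered ARP queries can be pulled out of the attached capture with something like the following (the extracted pcap filename is an assumption):
# Show only ARP traffic from the capture
tcpdump -nr eni567d651201a.nohost.tcpdump arp
# expected (per the observation above): repeated who-has 169.254.1.1 requests with no replies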
But looking at /proc/net/route, I see that runsc may not be sorting the routes correctly.