gvisor: DNS fails on gVisor using netstack on EKS
Description
I’m deploying Pods on my EKS cluster using the gVisor runtime, but outbound network requests fail while inbound requests succeed. The issue is mitigated when using network=host in the runsc config options.
Steps to reproduce
- I created a 2-node EKS cluster, configured one node to use containerd as its container runtime (CRI), and configured the gVisor runtime with containerd (following this tutorial). I also labeled the node I selected for gVisor with app=gvisor (see the label command sketched below).
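For reference, the labeling step is a single kubectl command; the node name here is taken from the node listing below:
# Label the node that should run gVisor Pods
kubectl label node ip-192-168-31-136.us-west-2.compute.internal app=gvisor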
EKS Cluster Nodes (you can see the first node using containerd as its container runtime):
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-192-168-31-136.us-west-2.compute.internal Ready <none> 3d1h v1.16.12-eks-904af05 192.168.31.136 35.161.102.17 Amazon Linux 2 4.14.181-142.260.amzn2.x86_64 containerd://1.3.2
ip-192-168-60-139.us-west-2.compute.internal Ready <none> 3d1h v1.16.12-eks-904af05 192.168.60.139 44.230.198.56 Amazon Linux 2 4.14.181-142.260.amzn2.x86_64 docker://19.3.6
runsc config on the gVisor node:
[ec2-user@ip-192-168-31-136 ~]$ ls /etc/containerd/
config.toml runsc.toml
[ec2-user@ip-192-168-31-136 ~]$ cat /etc/containerd/config.toml
disabled_plugins = ["restart"]
[plugins.linux]
shim_debug = true
[plugins.cri.containerd.runtimes.runsc]
runtime_type = "io.containerd.runsc.v1"
[plugins.cri.containerd.runtimes.runsc.options]
TypeUrl = "io.containerd.runsc.v1.options"
ConfigPath = "/etc/containerd/runsc.toml"
[ec2-user@ip-192-168-31-136 ~]$ cat /etc/containerd/runsc.toml
[runsc_config]
debug="true"
strace="true"
log-packets="true"
debug-log="/tmp/runsc/%ID%/"
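After changing either file, containerd needs to be restarted for the runsc runtime and its options to take effect. A quick sanity check that the CRI plugin picked up the runtime (a sketch, assuming crictl is installed and pointed at the containerd socket):
# Restart containerd and confirm the runsc runtime shows up in the CRI config
sudo systemctl restart containerd
sudo crictl info | grep -A3 runsc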
- I applied a gVisor runtime class to my cluster:
cat << EOF | tee gvisor-runtime.yaml
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
EOF
kubectl apply -f gvisor-runtime.yaml
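A quick sanity check that the RuntimeClass was registered:
kubectl get runtimeclass gvisor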
- And ran a simple nginx Pod using the gvisor runtime:
cat << EOF | tee nginx-gvisor.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-gvisor
spec:
  containers:
  - name: my-nginx
    image: nginx
    ports:
    - containerPort: 80
  nodeSelector:
    app: gvisor
  runtimeClassName: gvisor
EOF
kubectl create -f nginx-gvisor.yaml
To verify the Pod is running with gVisor:
# Get the container ID
kubectl get pod nginx-gvisor -o jsonpath='{.status.containerStatuses[0].containerID}'
containerd://9f71a133fc27c3a305710552489c16977d5c48cd40f31810c2010dac393c5ba7%
# List containers running with runsc on the gVisor node
[ec2-user@ip-192-168-31-136 gvisor]$ sudo env "PATH=$PATH" runsc --root /run/containerd/runsc/k8s.io list -quiet
9411dfee3811da9dd45e8681f697bcf5326173d6510238ce70beb02ffe00f444
9f71a133fc27c3a305710552489c16977d5c48cd40f31810c2010dac393c5ba7
- To test the inbound network traffic of the Pod, I simply curled port 80 of the Pod and it succeeded. To test the outbound network traffic of the Pod, I did the following:
kubectl exec --stdin --tty nginx-gvisor -- /bin/bash
root@nginx-gvisor:/# apt-get update
Err:1 http://security.debian.org/debian-security buster/updates InRelease
Temporary failure resolving 'security.debian.org'
Err:2 http://deb.debian.org/debian buster InRelease
Temporary failure resolving 'deb.debian.org'
Err:3 http://deb.debian.org/debian buster-updates InRelease
Temporary failure resolving 'deb.debian.org'
Reading package lists... Done
W: Failed to fetch http://deb.debian.org/debian/dists/buster/InRelease Temporary failure resolving 'deb.debian.org'
W: Failed to fetch http://security.debian.org/debian-security/dists/buster/updates/InRelease Temporary failure resolving 'security.debian.org'
W: Failed to fetch http://deb.debian.org/debian/dists/buster-updates/InRelease Temporary failure resolving 'deb.debian.org'
W: Some index files failed to download. They have been ignored, or old ones used instead.
You can see that it fails. Other attempts, such as wget www.google.com, fail as well.
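A quick way to tell whether only DNS is broken or all outbound traffic is broken (a sketch; it assumes bash and getent are present in the nginx image, which they normally are) is to try a raw-IP connection and a name lookup separately:
# Raw TCP connect to a public IP, no DNS involved
kubectl exec nginx-gvisor -- bash -c 'timeout 3 bash -c "</dev/tcp/1.1.1.1/80" && echo raw-ip ok || echo raw-ip failed'
# Name lookup through the resolver configured in /etc/resolv.conf (10.100.0.10)
kubectl exec nginx-gvisor -- getent hosts deb.debian.org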
For debug purposes, these are the DNS and routing tables (without net-tools, since I couldn’t install them) in the Pod container:
root@nginx-gvisor:/# cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local us-west-2.compute.internal
nameserver 10.100.0.10
options ndots:5
root@nginx-gvisor:/# cat /proc/net/route
Iface Destination Gateway Flags RefCnt Use Metric Mask MTU Window IRTT
eth0 0101FEA9 00000000 0001 0 0 0 FFFFFFFF 0 0 0
eth0 00000000 0101FEA9 0003 0 0 0 00000000 0 0 0
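For readability: the Destination/Gateway fields in /proc/net/route are little-endian hex, so 0101FEA9 decodes to 169.254.1.1 (the link-local default gateway the AWS VPC CNI points Pods at), and 751FA8C0 in the host-network table further below decodes to 192.168.31.117. A one-liner to decode such an address:
# 0101FEA9, read as little-endian bytes A9.FE.01.01
printf '%d.%d.%d.%d\n' 0xA9 0xFE 0x01 0x01   # -> 169.254.1.1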
I also captured the tcpdump packets on the ENI network interface for the Pod allocated by EKS: eni567d651201a.nohost.tcpdump.tar.gz. Details about the network interface:
[ec2-user@ip-192-168-31-136 ~]$ ifconfig
eni567d651201a: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
inet6 fe80::4cfa:44ff:fe5d:9495 prefixlen 64 scopeid 0x20<link>
ether 4e:fa:44:5d:94:95 txqueuelen 0 (Ethernet)
RX packets 3 bytes 270 (270.0 B)
RX errors 0 dropped 2 overruns 0 frame 0
TX packets 5 bytes 446 (446.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
I also captured runsc debug information for the containers in the Pod:
9f71a133fc27c3a305710552489c16977d5c48cd40f31810c2010dac393c5ba7.tar.gz
9411dfee3811da9dd45e8681f697bcf5326173d6510238ce70beb02ffe00f444.tar.gz
- Now, to verify that it works when the Pod uses the host network, I added network="host" to the /etc/containerd/runsc.toml file (sketched below) and restarted containerd. I reran the same experiment above with the following results:
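The updated runsc.toml looked roughly like this (same debug options as above, with only the network line added):
[runsc_config]
debug="true"
strace="true"
log-packets="true"
debug-log="/tmp/runsc/%ID%/"
network="host"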
Verify running Pod:
# Get the container ID
kubectl get pod nginx-gvisor -o jsonpath='{.status.containerStatuses[0].containerID}'
containerd://e4ec52fdad3e889bf386b1eca03e231ad53e0452e4bc623282732eba0d2da720%
# List containers running with runsc on the gVisor node
[ec2-user@ip-192-168-31-136 gvisor]$ sudo env "PATH=$PATH" runsc --root /run/containerd/runsc/k8s.io list -quiet
96198907b56174067a1aa2b9c0fa3644670675b25fa28a7b44234fc232cccd5d
e4ec52fdad3e889bf386b1eca03e231ad53e0452e4bc623282732eba0d2da720
Successful inbound with curl, and successful outbound as follows:
kubectl exec --stdin --tty nginx-gvisor -- /bin/bash
root@nginx-gvisor:/# apt-get update
Get:1 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Get:2 http://security.debian.org/debian-security buster/updates/main amd64 Packages [213 kB]
Get:3 http://deb.debian.org/debian buster InRelease [121 kB]
Get:4 http://deb.debian.org/debian buster-updates InRelease [51.9 kB]
Get:5 http://deb.debian.org/debian buster/main amd64 Packages [7905 kB]
Get:6 http://deb.debian.org/debian buster-updates/main amd64 Packages [7868 B]
Fetched 8364 kB in 6s (1462 kB/s)
Reading package lists... Done
DNS and routing table (with net-tools this time) on Pod:
root@nginx-gvisor:/# cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local us-west-2.compute.internal
nameserver 10.100.0.10
options ndots:5
root@nginx-gvisor:/# cat /proc/net/route
Iface Destination Gateway Flags RefCnt Use Metric Mask MTU Window IRTT
eth0 00000000 0101FEA9 0003 0 0 0 00000000 0 0 0
eth0 0101FEA9 00000000 0001 0 0 0 FFFFFFFF 0 0 0
eth0 751FA8C0 00000000 0001 0 0 0 FFFFFFFF 0 0 0
root@nginx-gvisor:/# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default 169.254.1.1 0.0.0.0 UG 0 0 0 eth0
169.254.1.1 0.0.0.0 255.255.255.255 U 0 0 0 eth0
192.168.31.117 0.0.0.0 255.255.255.255 U 0 0 0 eth0
TCPDump file: eni567d651201a.host.tcpdump.tar.gz. Details about the network interface:
[ec2-user@ip-192-168-31-136 ~]$ ifconfig
eni567d651201a: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
inet6 fe80::58a9:b5ff:feda:27e5 prefixlen 64 scopeid 0x20<link>
ether 5a:a9:b5:da:27:e5 txqueuelen 0 (Ethernet)
RX packets 10 bytes 796 (796.0 B)
RX errors 0 dropped 2 overruns 0 frame 0
TX packets 5 bytes 446 (446.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
runsc debug files:
96198907b56174067a1aa2b9c0fa3644670675b25fa28a7b44234fc232cccd5d.tar.gz
e4ec52fdad3e889bf386b1eca03e231ad53e0452e4bc623282732eba0d2da720.tar.gz
Environment
Please include the following details of your environment:
runsc -version
[ec2-user@ip-192-168-31-136 ~]$ runsc -version
runsc version release-20200622.1-171-gc66991ad7de6
spec: 1.0.1-dev
kubectl version and kubectl get nodes -o wide
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.6-beta.0", GitCommit:"e7f962ba86f4ce7033828210ca3556393c377bcc", GitTreeState:"clean", BuildDate:"2020-01-15T08:26:26Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.8-eks-fd1ea7", GitCommit:"fd1ea7c64d0e3ccbf04b124431c659f65330562a", GitTreeState:"clean", BuildDate:"2020-05-28T19:06:00Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-192-168-31-136.us-west-2.compute.internal Ready <none> 3d3h v1.16.12-eks-904af05 192.168.31.136 35.161.102.17 Amazon Linux 2 4.14.181-142.260.amzn2.x86_64 containerd://1.3.2
ip-192-168-60-139.us-west-2.compute.internal Ready <none> 3d3h v1.16.12-eks-904af05 192.168.60.139 44.230.198.56 Amazon Linux 2 4.14.181-142.260.amzn2.x86_64 docker://19.3.6
uname -a
$ uname -a
Darwin moehajj-C02CJ1ARML7M 19.6.0 Darwin Kernel Version 19.6.0: Sun Jul 5 00:43:10 PDT 2020; root:xnu-6153.141.1~9/RELEASE_X86_64 x86_64
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 48 (34 by maintainers)
Commits related to this issue
- copy PERM ARP entries from namespace on boot copy and setup PERMANENT (static) ARP entries from CNI namespace to the sandbox Fixes #3301 — committed to pkit/gvisor by deleted user 3 years ago
- copy PERM ARP entries from namespace on boot copy and setup PERMANENT (static) ARP entries from CNI namespace to the sandbox Fixes #3301 FUTURE_COPYBARA_INTEGRATE_REVIEW=https://github.com/google/g... — committed to google/gvisor by pkit 3 years ago
See #6803. Checked it on an actual amazon-vpc-cni-k8s setup and it indeed fixes the problem described here.
fwiw, I’ve written a guide on setting up an EKS cluster with gVisor, and a custom runsc version of your choice, as the container runtime. I hope it serves as a helpful starting point 😄
Netstack uses route order to determine priority. Linux uses a more complicated algorithm. We have talked about implementing it in runsc and having runsc generate the netstack routing table.
Cool. We have already started working on it anyway; I hope to submit a PR soon.
What happens is this: EKS relies on static ARP entries for 169.254.1.1 being present. A vanilla namespace for the containerd CNI looks like this:
For gVisor the ARP table is empty, because nothing regarding ARP is copied from the namespace here. More than that, gVisor’s ARP neighboring, described here, is used only in tests. Bottom line: gVisor does not really expose any static ARP handling API to either the CNI or the container itself. A fast fix would probably be to use that “testing” code to copy static entries at runsc boot and be done with it. Will try to do a PoC on that.
That assessment was correct. But the “running a command inside the container” part is pretty funny, as that’s the first thing I tried. Namely:
Oops.
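For anyone reproducing this, a sketch of how to check the static neighbor entry in the CNI-created Pod namespace on the node (the namespace name is illustrative; containerd typically mounts them under /var/run/netns):
# List namespaces created by the CNI and dump the neighbor table of the Pod's namespace
sudo ip netns list
sudo ip netns exec <cni-netns-name> ip neigh show
# expected (per the comment above): 169.254.1.1 dev eth0 lladdr <MAC> PERMANENT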
This issue dragging on for 2 years is pretty interesting, as it means nobody ever tried to use gVisor on EKS, even though a lot of CNI implementations rely on either static ARP entries or an ARP proxy (neither of which is supported). I wonder if Google uses the same gVisor in GKE Sandbox…
Hey Mohammed, that’s a great write-up! Just one small point – the write-up uses the unmaintained containerd shim from https://github.com/google/gvisor-containerd-shim.git (see the warning at the top of the repository, and the fact that it is an “archived” repository). Since about a year ago (3bb5f7164eefc6d5f33e7c08f87e2ae6df1c4bff), the shim has been built and shipped with the core repository and is included in releases as well. You can actually install it directly from the bucket, like runsc itself, e.g. wget https://storage.googleapis.com/gvisor/releases/release/latest/containerd-shim-runsc-v1. This also saves you from needing the Go toolchain for the installation.
Re: kubectl port-forward, it doesn’t work with runsc because containerd makes assumptions about the container’s network that are not true for sandboxes. There are more details here: https://github.com/kubernetes/enhancements/issues/1846
I haven’t yet gotten around to setting up my own EKS pod. It will take me some time, as I am not very familiar with EKS or AWS in general. That said, --network=host does not forward all ioctls, and that’s probably why you see some failures. Netstack implements some of the ioctls that are needed for ifconfig, and that’s why it works.
All netstack interfaces do support multicast/broadcast, but I think we don’t set the flags appropriately, or don’t return them correctly, for ifconfig to show them.
runsc does a few other things at startup as well: it steals the routes from the host for the interface being handed to runsc and passes them to runsc instead. So if you inspect the routes in the namespace in which runsc is running, you may not see all the rules, as some of them have been stolen and handed to runsc at startup. (runsc also removes the IP address from the host; otherwise the host would respond to TCP SYNs etc. with RSTs, since it isn’t aware of any listening sockets in Netstack.)
I will see if I can figure out how to set up EKS and will post if I find something. But mostly it looks like we may need to scrape any ARP entries from the namespace and pass them to runsc at startup. From what I can see, the static ARP entry is being installed in the namespace itself, rather than by running a command inside the container via a docker exec. In such a case the ARP cache on the host is updated, but it is invisible to runsc.
Offhand, looking at the tcpdump, it looks like the runsc routing table/lookup is somehow incorrect: Netstack is trying to resolve 169.254.1.1 by sending an ARP query and not getting anything back. I will have to set up a cluster to really see what might be going on.
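For reference, the unanswered ARP queries can be pulled out of the attached capture with something like the following (the extracted pcap filename is an assumption):
# Show only ARP traffic from the capture
tcpdump -nr eni567d651201a.nohost.tcpdump arp
# expected (per the observation above): repeated who-has 169.254.1.1 requests with no replies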
But looking at /proc/net/route, I see that runsc may not be sorting the routes correctly.