cilium: kube-proxy-replacement breaks nfs reconnect

Is there an existing issue for this?

  • I have searched the existing issues

Update: this applies to any in-kernel client, since a different code path is used to rewrite the service IP when the connection is initiated from inside the kernel.

What happened?

We set up Cilium using: cilium install --kube-proxy-replacement probe --helm-set k8sServiceHost=1.2.3.4,k8sServicePort=6443,global.etcd.enabled=true,global.etcd.managed=true,operator.replicas=3

We are using the NFS operator from here: https://github.com/openebs/dynamic-nfs-provisioner

We provision a pod attached to the NFS-mounted volume; everything works successfully.

We then fake an outage on the node hosting the NFS server, and the NFS pod is rescheduled to another node.

With kube-proxy-replacement enabled, the mount inside the pod consuming the NFS server hangs indefinitely with:

[147766.740926] NFSD: Unable to initialize client recovery tracking! (-22)
[147766.740929] NFSD: starting 90-second grace period (net f000174e)
[147792.384961] nfs: server 10.76.145.123 not responding, timed out
[147819.009203] nfs: server 10.76.145.123 not responding, still trying
[147972.610076] nfs: server 10.76.145.123 not responding, timed out
[147977.733893] nfs: server 10.76.145.123 not responding, timed out
[147980.801919] nfs: server 10.76.145.123 not responding, still trying
[148152.834725] nfs: server 10.76.145.123 not responding, timed out
[148157.954807] nfs: server 10.76.145.123 not responding, timed out
[148163.075008] nfs: server 10.76.145.123 not responding, timed out

10.76.145.123 is the ClusterIP of the NFS service:

service/nfs-pvc-729ceb2e-0959-4d81-bc96-94fd37a73061   ClusterIP   10.76.145.123   <none>        2049/TCP,111/TCP   42m   openebs.io/nfs-server=nfs-pvc-729ceb2e-0959-4d81-bc96-94fd37a73061

When not using kube-proxy-replacement, the mount reconnects automatically and everything works again.

Cilium Version

cilium-cli: v0.12.4 compiled with go1.19.1 on linux/amd64
cilium image (default): v1.12.2
cilium image (stable): v1.12.2
cilium image (running): v1.12.2

Kernel Version

Ubuntu 22.04

Linux dedi2-cplane1.23-106-58-144.lon-01.uk.appsolo.com 5.15.0-48-generic #54-Ubuntu SMP Fri Aug 26 13:26:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

v1.25.2

Sysdump

No response

Relevant log output

No response

Anything else?

No response

Code of Conduct

  • I agree to follow this project’s Code of Conduct

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 42 (13 by maintainers)

Most upvoted comments

This kernel patch fixes the issue.
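
(For context, the approach is roughly to have kernel_connect() copy the caller-supplied address before passing it to the protocol's connect, so a BPF socket hook rewrites the copy rather than the caller's own memory. A sketch of that shape, not the literal patch text:)

	int kernel_connect(struct socket *sock, struct sockaddr *addr,
			   int addrlen, int flags)
	{
		/* Copy onto the stack: a BPF connect hook (e.g. Cilium's
		 * service translation) may rewrite the address in place,
		 * and the caller may have handed us a pointer into its
		 * own long-lived state. */
		struct sockaddr_storage address;

		memcpy(&address, addr, addrlen);

		return sock->ops->connect(sock, (struct sockaddr *)&address,
					  addrlen, flags);
	}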

@aditighag Yes, I found that reconnect attempts are made when the remote connectivity breaks.

I wonder if this logic in Cilium's BPF programs suffices, or are there cases where your kernel patch would have to be extended for these socket calls?

I actually drafted an earlier patch that performs an address copy for both kernel_sendmsg and kernel_bind, as in theory they could have the same issue. However, I ended up not including these in my patch because there weren’t any obvious scenarios that I could test that would be broken. The challenge with testing these calls is that you can’t call them directly from user space, so I’m limited to code paths reachable by using features from user space. For example, testing the change to kernel_sendmsg isn’t possible with NFS, because the NFS client code doesn’t use the msg_name field in msg even when operating in UDP mode.

AFAIK the kernel equivalent to getpeername isn’t at risk, because the address parameter is supposed to be set by the call.
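
(For illustration, the kernel_bind() half of that drafted change would presumably take the same shape; this is a sketch, not the actual draft:)

	int kernel_bind(struct socket *sock, struct sockaddr *addr, int addrlen)
	{
		struct sockaddr_storage address;

		/* Same idea as kernel_connect(): keep any BPF bind hook's
		 * rewrite off the caller's copy of the address. */
		memcpy(&address, addr, addrlen);

		return sock->ops->bind(sock, (struct sockaddr *)&address,
				       addrlen);
	}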

I discovered this problem exists with SMB mounts as well due to this code:

	rc = socket->ops->connect(socket, saddr, slen,
				  server->noblockcnt ? O_NONBLOCK : 0);

which skips kernel_connect() and just calls socket->ops->connect(). One way to fix it is with this patch:

	rc = kernel_connect(socket, saddr, slen, server->noblockcnt ? O_NONBLOCK : 0);

However, there are a lot of other places in the kernel (Ceph driver included) where a raw call to socket->ops->connect() exists, and it’s not clear which of these might break when using Cilium. A better solution would be to push the address copy downward a bit in the stack from kernel_connect to where pre_connect() is called:

  • net/ipv4/af_inet.c
  • net/mptcp/protocol.c

Otherwise it will just turn into a game of whac-a-mole. This does lead to an unnecessary address copy in cases where the code path is reached from sys_connect(), but it doesn’t seem like this would have an appreciable performance impact.
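
(To make that concrete, here is a rough sketch of the push-down at the stream-connect call site in net/ipv4/af_inet.c; illustrative only, with surrounding checks elided, and not a tested patch:)

	/* Inside __inet_stream_connect(): copy the (possibly kernel-owned)
	 * address once, so the BPF pre_connect/connect hooks and the
	 * protocol connect only ever see the local copy. */
	struct sockaddr_storage address;

	if (addr_len > sizeof(address)) {
		err = -EINVAL;
		goto out;
	}
	memcpy(&address, uaddr, addr_len);
	uaddr = (struct sockaddr *)&address;

	if (BPF_CGROUP_PRE_CONNECT_ENABLED(sk)) {
		err = sk->sk_prot->pre_connect(sk, uaddr, addr_len);
		if (err)
			goto out;
	}

	err = sk->sk_prot->connect(sk, uaddr, addr_len);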

FYI @borkmann

@Rid Looks like I'm running kube-proxy in iptables mode. I deployed kube-proxy and set kube-proxy-replacement to disabled, but found that I needed to reboot my nodes before NFS reconnects worked the way I wanted. Before rebooting all my nodes, I tried restarting Cilium agents, restarting my NFS client pod, etc., but nothing worked until I rebooted the nodes. I'm assuming some other kernel state or config was cleared by the reboots, but I'm not sure what.

@Rid Yes, the issue is that there’s only one copy of the address involved here. So any changes to this address will actually change the address in rpc_xprt.

My understanding is:

  1. The final step in tcp_v4_pre_connect is a call to any attached BPF hooks; this is where Cilium runs.
  2. Cilium sees that the address (10.102.48.51) is a service address and rewrites it (in place) with an actual pod IP.
  3. tcp_v4_connect is called with the rewritten address.

Any rewrite by Cilium in step 2 directly changes the address stored inside the rpc_xprt struct, so the RPC client effectively loses track of the service IP.

When we subsequently force a reconnect using tcpkill, the RPC client reconnects using the only address it knows about, which is now the pod IP (see the second xs_connect). This means Cilium can no longer rewrite the service address to a new pod IP if that becomes necessary, since it never sees the service address again.

I’m guessing that usually the address here is just a copy of something from userspace and can be rewritten without issue, but since the rpc client is part of the kernel it’s able to pass a pointer directly to its internal data instead. This probably breaks an assumption made somewhere.
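
(Roughly, paraphrased from net/sunrpc/xprtsock.c; details from memory, so treat this as a sketch:)

	/* The RPC transport hands the socket layer a pointer INTO its own
	 * rpc_xprt rather than a copy of the server address. */
	static inline struct sockaddr *xs_addr(struct rpc_xprt *xprt)
	{
		return (struct sockaddr *)&xprt->addr;
	}

	/* ...later, when (re)connecting the transport: */
	ret = kernel_connect(sock, xs_addr(xprt), xprt->addrlen, O_NONBLOCK);
	/* If a BPF connect hook rewrites the sockaddr in place, it has just
	 * overwritten xprt->addr itself, and the ClusterIP is gone. */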

Can you provide the Cilium sysdump?

I'm just setting up a cluster on DigitalOcean to test the reproducer outside of our setup. I will send you a message on Slack to give you access.

Is 10.96.185.162 reachable from kind-worker?

Nope. But this gives me a good hint about what might be happening. I will comment in a bit.