cilium: kube-proxy-replacement breaks nfs reconnect
Is there an existing issue for this?
- I have searched the existing issues
Update: This would apply to any in-kernel client, as a different code path is used when updating the service IP for connections initiated from inside the kernel.
What happened?
We set up Cilium using:
cilium install --kube-proxy-replacement probe --helm-set k8sServiceHost=1.2.3.4,k8sServicePort=6443,global.etcd.enabled=true,global.etcd.managed=true,operator.replicas=3
We are using the NFS operator from here: https://github.com/openebs/dynamic-nfs-provisioner
We provision a pod attached to the NFS mounted volume, everything works successfully.
We then fake an outage on the node hosting the NFS server, and the NFS pod is rescheduled to another node.
When using kube-proxy-replacement, the mount within the pod consuming the NFS server hangs indefinitely with:
[147766.740926] NFSD: Unable to initialize client recovery tracking! (-22)
[147766.740929] NFSD: starting 90-second grace period (net f000174e)
[147792.384961] nfs: server 10.76.145.123 not responding, timed out
[147819.009203] nfs: server 10.76.145.123 not responding, still trying
[147972.610076] nfs: server 10.76.145.123 not responding, timed out
[147977.733893] nfs: server 10.76.145.123 not responding, timed out
[147980.801919] nfs: server 10.76.145.123 not responding, still trying
[148152.834725] nfs: server 10.76.145.123 not responding, timed out
[148157.954807] nfs: server 10.76.145.123 not responding, timed out
[148163.075008] nfs: server 10.76.145.123 not responding, timed out
10.76.145.123 is the cluster IP of the NFS service
service/nfs-pvc-729ceb2e-0959-4d81-bc96-94fd37a73061 ClusterIP 10.76.145.123 <none> 2049/TCP,111/TCP 42m openebs.io/nfs-server=nfs-pvc-729ceb2e-0959-4d81-bc96-94fd37a73061
When not using kube-proxy-replacement the mount reconnects automatically and everything works again.
Cilium Version
cilium-cli: v0.12.4 compiled with go1.19.1 on linux/amd64
cilium image (default): v1.12.2
cilium image (stable): v1.12.2
cilium image (running): v1.12.2
Kernel Version
Ubuntu 22.04
Linux dedi2-cplane1.23-106-58-144.lon-01.uk.appsolo.com 5.15.0-48-generic #54-Ubuntu SMP Fri Aug 26 13:26:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Kubernetes Version
v1.25.2
Sysdump
No response
Relevant log output
No response
Anything else?
No response
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- State: closed
- Created 2 years ago
- Comments: 42 (13 by maintainers)
This kernel patch fixes the issue.
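For context, here is a minimal sketch of the address-copy approach discussed in the comments below, assuming the fix lands in `kernel_connect()`; the actual linked patch may differ in detail. The idea is to copy the caller's address onto the stack before handing it to the protocol's `connect()`, so that a BPF `sock_addr` hook that rewrites the destination only modifies the copy and never the caller's own structure (such as the address stored inside the sunrpc transport):

```c
/* Sketch only, not the exact upstream diff: kernel_connect() in
 * net/socket.c with a defensive copy of the caller's address. */
int kernel_connect(struct socket *sock, struct sockaddr *addr, int addrlen,
		   int flags)
{
	struct sockaddr_storage address;

	if (addrlen < 0 || addrlen > (int)sizeof(address))
		return -EINVAL;

	/* Any rewrite by a BPF connect hook further down the stack now
	 * lands in this local copy, not in the caller's sockaddr. */
	memcpy(&address, addr, addrlen);

	return sock->ops->connect(sock, (struct sockaddr *)&address, addrlen,
				  flags);
}
```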
@aditighag Yes, I found that reconnect attempts are made when the remote connectivity breaks.
I actually drafted an earlier patch that performs an address copy for both `kernel_sendmsg` and `kernel_bind`, as in theory they could have the same issue. However, I ended up not including these in my patch because there weren't any obvious scenarios that I could test that would be broken. The challenge with testing these calls is that you can't call them directly from user space, so I'm limited to code paths reachable by using features from user space. For example, testing the change to `kernel_sendmsg` isn't possible with NFS, because the NFS client code doesn't use the `msg_name` field in `msg` even when operating in UDP mode.

AFAIK the kernel equivalent to `getpeername` isn't at risk, because the address parameter is supposed to be set by the call.

I discovered this problem exists with SMB mounts as well, due to this code, which skips `kernel_connect()` and just calls `socket->ops->connect()`. One way to fix it is with this patch. However, there are a lot of other places in the kernel (Ceph driver included) where a raw call to `socket->ops->connect()` exists, and it's not clear which of these might break when using Cilium. A better solution would be to push the address copy downward a bit in the stack, from `kernel_connect` to where `pre_connect()` is called:

net/ipv4/af_inet.c:
- `inet_dgram_connect`
- `inet_stream_connect`

net/mptcp/protocol.c:
- `mptcp_connect`

Otherwise it will just turn into a game of whac-a-mole. This does lead to an unnecessary address copy in cases where the code path is reached from `sys_connect()`, but it doesn't seem like this would have an appreciable performance impact.

FYI @borkmann
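To illustrate the alternative just described, here is a hedged sketch (not an actual kernel diff; names follow net/ipv4/af_inet.c) of what doing the copy in `inet_stream_connect()` could look like. Every caller would then be protected the same way, whether the call arrives via `sys_connect()`, `kernel_connect()`, or a raw `socket->ops->connect()`:

```c
/* Illustrative sketch only: pushing the defensive copy down to the
 * function that leads to pre_connect() (where the BPF connect hooks run). */
int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
			int addr_len, int flags)
{
	struct sockaddr_storage address;
	int err;

	if (addr_len < 0 || addr_len > (int)sizeof(address))
		return -EINVAL;

	/* Every caller now passes a private copy downward, so a BPF rewrite
	 * in pre_connect() can no longer reach the caller's own sockaddr. */
	memcpy(&address, uaddr, addr_len);

	lock_sock(sock->sk);
	err = __inet_stream_connect(sock, (struct sockaddr *)&address,
				    addr_len, flags, 0);
	release_sock(sock->sk);
	return err;
}
```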
@Rid Looks like I'm running `kube-proxy` in iptables mode. I deployed `kube-proxy` and set `kube-proxy-replacement` to `disabled`, but found that I needed to reboot my nodes before NFS reconnects worked as I wanted them to. Before rebooting all my nodes, I tried restarting Cilium agents, restarting my NFS client pod, etc., but nothing worked until I rebooted my nodes. I'm assuming some other kernel state or config was cleared by the reboots, but I'm not sure what.

@Rid Yes, the issue is that there's only one copy of the address involved here, so any changes to this address will actually change the address in `rpc_xprt`.
My understanding is:
Any rewrite by Cilium above directly changes the actual address inside the `rpc_xprt` struct, so the RPC client now effectively loses track of the service IP.
When we subsequently force a reconnect using `tcpkill`, the RPC client will reconnect using the only address it knows about, which is now the pod IP (see the second `xs_connect`). This means that Cilium can no longer rewrite the service address to any new pod IP if that becomes necessary, since it won't see the service IP anymore.
I'm guessing that usually the address here is just a copy of something from userspace and can be rewritten without issue, but since the RPC client is part of the kernel, it's able to pass a pointer directly to its internal data instead. This probably breaks an assumption made somewhere.
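To make that concrete, here is a minimal, hypothetical `cgroup/connect4` program (not Cilium's actual datapath; the addresses and port are hard-coded for illustration). The hook sees the sockaddr that the caller passed down, and writing `user_ip4`/`user_port` modifies that memory directly. For a `sys_connect()` call this is a kernel-side copy of a userspace buffer, but for an in-kernel caller such as the sunrpc client it can be a pointer into the caller's own long-lived state:

```c
// SPDX-License-Identifier: GPL-2.0
// Minimal illustrative cgroup/connect4 hook, NOT Cilium's implementation.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define SERVICE_IP 0x0a4c917bU /* 10.76.145.123 (cluster IP from above) */
#define BACKEND_IP 0x0a000001U /* hypothetical pod IP 10.0.0.1 */

SEC("cgroup/connect4")
int rewrite_service(struct bpf_sock_addr *ctx)
{
	if (ctx->user_ip4 == bpf_htonl(SERVICE_IP) &&
	    ctx->user_port == bpf_htons(2049)) {
		/* This write goes straight into the sockaddr the kernel was
		 * given; if that sockaddr is the rpc_xprt's own address, the
		 * RPC client permanently "forgets" the service IP. */
		ctx->user_ip4 = bpf_htonl(BACKEND_IP);
	}
	return 1; /* allow the connect() to proceed */
}

char LICENSE[] SEC("license") = "GPL";
```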
I'm just setting up a cluster on DigitalOcean to test the reproducer outside of our setup. I will send you a message on Slack to provide you with access.
Nope. But this gives me a good hint as to what might be happening. I will comment in a bit.