meshnet-cni: Bug: Meshnet `cmdDel` is not idempotent, leading to state changes when CNI ADD and CNI DEL commands overlap

I need help debugging this, b/c I really don’t know what’s going on.

When I create pods connected with meshnet, sometimes meshnet doesn't add the declared interfaces. This can happen at deployment or at redeployment of a static pod. I set up my static pods with init containers that wait for the expected interfaces to be added, so it's easy to see when this happens: the Pod gets stuck in Init:0.
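
For reference, the wait in my init container is conceptually equivalent to this little Go program (the interface list and the timeout are placeholders for whatever the topology declares):

package main

import (
    "fmt"
    "net"
    "os"
    "time"
)

// hasInterface reports whether the named interface exists in this pod's netns.
func hasInterface(name string) bool {
    _, err := net.InterfaceByName(name)
    return err == nil
}

func main() {
    expected := []string{"eth4"} // placeholder: the interfaces declared for this pod
    deadline := time.Now().Add(2 * time.Minute)

    for time.Now().Before(deadline) {
        ready := true
        for _, name := range expected {
            if !hasInterface(name) {
                ready = false
            }
        }
        if ready {
            fmt.Println("all expected interfaces are present")
            return
        }
        time.Sleep(2 * time.Second)
    }
    fmt.Fprintln(os.Stderr, "timed out waiting for meshnet interfaces")
    os.Exit(1)
}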

What I need to be able to do is reliably delete and restore static Pods from a topology. I do this by deleting the Pod, and /most/ of the time this works flawlessly, but sometimes it doesn't. It's difficult to reproduce reliably. When it happens, I can sometimes recover by deleting the entire topology (all pods and topos) and re-deploying it. Sometimes the condition persists and I have to do a rollout restart of meshnet. And sometimes I have to reboot the node or the whole cluster to recover.

Looking at the meshnet log, I can see that when this happens meshnet thinks the peer pod isn't up. As an example, at startup lf1 and lf5 link fine:

time="2023-03-22T15:27:06Z" level=info msg="Creating Veth struct with NetNS:/var/run/netns/cni-83b86931-410c-4f80-9846-bf9539cee642 and intfName: eth4, IP:192.168.0.6/31"
time="2023-03-22T15:27:06Z" level=info msg="Pod lf1 is retrieving peer pod lf5 information from meshnet daemon"
time="2023-03-22T15:27:06Z" level=info msg="Is peer pod lf5 alive?: true"
time="2023-03-22T15:27:06Z" level=info msg="Peer pod lf5 is alive"
time="2023-03-22T15:27:06Z" level=info msg="lf1 and lf5 are on the same host"

Then I delete pod lf1, wait a bit, and re-deploy lf1 from the same resource declaration. Now I get:

time="2023-03-22T15:39:06Z" level=info msg="Creating Veth struct with NetNS:/var/run/netns/cni-fff6bab6-5bb0-e511-f423-282bde768df2 and intfName: eth4, IP:192.168.0.6/31"
time="2023-03-22T15:39:06Z" level=info msg="Pod lf1 is retrieving peer pod lf5 information from meshnet daemon"
time="2023-03-22T15:39:06Z" level=info msg="Is peer pod lf5 alive?: false"
time="2023-03-22T15:39:06Z" level=info msg="Peer pod lf5 isn't alive yet, continuing"

meshnetd’s log looks normal:

2023-03-22T10:39:06-05:00	time="2023-03-22T15:39:06Z" level=info msg="Retrieving lf5's metadata from K8s..." daemon=meshnetd
2023-03-22T10:39:06-05:00	time="2023-03-22T15:39:06Z" level=info msg="Reading pod lf5 from K8s" daemon=meshnetd
2023-03-22T10:39:06-05:00	time="2023-03-22T15:39:06Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=Get grpc.service=meshnet.v1beta1.Local grpc.start_time="2023-03-22T15:39:06Z" grpc.time_ms=8.889 peer.address="[::1]:46588" span.kind=server system=grpc
2023-03-22T10:39:06-05:00	time="2023-03-22T15:39:06Z" level=info msg="Skipping of pod lf5 by pod lf1" daemon=meshnetd
2023-03-22T10:39:06-05:00	time="2023-03-22T15:39:06Z" level=info msg="Reading pod lf1 from K8s" daemon=meshnetd
2023-03-22T10:39:06-05:00	time="2023-03-22T15:39:06Z" level=info msg="Update pod status lf1 from K8s" daemon=meshnetd
2023-03-22T10:39:06-05:00	time="2023-03-22T15:39:06Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=Skip grpc.service=meshnet.v1beta1.Local grpc.start_time="2023-03-22T15:39:06Z" grpc.time_ms=27.629 peer.address="[::1]:46588" span.kind=server system=grpc
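
As far as I can tell from those lines (again, my own simplification, not the daemon's real data structures), the Skip call just records in lf1's own Topology status that lf1 deferred the lf5 link, so the link is supposed to be created later when the other side's ADD runs:

// Illustrative bookkeeping only; the real daemon patches a Topology custom
// resource, here it's just an in-memory map keyed by pod name.
type topologyStatus struct {
    skipped []string // peers whose link creation was deferred
}

func recordSkip(statuses map[string]*topologyStatus, pod, peer string) {
    st, ok := statuses[pod]
    if !ok {
        st = &topologyStatus{}
        statuses[pod] = st
    }
    for _, p := range st.skipped {
        if p == peer {
            return // already recorded; repeated calls are harmless
        }
    }
    st.skipped = append(st.skipped, peer)
}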

I’ve tried /also/ deleting the lf1 and/or lf5 topo, but that doesn’t seem to have an effect.

If I recover, redeploy, and eventually recreate this condition, the affected pods or interfaces change, even with the same topology and pod declarations.

Am I bumping up against some resource limit somewhere?

About this issue

  • State: closed
  • Created a year ago
  • Comments: 26 (26 by maintainers)

Most upvoted comments

@Cerebus Yes, I was also working on handling multiple deletes. This is common K8s behavior but was not handled in meshnet. Pull #74 looks good to me, so I will stop work on my side and wait for pull #74 to come in.
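
For reference, the duplicate-delete behavior we want matches what the CNI spec already expects (a DEL should generally succeed even if some resources are missing). A rough sketch with illustrative names, not the actual plugin code:

// Illustrative only: a second DEL for the same pod finds nothing to clean up
// and must simply succeed without touching any other state.
func cmdDelSketch(recorded map[string]string, podName string) error {
    if _, ok := recorded[podName]; !ok {
        return nil // already cleaned up by an earlier DEL: report success
    }
    delete(recorded, podName) // first DEL: clear the recorded netns/IP state
    return nil
}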

We still need to handle interleaving of ADD and DEL: if a DEL arrives while an ADD is in motion, meshnet will be in trouble. I will work on this one.
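
The rough idea (an illustrative sketch only, not the eventual implementation): tag the state written at ADD time with the sandbox that wrote it, and let DEL clear only state that still belongs to the sandbox being torn down, so a late DEL for the old pod cannot wipe what the new pod's ADD just recorded.

// Illustrative only: state is owned by the sandbox whose ADD wrote it.
type podState struct {
    ownerID string // CNI ContainerID (or netns path) of the sandbox that ran ADD
    srcIP   string
    netNS   string
}

func cmdDelGuarded(states map[string]*podState, podName, ownerID string) {
    st, ok := states[podName]
    if !ok {
        return // nothing recorded: already cleaned up, so succeed quietly
    }
    if st.ownerID != ownerID {
        return // recorded state belongs to a newer sandbox; a stale DEL must leave it alone
    }
    delete(states, podName)
}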