azure-container-networking: azure-npm randomly corrupts iptables rules, resulting in connection failures between pods

What happened:

We have 3 AKS clusters, 3 nodes each. Mix of B16 and D16 nodes.

We use NetworkPolicies to restrict traffic between pods. We define pretty strict rules (per-pod & namespace).

azure-npm randomly causes pod connections to fail; the iptables rules appear to be getting corrupted.

Sometimes, when I inspect the clusters in the morning, I have multiple pods failing due to azure-npm corrupting the iptables, without having changed or touched anything overnight.

For example, we use argocd to manage our deployments. argocd-server will randomly fail to contact the argocd-repo-server.

Killing the azure-npm pods solves the problem. But this is not a viable solution.

I sometimes see this error in azure-npm

2021/04/23 07:06:04 [1] Error: There was an error running command: [ipset -X -exist azure-npm-784554818] Stderr: [exit status 1, ipset v7.5: Set cannot be destroyed: it is in use by a kernel component[]
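
ipset refuses to destroy a set while something in the kernel still holds a reference to it, typically an iptables rule using --match-set (or a list:set that contains it). A minimal sketch for checking this, assuming the output of iptables-save has been captured from the affected node as text: it pulls the set name out of the error line and lists the rules that still reference it. The function names, the chain name EXAMPLE-CHAIN, and the sample rules are hypothetical and only for illustration; they are not taken from azure-npm.

```python
import re
from typing import List, Optional


def set_name_from_error(log_line: str) -> Optional[str]:
    """Extract the ipset name from an 'ipset -X -exist <name>' error line."""
    match = re.search(r"ipset -X -exist ([^\]\s]+)", log_line)
    return match.group(1) if match else None


def rules_referencing_set(iptables_save_text: str, set_name: str) -> List[str]:
    """Return the '-A' rule lines that reference set_name via --match-set."""
    # Require whitespace or end of line after the name so that a longer
    # set name sharing the same prefix is not matched by accident.
    pattern = re.compile(r"--match-set\s+" + re.escape(set_name) + r"(?:\s|$)")
    return [
        line
        for line in iptables_save_text.splitlines()
        if line.startswith("-A ") and pattern.search(line)
    ]


# Error line copied from the azure-npm log above.
LOG_LINE = (
    "2021/04/23 07:06:04 [1] Error: There was an error running command: "
    "[ipset -X -exist azure-npm-784554818] Stderr: [exit status 1, "
    "ipset v7.5: Set cannot be destroyed: it is in use by a kernel component[]"
)

# Hypothetical iptables-save excerpt; real chain names on a node will differ.
SAMPLE_RULES = """\
*filter
-A EXAMPLE-CHAIN -m set --match-set azure-npm-784554818 dst -j ACCEPT
-A EXAMPLE-CHAIN -m set --match-set azure-npm-7845548180 src -j DROP
-A EXAMPLE-CHAIN -p tcp -j ACCEPT
COMMIT
"""

if __name__ == "__main__":
    name = set_name_from_error(LOG_LINE)
    if name is not None:
        for rule in rules_referencing_set(SAMPLE_RULES, name):
            print(rule)
```

If the list comes back non-empty for real iptables-save output, those rules would have to be removed before the set can be destroyed. An empty list would point to another holder of the reference, such as a list:set.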

What you expected to happen:

azure-npm to correctly define IP-table rules. We would expect AKS to have a bug-free CNI, as this is such a critical component of the infrastructure!

I tried upgrading azure-npm to 1.3.0, but it seems that AKS manages this automatically and downgrades it back to 1.1.8.

How to reproduce it:

Very hard to say. It sometimes happens when the labels on the pods/namespaces change, but it also happens randomly. Help in debugging this would be greatly appreciated.

Orchestrator and Version (e.g. Kubernetes, Docker):

AKS 1.20.2, azure-npm mcr.microsoft.com/containernetworking/azure-npm:v1.1.8

Operating System (Linux/Windows):

Linux

Kernel (e.g. uname -a for Linux or $(Get-ItemProperty -Path "C:\windows\system32\hal.dll").VersionInfo.FileVersion for Windows):

5.4.0-1043-azure #45~18.04.1-Ubuntu SMP Sat Mar 20 16:16:05 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Anything else we need to know?: [Miscellaneous information that will assist in solving the issue.]

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 17 (9 by maintainers)

Most upvoted comments

Thanks for the input. I checked it without the IPv6 CIDR block and it still does not work. I asked to have our case assigned to you. Thanks for your help!

It seems that the issue is not present anymore

@BenjaminHerbert thank you for the debugging session. As discussed, you are hitting known issue #870.

@neaggarwMS I have opened a support issue with ID 2105040050002654 and will discuss the private details there.