azure-container-networking: plugin type="azure-vnet" failed (add): Failed to initialize key-value store of network plugin: timed out locking store

What happened: Intermittently, hours after a node in k8s was under load and all load was removed, a node will begin failing to create pods due to the issue: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "d321c69cd17621b4e34c044e14a05acde6854f356fb856567c9f7a034ba4ee11": plugin type="azure-vnet" failed (add): Failed to initialize key-value store of network plugin: timed out locking store

When viewing the logs from azure-vnet.log I see entries like this:

2022/04/27 21:57:06 [37520] CNI_COMMAND environment variable set to ADD
2022/04/27 21:57:06 [37520] [Azure-Utils] ver
2022/04/27 21:57:06 [37520] Formatted Boot time: 2022-04-20T09:20:14Z
2022/04/27 21:57:06 [37520] Acquiring process lock
2022/04/27 21:57:08 [76172] CNI_COMMAND environment variable set to ADD
2022/04/27 21:57:08 [76172] [Azure-Utils] ver
2022/04/27 21:57:08 [76172] Formatted Boot time: 2022-04-20T09:20:14Z
2022/04/27 21:57:08 [76172] Acquiring process lock
2022/04/27 21:57:10 [84620] CNI_COMMAND environment variable set to ADD
2022/04/27 21:57:10 [84620] [Azure-Utils] ver
2022/04/27 21:57:10 [84620] Formatted Boot time: 2022-04-20T09:20:14Z
2022/04/27 21:57:10 [84620] Acquiring process lock
2022/04/27 21:57:14 [45604] CNI_COMMAND environment variable set to ADD
2022/04/27 21:57:14 [45604] [Azure-Utils] ver
2022/04/27 21:57:14 [45604] Formatted Boot time: 2022-04-20T09:20:14Z
2022/04/27 21:57:14 [45604] Acquiring process lock
2022/04/27 21:57:16 [37520] [cni] Failed to lock store: timed out locking store.
2022/04/27 21:57:16 [37520] Failed to initialize key-value store of network plugin: timed out locking store
2022/04/27 21:57:16 [37520] Report plugin error

What you expected to happen:
A pod can be created after a load test.

How to reproduce it:
Load a windows node with many pods as fast as possible and delete them. Then attempt to create a single pod on that node. Repeat ad nauseum since this is intermittent.

Orchestrator and Version (e.g. Kubernetes, Docker):
Kubernetes, 1.21.7 Containerd, 1.6.1

Operating System (Linux/Windows):
Windows

Kernel (e.g. uanme -a for Linux or $(Get-ItemProperty -Path "C:\windows\system32\hal.dll").VersionInfo.FileVersion for Windows): 10.0.17763.2686 (WinBuild.160101.0800)

Anything else we need to know?: [Miscellaneous information that will assist in solving the issue.]

It seems like something has the file lock indefinitely. During a normal time frame the log Released process lock happens all the time but, before this issue happens, I don’t see Released process lock at then end of some requests (particularly CNI_COMMAND set to DEL). This may be a red herring (follow [78164] below):

2022/04/27 21:46:02 [78164] CNI_COMMAND environment variable set to DEL
2022/04/27 21:46:02 [78164] [Azure-Utils] ver
2022/04/27 21:46:02 [78164] Formatted Boot time: 2022-04-20T09:20:14Z
2022/04/27 21:46:02 [78164] Acquiring process lock
2022/04/27 21:46:02 [78164] Acquired process lock
2022/04/27 21:46:02 [78164] Connected to telemetry service
2022/04/27 21:46:02 [78164] [cni-net] Plugin azure-vnet version v1.4.22.
2022/04/27 21:46:02 [78164] [cni-net] Running on windows
2022/04/27 21:46:02 [78164] [net] Network interface: {Index:5 MTU:1500 Name:vEthernet (Ethernet 2) HardwareAddr:00:0d:3a:22:93:5b Flags:up|broadcast|multicast} with IP: [fe80::5026:2a15:7716:180e/64 10.206.39.16/22]
2022/04/27 21:46:02 [78164] [net] Network interface: {Index:1 MTU:-1 Name:Loopback Pseudo-Interface 1 HardwareAddr: Flags:up|loopback|multicast} with IP: [::1/128 127.0.0.1/8]
2022/04/27 21:46:02 [78164] [net] Restored state, &{Version:v1.4.22 TimeStamp:2022-04-27 21:45:51.1786315 +0000 UTC ExternalInterfaces:map[vEthernet (Ethernet 2):0xc00004a100] store:0xc000142940 netlink:0x2246b20 netio:0x2246b20 plClient:0x2246b20 Mutex:{state:0 sema:0}}
2022/04/27 21:46:02 [78164] External Interface &{Name:vEthernet (Ethernet 2) Networks:map[azure:0xc000107520] Subnets:[10.206.36.0/22] BridgeName: DNSInfo:{Suffix: Servers:[] Options:[]} MacAddress:00:0d:3a:22:93:5b IPAddresses:[] Routes:[] IPv4Gateway:0.0.0.0 IPv6Gateway:::}
2022/04/27 21:46:02 [78164] Number of endpoints: 43
2022/04/27 21:46:02 [78164] [cni-net] Plugin started.
2022/04/27 21:46:02 [78164] [cni-net] Processing DEL command with args {ContainerID:205288a5f9183f171ab50bf5560eb2ee5fb0b050e1a8848a4b443c83597adc9c Netns:b4ae0496-fb36-41df-81e2-f16016f3d546 IfName:eth0 Args:K8S_POD_NAMESPACE=laserprocloud-default-tmp973ae381pr41663;K8S_POD_NAME=b6f89650-b71b-4bb8-8f90-dc5f33e029b2-76d69f94d9-tnwl2;K8S_POD_INFRA_CONTAINER_ID=205288a5f9183f171ab50bf5560eb2ee5fb0b050e1a8848a4b443c83597adc9c;K8S_POD_UID=5c059fa1-66cc-4faf-8318-fa4a1ac0a649;IgnoreUnknown=1 Path:c:/k/azurecni/bin, StdinData:{"AdditionalArgs":[{"Name":"EndpointPolicy","Value":{"ExceptionList":["10.206.36.0/22","10.206.0.0/18"],"Type":"OutBoundNAT"}},{"Name":"EndpointPolicy","Value":{"DestinationPrefix":"100.91.0.0/16","NeedEncap":true,"Type":"ROUTE"}},{"Name":"EndpointPolicy","Value":{"Action":"Block","Direction":"Out","Priority":200,"Protocols":"6","RemoteAddresses":"168.63.129.16/32","RemotePorts":"80","RuleType":"Switch","Type":"ACL"}},{"Name":"EndpointPolicy","Value":{"Action":"Allow","Direction":"In","Priority":65500,"Type":"ACL"}},{"Name":"EndpointPolicy","Value":{"Action":"Allow","Direction":"Out","Priority":65500,"Type":"ACL"}}],"bridge":"azure0","capabilities":{"dns":true,"portMappings":true},"cniVersion":"0.3.0","dns":{"Nameservers":["100.91.0.10","168.63.129.16"],"Search":["svc.cluster.local"]},"ipam":{"type":"azure-vnet-ipam"},"mode":"bridge","name":"azure","runtimeConfig":{"dns":{"Servers":["100.91.0.10"],"Searches":["laserprocloud-default-tmp973ae381pr41663.svc.cluster.local","svc.cluster.local","cluster.local"],"Options":["ndots:5"]}},"type":"azure-vnet","windowsSettings":{"enableLoopbackDSR":true}}}.
2022/04/27 21:46:02 [78164] [cni-net] Read network configuration &{CNIVersion:0.3.0 Name:azure Type:azure-vnet Mode:bridge Master: AdapterName: Bridge:azure0 LogLevel: LogTarget: InfraVnetAddressSpace: IPV6Mode: ServiceCidrs: VnetCidrs: PodNamespaceForDualNetwork:[] IPsToRouteViaHost:[] MultiTenancy:false EnableSnatOnHost:false EnableExactMatchForPodName:false DisableHairpinOnHostInterface:false DisableIPTableLock:false CNSUrl: ExecutionMode: Ipam:{Type:azure-vnet-ipam Environment: AddrSpace: Subnet: Address: QueryInterval:} DNS:{Nameservers:[100.91.0.10 168.63.129.16] Domain: Search:[svc.cluster.local] Options:[]} RuntimeConfig:{PortMappings:[] DNS:{Servers:[100.91.0.10] Searches:[laserprocloud-default-tmp973ae381pr41663.svc.cluster.local svc.cluster.local cluster.local] Options:[ndots:5]}} WindowsSettings:{EnableLoopbackDSR:true} AdditionalArgs:[{Name:EndpointPolicy Value:[123 34 69 120 99 101 112 116 105 111 110 76 105 115 116 34 58 91 34 49 48 46 50 48 54 46 51 54 46 48 47 50 50 34 44 34 49 48 46 50 48 54 46 48 46 48 47 49 56 34 93 44 34 84 121 112 101 34 58 34 79 117 116 66 111 117 110 100 78 65 84 34 125]} {Name:EndpointPolicy Value:[123 34 68 101 115 116 105 110 97 116 105 111 110 80 114 101 102 105 120 34 58 34 49 48 48 46 57 49 46 48 46 48 47 49 54 34 44 34 78 101 101 100 69 110 99 97 112 34 58 116 114 117 101 44 34 84 121 112 101 34 58 34 82 79 85 84 69 34 125]} {Name:EndpointPolicy Value:[123 34 65 99 116 105 111 110 34 58 34 66 108 111 99 107 34 44 34 68 105 114 101 99 116 105 111 110 34 58 34 79 117 116 34 44 34 80 114 105 111 114 105 116 121 34 58 50 48 48 44 34 80 114 111 116 111 99 111 108 115 34 58 34 54 34 44 34 82 101 109 111 116 101 65 100 100 114 101 115 115 101 115 34 58 34 49 54 56 46 54 51 46 49 50 57 46 49 54 47 51 50 34 44 34 82 101 109 111 116 101 80 111 114 116 115 34 58 34 56 48 34 44 34 82 117 108 101 84 121 112 101 34 58 34 83 119 105 116 99 104 34 44 34 84 121 112 101 34 58 34 65 67 76 34 125]} {Name:EndpointPolicy Value:[123 34 65 99 116 105 111 110 34 58 34 65 108 108 111 119 34 44 34 68 105 114 101 99 116 105 111 110 34 58 34 73 110 34 44 34 80 114 105 111 114 105 116 121 34 58 54 53 53 48 48 44 34 84 121 112 101 34 58 34 65 67 76 34 125]} {Name:EndpointPolicy Value:[123 34 65 99 116 105 111 110 34 58 34 65 108 108 111 119 34 44 34 68 105 114 101 99 116 105 111 110 34 58 34 79 117 116 34 44 34 80 114 105 111 114 105 116 121 34 58 54 53 53 48 48 44 34 84 121 112 101 34 58 34 65 67 76 34 125]}]}.
2022/04/27 21:46:02 [78164] Get number of endpoints for ifname vEthernet (Ethernet 2) network azure
2022/04/27 21:46:02 [78164] Execution mode :
2022/04/27 21:46:02 [78164] Trying to retrieve endpoint id 205288a5-eth0
2022/04/27 21:46:02 [78164] Deleting endpoint:205288a5-eth0
2022/04/27 21:46:02 [78164] [net] Deleting endpoint 205288a5-eth0 from network azure.
2022/04/27 21:46:02 [78164] Trying to retrieve endpoint id 205288a5-eth0
2022/04/27 21:46:02 [78164] [net] Deleting hcn endpoint with id: b224a8c0-e093-4104-b397-4b8208c238b1
2022/04/27 21:46:11 [59832] CNI_COMMAND environment variable set to ADD
2022/04/27 21:46:11 [59832] [Azure-Utils] ver
2022/04/27 21:46:11 [59832] Formatted Boot time: 2022-04-20T09:20:14Z
2022/04/27 21:46:11 [59832] Acquiring process lock
2022/04/27 21:46:20 [65268] CNI_COMMAND environment variable set to DEL
2022/04/27 21:46:20 [65268] [Azure-Utils] ver
2022/04/27 21:46:20 [65268] Formatted Boot time: 2022-04-20T09:20:14Z
2022/04/27 21:46:20 [65268] Acquiring process lock
2022/04/27 21:46:20 [66304] CNI_COMMAND environment variable set to DEL
2022/04/27 21:46:20 [66304] [Azure-Utils] ver
2022/04/27 21:46:20 [66304] Formatted Boot time: 2022-04-20T09:20:14Z
2022/04/27 21:46:20 [66304] Acquiring process lock
2022/04/27 21:46:21 [84160] CNI_COMMAND environment variable set to DEL
2022/04/27 21:46:21 [84160] [Azure-Utils] ver
2022/04/27 21:46:21 [59832] [cni] Failed to lock store: timed out locking store.
2022/04/27 21:46:21 [59832] Failed to initialize key-value store of network plugin: timed out locking store
2022/04/27 21:46:21 [59832] Report plugin error

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 20 (8 by maintainers)

Most upvoted comments

@MugzyMoreno Apologies for the delay… I will follow up internally with another team to debug this…