cni-ipvlan-vpc-k8s: Inconsistency between ENI allocated IPs and OS configuration

We are seeing an issue that seems to happen regularly: some pods end up with no network connectivity.

After looking into the configuration, it turns out that when this happens we are in the following situation:

  • the pod sandbox is configured properly (veth and ipvlan interfaces, as well as the expected routes)
  • the IP of the pod is no longer associated with the ENI, so traffic is dropped by the VPC (see the sketch after this list for how we verify it)
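
To confirm the second point, we compare the IP the kubelet reports for the pod against the private IPs that AWS reports for the node's ENI. Below is a minimal sketch of that check using aws-sdk-go; the ENI ID and pod IP are passed in as arguments and are placeholders, not values from our cluster:

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	eniID := os.Args[1] // e.g. "eni-0123456789abcdef0" (placeholder)
	podIP := os.Args[2] // IP reported by `kubectl get pod -o wide`

	sess := session.Must(session.NewSession())
	svc := ec2.New(sess)

	// Ask EC2 for the private IPs currently assigned to the ENI.
	out, err := svc.DescribeNetworkInterfaces(&ec2.DescribeNetworkInterfacesInput{
		NetworkInterfaceIds: []*string{aws.String(eniID)},
	})
	if err != nil {
		log.Fatal(err)
	}

	for _, eni := range out.NetworkInterfaces {
		for _, addr := range eni.PrivateIpAddresses {
			if aws.StringValue(addr.PrivateIpAddress) == podIP {
				fmt.Println("pod IP is still assigned to the ENI")
				return
			}
		}
	}
	// If we get here, the sandbox may still be wired up in the OS
	// while the VPC no longer routes the address to this instance.
	fmt.Println("pod IP is NOT assigned to the ENI")
}
```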

After looking into the logs, we found the following:

  • CloudTrail shows a call to unassign the IP address from the ENI (which seems to indicate that the CNI plugin was called with DELETE), but the routes and iptables rules are still there (a sketch of how we look up these calls follows this list)
  • the sandbox itself is not deleted. We found some errors in the kubelet logs; not sure if this is related:
failed to remove pod init container "consul-template": failed to get container status "371295090acf33795fe5badb07063021cace4fcff719cd13effc6ff2b5136f70": rpc error: code = Unknown desc = Error: No such container: 371295090acf33795fe5badb07063021cace4fcff719cd13effc6ff2b5136f70; Skipping pod "alerting-metric-evaluator-anomaly-0_datadog(4c15f7d2-5783-11e8-903a-02fc6d7aa9b8)"
  • the kubelet tries to restart the containers in the same sandbox, which fails because the pod has no network connectivity (required by the init container)
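
For completeness, this is roughly how we pull the unassign calls out of CloudTrail. We are assuming the event name is UnassignPrivateIpAddresses (the EC2 API call that releases a secondary IP from an ENI); a minimal sketch with aws-sdk-go:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudtrail"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := cloudtrail.New(sess)

	// Look up recent UnassignPrivateIpAddresses events; the raw event JSON
	// includes the ENI, the IPs that were removed, and the caller identity.
	out, err := svc.LookupEvents(&cloudtrail.LookupEventsInput{
		LookupAttributes: []*cloudtrail.LookupAttribute{{
			AttributeKey:   aws.String(cloudtrail.LookupAttributeKeyEventName),
			AttributeValue: aws.String("UnassignPrivateIpAddresses"),
		}},
	})
	if err != nil {
		log.Fatal(err)
	}

	for _, ev := range out.Events {
		fmt.Printf("%v %s\n%s\n\n",
			aws.TimeValue(ev.EventTime),
			aws.StringValue(ev.EventName),
			aws.StringValue(ev.CloudTrailEvent))
	}
}
```

The raw CloudTrailEvent JSON contains the ENI ID and the removed addresses, which is what lets us correlate the unassign call with the affected pod.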

Any idea what could trigger this situation? Our current setup uses Docker, kubelet 1.10, and the latest version of the CNI plugin.

I think SkipDeallocation could probably help, but I'd like to understand exactly what is happening.

I also wonder whether more verbose logging could help in this kind of situation (for instance, logging ADD/DELETE calls with their parameters).

About this issue

  • State: open
  • Created 6 years ago
  • Comments: 16 (13 by maintainers)

Most upvoted comments

Initial testing looks good; we are going to deploy to a larger cluster.

We're shipping an RC later this week that I'm hopeful will address the issue you've been hitting; this is part of a refactor in conjunction with our move to k8s 1.10. Will keep you updated.