kubernetes: Kube-proxy facing locking timeout in large clusters during load test with services enabled
Follows from discussion in https://github.com/kubernetes/kubernetes/issues/48052
We noticed this while performing a load test on 4000-node clusters with services enabled. The iptables-restore step in the proxier fails with:
E0625 09:03:14.873338 5 proxier.go:1574] Failed to execute iptables-restore: failed to acquire old iptables lock: timed out waiting for the condition
The reason is quite likely the "huge" size of the iptables ruleset (tens of MBs): we run 30 pods per node and each pod belongs to exactly one service, so 30 * 4000 = 120k service endpoints (and these updates happen on all 4000 nodes).
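For context, iptables serializes concurrent writers with a lock (the /run/xtables.lock file on newer versions, a legacy abstract socket on older ones), and kube-proxy has to grab it before running iptables-restore. Below is a minimal, hypothetical Go sketch of the file-lock variant only — not the actual kube-proxy code, whose locking and timeouts differ — just to illustrate how this kind of acquisition times out when another writer holds the lock for too long:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"time"
)

const xtablesLockPath = "/run/xtables.lock" // conventional iptables lock file on Linux

// grabXtablesLock polls for an exclusive, non-blocking flock() on the lock
// file and gives up after the timeout. This is roughly the failure mode seen
// above: some other writer (another iptables/iptables-restore invocation, a
// CNI plugin, ...) keeps the lock longer than we are willing to wait.
func grabXtablesLock(interval, timeout time.Duration) (*os.File, error) {
	f, err := os.OpenFile(xtablesLockPath, os.O_CREATE, 0600)
	if err != nil {
		return nil, err
	}
	deadline := time.Now().Add(timeout)
	for {
		err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB)
		if err == nil {
			return f, nil // lock held; caller must Close() the file to release it
		}
		if time.Now().After(deadline) {
			f.Close()
			return nil, fmt.Errorf("timed out waiting for the xtables lock: %v", err)
		}
		time.Sleep(interval)
	}
}

func main() {
	f, err := grabXtablesLock(200*time.Millisecond, 5*time.Second)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()
	fmt.Println("xtables lock acquired; safe to run iptables-restore now")
}
```

With a ruleset this large, a single iptables-restore by a previous writer can plausibly outlast such a timeout on its own, so contention is not even required for the error to appear.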
cc @kubernetes/sig-network-misc @kubernetes/sig-scalability-misc @danwinship @wojtek-t
About this issue
- State: closed
- Created 7 years ago
- Comments: 66 (58 by maintainers)
Commits related to this issue
- Merge pull request #48514 from freehan/iptables-lock Automatic merge from submit-queue (batch tested with PRs 47234, 48410, 48514, 48529, 48348) expose error lock release failure from iptables util ... — committed to kubernetes/kubernetes by deleted user 7 years ago
- Merge pull request #65216 from wojtek-t/log_long_iptables_operations Automatic merge from submit-queue (batch tested with PRs 65152, 65199, 65179, 64598, 65216). If you want to cherry-pick this chang... — committed to kubernetes/kubernetes by deleted user 6 years ago
- Merge pull request #65179 from shyamjvs/reduce-service-endpoints-in-load-test Automatic merge from submit-queue (batch tested with PRs 65152, 65199, 65179, 64598, 65216). If you want to cherry-pick t... — committed to kubernetes/kubernetes by deleted user 6 years ago
I think there can be multiple reasons. I can share what I did to track down a problem with the portmap plugin holding the lock.
I just patched the kubelet (https://github.com/kubernetes/kubernetes/pull/85727/commits/22665fc8dd93f5623c6a00af9b834f803eaf295a) to log the pid of the process, then monitored kube-proxy to trigger a script when it finds the error message; the script dumps all the processes so you can check "who" was holding the lock.
Hope this can be useful.
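In the same spirit, here is a rough Go sketch (not the script referenced above, and only an illustration) of one way to answer the "who holds the lock" question on Linux: stat the lock file, look for a matching inode in /proc/locks, and read the holder's /proc/<pid>/cmdline. Matching by inode alone is a simplification that ignores the device number, which is usually good enough on a single node.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
)

const xtablesLockPath = "/run/xtables.lock"

func main() {
	// Resolve the lock file's inode so we can match it against /proc/locks.
	fi, err := os.Stat(xtablesLockPath)
	if err != nil {
		fmt.Fprintf(os.Stderr, "stat %s: %v\n", xtablesLockPath, err)
		os.Exit(1)
	}
	inode := strconv.FormatUint(fi.Sys().(*syscall.Stat_t).Ino, 10)

	data, err := os.ReadFile("/proc/locks")
	if err != nil {
		fmt.Fprintf(os.Stderr, "read /proc/locks: %v\n", err)
		os.Exit(1)
	}

	// /proc/locks lines look like:
	//   1: FLOCK  ADVISORY  WRITE 1234 00:31:4330801 0 EOF
	// field 4 is the holder's PID, field 5 is MAJOR:MINOR:INODE.
	for _, line := range strings.Split(string(data), "\n") {
		fields := strings.Fields(line)
		if len(fields) < 6 {
			continue
		}
		devInode := strings.Split(fields[5], ":")
		if len(devInode) != 3 || devInode[2] != inode {
			continue
		}
		pid := fields[4]
		cmdline, _ := os.ReadFile("/proc/" + pid + "/cmdline")
		cmd := strings.TrimSpace(strings.ReplaceAll(string(cmdline), "\x00", " "))
		fmt.Printf("%s is locked by pid %s (%s)\n", xtablesLockPath, pid, cmd)
	}
}
```

Running something like this the moment kube-proxy logs the lock error gives a point-in-time answer to who the competing writer is, which is essentially what the process dump above was used for.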