kubernetes: apiserver cannot recover after restarting all apiservers in a large-scale cluster (5k nodes, 150k pods)

On a large cluster with 5k nodes and 150k pods (more than 500MB of pod data), there are 5 master nodes, each running one apiserver and one etcd member. When all the apiservers are restarted, the memory of the etcd members rapidly grows to about 100GB, the etcd leader keeps being re-elected, and the etcd log shows that some requests take a very long time:

2018-07-02 17:17:43.986312 W | etcdserver: apply entries took too long [52.361859051s for 1 entries]
2018-07-02 17:17:43.986341 W | etcdserver: avoid queries with large range/delete range!

After investigating: the kubelet and other daemonsets on every node list-watch all pods with a nodeName fieldSelector (the initial LIST request is sent with resourceVersion=0 and the nodeName fieldSelector). While the apiserver's pod watch cache is not yet ready, the apiserver forwards these LIST requests to etcd. Because the pod count is so large, it takes almost 60s for the pod watch cache to become ready. In addition, when an apiserver restarts, every pod list-watch from the kubelets and daemonsets re-sends its LIST request in parallel; all of them are forwarded to etcd, which puts huge pressure on it. Forwarding the LIST to etcd in this case is intentional, introduced to fix https://github.com/kubernetes/kubernetes/issues/49684 via https://github.com/kubernetes/kubernetes/pull/49688.
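For reference, a minimal sketch (not the kubelet's exact code) of the list-watch pattern described above, using client-go: every node-level component lists and watches pods filtered by spec.nodeName, and the informer's reflector issues the initial LIST with resourceVersion=0. The node name and handler are placeholders.

    package main

    import (
    	"fmt"

    	v1 "k8s.io/api/core/v1"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/apimachinery/pkg/fields"
    	"k8s.io/client-go/kubernetes"
    	"k8s.io/client-go/rest"
    	"k8s.io/client-go/tools/cache"
    )

    func main() {
    	// In-cluster config; a kubeconfig-based config works the same way.
    	cfg, err := rest.InClusterConfig()
    	if err != nil {
    		panic(err)
    	}
    	client := kubernetes.NewForConfigOrDie(cfg)

    	nodeName := "node-1" // hypothetical node name

    	// Only pods scheduled to this node are of interest, so both the LIST
    	// and the WATCH carry a spec.nodeName field selector.
    	lw := cache.NewListWatchFromClient(
    		client.CoreV1().RESTClient(), "pods", metav1.NamespaceAll,
    		fields.OneTermEqualSelector("spec.nodeName", nodeName))

    	// The reflector behind this informer sends the initial LIST with
    	// resourceVersion=0, which the apiserver normally serves from its
    	// watch cache; while the cache is warming up, the request is
    	// forwarded to etcd instead.
    	_, controller := cache.NewInformer(lw, &v1.Pod{}, 0,
    		cache.ResourceEventHandlerFuncs{
    			AddFunc: func(obj interface{}) {
    				fmt.Printf("observed pod %s\n", obj.(*v1.Pod).Name)
    			},
    		})

    	stop := make(chan struct{})
    	defer close(stop)
    	controller.Run(stop)
    }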

I tried to change this behavior in a test: return an error when the watch cache is not ready. Before the cache is ready, all of these list-watch clients receive a 500 status code; after the cache becomes ready, all the LIST requests are sent to the apiserver again in parallel. The etcd cluster then stays healthy, but the pressure shifts to the apiserver. The pprof profile shows it is busy filtering the pods of a specific node out of all 150k pods.
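A minimal, self-contained sketch of the behavior I tested (the real change lives inside the apiserver's cacher, so the names and structure here are only illustrative): while the watch cache is warming up, LISTs fail fast with a 5xx instead of falling through to etcd; once the cache flips to ready, the clients retry and the whole burst of node-scoped LISTs hits the apiserver in parallel.

    package main

    import (
    	"net/http"
    	"sync/atomic"
    	"time"
    )

    // cacheReady stands in for the apiserver's internal "watch cache is
    // synced" signal; the real implementation gates on the cacher's
    // readiness, not a global flag.
    var cacheReady atomic.Value

    func listPodsHandler(w http.ResponseWriter, r *http.Request) {
    	if ready, _ := cacheReady.Load().(bool); !ready {
    		// Old behavior: forward the LIST (with fieldSelector) to etcd.
    		// Tested behavior: reject it, so etcd never sees the burst.
    		http.Error(w, "pod watch cache is not ready, retry later",
    			http.StatusInternalServerError)
    		return
    	}
    	// ...serve the filtered pod list from the now-warm watch cache...
    	w.WriteHeader(http.StatusOK)
    }

    func main() {
    	cacheReady.Store(false)
    	go func() {
    		// In this cluster the initial cache fill takes ~60s; simulate it.
    		time.Sleep(60 * time.Second)
    		cacheReady.Store(true)
    	}()
    	http.HandleFunc("/api/v1/pods", listPodsHandler)
    	_ = http.ListenAndServe(":8080", nil)
    }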

File: kube-apiserver
Type: cpu
Time: Jul 5, 2018 at 5:26am (-07)
Duration: 30.14s, Total samples = 3.73mins (742.04%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 164.39s, 73.51% of 223.62s total
Dropped 935 nodes (cum <= 1.12s)
Showing top 10 nodes out of 96
      flat  flat%   sum%        cum   cum%
    40.78s 18.24% 18.24%    181.28s 81.07%  k8s.io/kubernetes/pkg/registry/core/pod.GetAttrs
    25.50s 11.40% 29.64%    119.42s 53.40%  runtime.mapassign
    23.53s 10.52% 40.16%     23.53s 10.52%  runtime.procyield
    17.91s  8.01% 48.17%     17.91s  8.01%  runtime.memclrNoHeapPointers
    12.11s  5.42% 53.59%     12.11s  5.42%  runtime.futex
    11.35s  5.08% 58.66%     11.35s  5.08%  runtime.heapBitsForObject
       11s  4.92% 63.58%     27.96s 12.50%  runtime.scanobject
     8.84s  3.95% 67.53%      8.84s  3.95%  runtime.memmove
     7.12s  3.18% 70.72%      7.16s  3.20%  runtime.greyobject
     6.25s  2.79% 73.51%      6.25s  2.79%  runtime.memeqbody
(pprof) tree pod.GetAttrs
Showing nodes accounting for 176.74s, 79.04% of 223.62s total
Dropped 69 nodes (cum <= 1.12s)
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                           181.28s   100% |   k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage.(*SelectionPredicate).Matches
    40.78s 18.24% 18.24%    181.28s 81.07%                | k8s.io/kubernetes/pkg/registry/core/pod.GetAttrs
                                           119.07s 65.68% |   runtime.mapassign
                                            19.35s 10.67% |   runtime.makemap
                                             2.07s  1.14% |   runtime.writebarrierptr
----------------------------------------------------------+-------------
                                           119.07s   100% |   k8s.io/kubernetes/pkg/registry/core/pod.GetAttrs
    25.47s 11.39% 29.63%    119.07s 53.25%                | runtime.mapassign
                                            78.31s 65.77% |   runtime.newarray
                                            12.77s 10.72% |   runtime.typedmemmove
                                             1.85s  1.55% |   runtime.aeshashbody
                                             0.24s   0.2% |   runtime.writebarrierptr

In the apiserver log, filtering takes a lot of time, which happens when LIST requests with a fieldSelector arrive in parallel:

I0706 07:17:07.904668       7 trace.go:76] Trace[1077488077]: "cacher *api.Pod: List" (started: 2018-07-06 07:16:43.341190891 +0000 UTC) (total time: 24.563439891s):
Trace[1077488077]: [14.682341166s] [14.68234064s] watchCache locked acquired
Trace[1077488077]: [15.232549756s] [550.208282ms] Listed 148431 items from cache
Trace[1077488077]: [24.563432917s] [9.330883161s] Filtered 33 items
I0706 07:17:07.906590       7 trace.go:76] Trace[1612848086]: "cacher *api.Pod: List" (started: 2018-07-06 07:16:28.629546457 +0000 UTC) (total time: 39.277009968s):
Trace[1612848086]: [2.257410319s] [2.257409725s] watchCache locked acquired
Trace[1612848086]: [39.277001772s] [37.003565308s] Filtered 33 items
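To illustrate where that time goes, here is a simplified, self-contained model (not the actual apiserver code): for each LIST with a fieldSelector, the cacher walks every cached pod and calls pod.GetAttrs, which allocates a fresh map of selectable fields per pod, before matching the selector. With ~150k pods and many such LISTs in parallel, the map allocations (runtime.mapassign/makemap in the profile) and the resulting GC work dominate the CPU, even though each request only keeps ~30 pods.

    package main

    import "fmt"

    type pod struct {
    	name     string
    	nodeName string
    	phase    string
    }

    // getAttrs mimics the role of pkg/registry/core/pod.GetAttrs: it builds a
    // new map of selectable fields for one pod (the real function returns
    // labels plus a fields.Set with more entries).
    func getAttrs(p pod) map[string]string {
    	return map[string]string{
    		"metadata.name": p.name,
    		"spec.nodeName": p.nodeName,
    		"status.phase":  p.phase,
    	}
    }

    // filterByNode models one LIST with a spec.nodeName fieldSelector served
    // from the watch cache: every cached pod is visited and a per-pod map is
    // allocated just to check the selector.
    func filterByNode(all []pod, nodeName string) []pod {
    	var out []pod
    	for _, p := range all {
    		if getAttrs(p)["spec.nodeName"] == nodeName {
    			out = append(out, p)
    		}
    	}
    	return out
    }

    func main() {
    	// 150k pods spread across 5k nodes, roughly the cluster in this report.
    	pods := make([]pod, 150000)
    	for i := range pods {
    		pods[i] = pod{
    			name:     fmt.Sprintf("pod-%d", i),
    			nodeName: fmt.Sprintf("node-%d", i%5000),
    			phase:    "Running",
    		}
    	}
    	// One LIST keeps only ~30 pods but pays for 150k map allocations.
    	fmt.Println(len(filterByNode(pods, "node-42")))
    }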

The impact is that, with the apiserver's liveness probe configured as

    livenessProbe:
      failureThreshold: 3
      httpGet:
        httpHeaders:
        path: /healthz
        port: 6443
        scheme: HTTPS
      initialDelaySeconds: 60
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 15

the liveness probe fails and the apiserver keeps getting killed and restarted. When I changed initialDelaySeconds to 300s, all the list-watches did eventually succeed, although the latency was more than 30s. However, the watch cache of some apiservers could not keep up: kubelets connected to those apiservers kept getting conflicts when patching node status, and lots of nodes became NotReady. The apiserver log shows messages like these:

W0706 09:57:27.988402       7 watcher.go:236] Fast watcher, slow processing. Number of buffered events: 100.Probably caused by slow dispatching events to watchers
W0706 09:50:26.808609       7 watcher.go:343] Fast watcher, slow processing. Number of buffered events: 100.Probably caused by slow decoding, user not receiving fast, or other processing logic

There will be a large number of LIST requests of this kind. If the apiserver filters the whole resource set for each of them, especially when the total resource count is large and the queries run in parallel, a lot of time is wasted. Is there an index mechanism that could return a filtered list directly, i.e. store the objects in an indexed structure? That would save a lot of filtering time and mitigate the pressure on the apiserver. A rough sketch of what I mean is shown below.
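A rough sketch of the kind of index I have in mind (hypothetical, not an existing apiserver structure): keep the cached pods bucketed by spec.nodeName so that a node-scoped LIST only touches that node's bucket instead of scanning all 150k pods.

    package main

    import (
    	"fmt"
    	"sync"
    )

    type pod struct {
    	name     string
    	nodeName string
    }

    // nodeIndexedCache is a hypothetical watch-cache layout: pods are bucketed
    // by spec.nodeName, so a LIST with a nodeName fieldSelector becomes a
    // single map lookup instead of a scan over every cached pod.
    type nodeIndexedCache struct {
    	mu     sync.RWMutex
    	byNode map[string][]*pod
    }

    func newNodeIndexedCache() *nodeIndexedCache {
    	return &nodeIndexedCache{byNode: make(map[string][]*pod)}
    }

    func (c *nodeIndexedCache) add(p *pod) {
    	c.mu.Lock()
    	defer c.mu.Unlock()
    	c.byNode[p.nodeName] = append(c.byNode[p.nodeName], p)
    }

    // listByNode replaces the O(total pods) filter with an O(pods on node) lookup.
    func (c *nodeIndexedCache) listByNode(nodeName string) []*pod {
    	c.mu.RLock()
    	defer c.mu.RUnlock()
    	return c.byNode[nodeName]
    }

    func main() {
    	cache := newNodeIndexedCache()
    	for i := 0; i < 150000; i++ {
    		cache.add(&pod{
    			name:     fmt.Sprintf("pod-%d", i),
    			nodeName: fmt.Sprintf("node-%d", i%5000),
    		})
    	}
    	fmt.Println(len(cache.listByNode("node-42"))) // ~30 pods, no full scan
    }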

Environment:

  • 40 cores per apiserver; CPU usage can reach 4000%.
  • Kubernetes version (use kubectl version): release 1.9
  • max in-flight: 4000.
  • max mutating in-flight: 200 (default)

@smarterclayton @wojtek-t

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 3
  • Comments: 50 (25 by maintainers)

Most upvoted comments

@aojea - please assign to me - I will take a look later today