kubernetes: /stats/summary endpoints failing with "Context Deadline Exceeded" on Windows nodes with very high CPU usage

What happened: Windows nodes running with high (>90%) CPU usage are having calls to /stats/summary time out with “Context Deadline Exceeded”.

When this happens, metrics-server cannot gather accurate CPU/memory usage for the node or the pods running on it, which results in:

  • HPA not functioning
  • kubectl top node/pod returning <unknown> instead of values

What you expected to happen: Windows nodes should be able to run at high CPU utilization without negatively impacting Kubernetes functionality.

How to reproduce it (as minimally and precisely as possible):

  1. Create a cluster with Windows Server 2019 nodes
  2. Schedule CPU-intensive workloads onto the cluster (a sketch of such a workload follows this list)
  3. Scale the workload until the node reaches a sustained high-CPU state
  4. Query node metrics with kubectl top node
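A minimal sketch of such a workload, assuming a Windows Server 2019 node pool; the deployment name, image tag, and PowerShell busy-loop are illustrative, not the exact workload used in this report:

# Hypothetical CPU-burn deployment for Windows nodes (illustrative only).
# Scale replicas until CPU requests consume the node's full Allocatable.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: burn-cpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: burn-cpu
  template:
    metadata:
      labels:
        app: burn-cpu
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      containers:
      - name: burn
        image: mcr.microsoft.com/windows/servercore:ltsc2019
        # Busy-loop that pins one core; run one replica per allocatable CPU.
        command: ["powershell", "-Command", "while ($true) {}"]
        resources:
          requests:
            cpu: "1"
          limits:
            cpu: "1"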

Anything else we need to know?: This issue appears to be easier to reproduce on 1.18.x nodes than on 1.16.x or 1.17.x nodes.

metrics-server logs show

E1020 10:58:24.539304 1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:akswin000000: unable to fetch metrics from Kubelet akswin000000 (10.240.0.115): Get https://10.240.0.115:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded
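To confirm the timeout originates at the kubelet rather than inside metrics-server, the same endpoint can be queried through the API server proxy (a diagnostic sketch; the node name is taken from the log above):

# Hits the kubelet's summary endpoint via the API server proxy;
# on an affected node this call hangs until it times out.
kubectl get --raw "/api/v1/nodes/akswin000000/proxy/stats/summary?only_cpu_and_memory=true"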

Kubelet logs show

E1002 00:26:59.894818    4680 remote_runtime.go:495] ListContainerStats with filter &ContainerStatsFilter{Id:,PodSandboxId:,LabelSelector:map[string]string{},} 
from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E1002 00:43:44.801208    4680 handler.go:321] HTTP InternalServerError serving /stats/summary: Internal Error: failed to list pod stats: 
failed to list all container stats: rpc error: code = Unknown desc = operation timeout: context deadline exceeded

and

E1002 00:46:31.844828    4680 remote_runtime.go:495] ListContainerStats with filter &ContainerStatsFilter{Id:,PodSandboxId:,LabelSelector:map[string]string{},} 
from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded

Environment:

  • Kubernetes version (use kubectl version): 1.18.8
  • Cloud provider or hardware configuration: AKS / Azure
  • OS (e.g: cat /etc/os-release): Windows Server 2019
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others: Docker EE version 19.03.11

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 6
  • Comments: 21 (18 by maintainers)

Most upvoted comments

docker/containerd should probably have a flag that allows them to run at high priority as well, but that would likely be separate issues in the respective projects. A possible interim workaround is sketched below.
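As a stopgap on a single node, a node admin could raise the runtime's priority class by hand; a hedged PowerShell sketch, assuming Docker EE (the process name differs for containerd, and the setting does not survive a service restart):

# Raise dockerd's scheduling priority so stats collection still gets CPU
# time when pods saturate the node. Illustrative workaround, not a fix.
$proc = Get-Process -Name dockerd
$proc.PriorityClass = [System.Diagnostics.ProcessPriorityClass]::AboveNormal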

I was able to reproduce the context deadline exceeded error on 1.18.8 during a scale operation: kubectl scale --replicas 9 -n default deployment iis-2019-burn-cpu, where it fully subscribes all CPUs in Allocatable:

kubectl describe node 3654k8s00000000
Capacity:                                                                                       
  cpu:                            2                                           
  ephemeral-storage:              30945276Ki                                                                                                                 
  memory:                         8388148Ki                                                                                                                  
  pods:                           30                                          
Allocatable:                                                                                             
  cpu:                            2                                           
  ephemeral-storage:              30945276Ki                                                                                                                 
  memory:                         6290996Ki
<clip>
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests    Limits
  --------                       --------    ------
  cpu                            2 (100%)    2 (100%)
  memory                         2400m (0%)  6400m (0%)
  ephemeral-storage              0 (0%)      0 (0%)
  attachable-volumes-azure-disk  0           0
k top nodes
NAME                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%     
k8s-agentlinux-36542735-vmss000000   62m          3%     920Mi           12%         
k8s-master-36542735-0                170m         8%     2255Mi          31%         
3654k8s00000000                      <unknown>    <unknown>   <unknown>       <unknown>

Metrics-server logs:

E1020 23:24:22.986388       1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:3654k8s00000000: [unable to get CPU for container "iis" in pod default/iis-2019-burn-cpu-56b7bcf8b5-2ndl4 on node "3654k8s00000000", discarding data: missing cpu usage metric, unable to get CPU for container "iis" in pod default/iis-2019-burn-cpu-56b7bcf8b5-skqhv on node "3654k8s00000000", discarding data: missing cpu usage metric, unable to get CPU for container "iis" in pod default/iis-2019-burn-cpu-56b7bcf8b5-446lb on node "3654k8s00000000", discarding data: missing cpu usage metric, unable to get CPU for container "iis" in pod default/iis-2019-burn-cpu-56b7bcf8b5-zt2x4 on node "3654k8s00000000", discarding data: missing cpu usage metric]
E1020 23:25:27.264682       1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:3654k8s00000000: unable to fetch metrics from Kubelet 3654k8s00000000 (3654k8s00000000): Get https://3654k8s00000000:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded

I do not see “context deadline” in the kubelet logs, but I did find some associated entries:

I1020 23:24:54.460804    5052 kubelet.go:1953] SyncLoop (PLEG): "iis-2019-burn-cpu-56b7bcf8b5-zt2x4_default(10d79830-998d-4664-a95f-3d417d5d4a38)", event: &pleg.PodLifecycleEvent{ID:"10d79830-998d-4664-a95f-3d417d5d4a38", Type:"ContainerStarted", Data:"357eb5a243edf86b91f41b7e84765b07e0e3e402c7f58e2f4f4d8f30eb9a6131"}
E1020 23:25:28.857884    5052 handler.go:309] Error writing response: http2: stream closed
time="2020-10-20T23:25:40Z" level=error msg="failed to waitForNotification: callbackNumber 12470 does not exist in callbackMap"

Once the pods are scheduled, the node remains at 100% CPU and I no longer see errors, but kubectl top nodes is not accurate:

 kubectl top nodes
NAME                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%  
3654k8s00000000                      1200m        60%    1704Mi          27%      

On the node (run after the pod deployment stabilized):

Get-Counter -Counter "\Processor(_Total)\% Processor Time" -SampleInterval 2 -MaxSamples 3

Timestamp                 CounterSamples
---------                 --------------
10/21/2020 3:43:10 PM     \\3654k8s00000000\processor(_total)\% processor time :
                          100


10/21/2020 3:43:46 PM     \\3654k8s00000000\processor(_total)\% processor time :
                          99.9556389707573


10/21/2020 3:43:54 PM     \\3654k8s00000000\processor(_total)\% processor time :
                          100