kubernetes: /stats/summary endpoints failing with "Context Deadline Exceeded" on Windows nodes with very high CPU usage

What happened: Windows nodes running with high (>90%) CPU usage are having calls to /stats/summary time out with “Context Deadline Exceeded”.

When this happens, metrics-server cannot gather accurate CPU/memory usage for the node or the pods running on it, which results in:

  • HPA not functioning
  • kubectl top node/pod returning <unknown> instead of values

What you expected to happen: Windows nodes should be able to run at high CPU utilization without negatively impacting Kubernetes functionality.

How to reproduce it (as minimally and precisely as possible):

  1. Create a cluster with Windows Server 2019 nodes
  2. Schedule CPU-intensive workloads onto the cluster (a sketch of such a workload follows this list)
  3. Scale the workload until the node reaches a sustained high-CPU state
  4. Query node metrics with kubectl top node
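A minimal sketch of such a workload, assuming a Windows Server 2019 node pool; the deployment name, image tag, and PowerShell busy-loop are illustrative, not the exact workload used in this report:

# Hypothetical CPU-burn deployment for Windows nodes (illustrative only).
# Scale replicas until CPU requests consume the node's full Allocatable.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: burn-cpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: burn-cpu
  template:
    metadata:
      labels:
        app: burn-cpu
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      containers:
      - name: burn
        image: mcr.microsoft.com/windows/servercore:ltsc2019
        # Busy-loop that pins one core; run one replica per allocatable CPU.
        command: ["powershell", "-Command", "while ($true) {}"]
        resources:
          requests:
            cpu: "1"
          limits:
            cpu: "1"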

Anything else we need to know?: This issue appears to be easier to reproduce on 1.18.x nodes than on 1.16.x or 1.17.x nodes.

metrics-server logs show

E1020 10:58:24.539304 1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:akswin000000: unable to fetch metrics from Kubelet akswin000000 (10.240.0.115): Get https://10.240.0.115:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded
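To confirm the timeout originates at the kubelet rather than inside metrics-server, the same endpoint can be queried through the API server proxy (a diagnostic sketch; the node name is taken from the log above):

# Hits the kubelet's summary endpoint via the API server proxy;
# on an affected node this call hangs until it times out.
kubectl get --raw "/api/v1/nodes/akswin000000/proxy/stats/summary?only_cpu_and_memory=true"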

Kubelet logs show

E1002 00:26:59.894818    4680 remote_runtime.go:495] ListContainerStats with filter &ContainerStatsFilter{Id:,PodSandboxId:,LabelSelector:map[string]string{},} 
from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E1002 00:43:44.801208    4680 handler.go:321] HTTP InternalServerError serving /stats/summary: Internal Error: failed to list pod stats: 
failed to list all container stats: rpc error: code = Unknown desc = operation timeout: context deadline exceeded

and

E1002 00:46:31.844828    4680 remote_runtime.go:495] ListContainerStats with filter &ContainerStatsFilter{Id:,PodSandboxId:,LabelSelector:map[string]string{},} 
from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded

Environment:

  • Kubernetes version (use kubectl version): 1.18.8
  • Cloud provider or hardware configuration: AKS / Azure
  • OS (e.g: cat /etc/os-release): Windows Server 2019
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others: Docker EE version 19.03.11

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 6
  • Comments: 21 (18 by maintainers)

Most upvoted comments

docker/containerd should probably have a flag that allows them to run at high priority as well, but that would likely be separate issues in the respective projects. A possible interim workaround is sketched below.
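As a stopgap on a single node, a node admin could raise the runtime's priority class by hand; a hedged PowerShell sketch, assuming Docker EE (the process name differs for containerd, and the setting does not survive a service restart):

# Raise dockerd's scheduling priority so stats collection still gets CPU
# time when pods saturate the node. Illustrative workaround, not a fix.
$proc = Get-Process -Name dockerd
$proc.PriorityClass = [System.Diagnostics.ProcessPriorityClass]::AboveNormal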

I was able to reproduce the context deadline exceeded error on 1.18.8 during a scale operation: kubectl scale --replicas 9 -n default deployment iis-2019-burn-cpu, where it fully subscribes all CPUs in Allocatable:

kubectl describe node 3654k8s00000000
Capacity:                                                                                       
  cpu:                            2                                           
  ephemeral-storage:              30945276Ki                                                                                                                 
  memory:                         8388148Ki                                                                                                                  
  pods:                           30                                          
Allocatable:                                                                                             
  cpu:                            2                                           
  ephemeral-storage:              30945276Ki                                                                                                                 
  memory:                         6290996Ki
<clip>
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests    Limits
  --------                       --------    ------
  cpu                            2 (100%)    2 (100%)
  memory                         2400m (0%)  6400m (0%)
  ephemeral-storage              0 (0%)      0 (0%)
  attachable-volumes-azure-disk  0           0
k top nodes
NAME                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%     
k8s-agentlinux-36542735-vmss000000   62m          3%     920Mi           12%         
k8s-master-36542735-0                170m         8%     2255Mi          31%         
3654k8s00000000                      <unknown>    <unknown>   <unknown>       <unknown>

Metrics-server logs:

E1020 23:24:22.986388       1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:3654k8s00000000: [unable to get CPU for container "iis" in pod default/iis-2019-burn-cpu-56b7bcf8b5-2ndl4 on node "3654k8s00000000", discarding data: missing cpu usage metric, unable to get CPU for container "iis" in pod default/iis-2019-burn-cpu-56b7bcf8b5-skqhv on node "3654k8s00000000", discarding data: missing cpu usage metric, unable to get CPU for container "iis" in pod default/iis-2019-burn-cpu-56b7bcf8b5-446lb on node "3654k8s00000000", discarding data: missing cpu usage metric, unable to get CPU for container "iis" in pod default/iis-2019-burn-cpu-56b7bcf8b5-zt2x4 on node "3654k8s00000000", discarding data: missing cpu usage metric]
E1020 23:25:27.264682       1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:3654k8s00000000: unable to fetch metrics from Kubelet 3654k8s00000000 (3654k8s00000000): Get https://3654k8s00000000:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded

I do not see “context deadline” in the kubelet logs, but I did find some associated entries:

I1020 23:24:54.460804    5052 kubelet.go:1953] SyncLoop (PLEG): "iis-2019-burn-cpu-56b7bcf8b5-zt2x4_default(10d79830-998d-4664-a95f-3d417d5d4a38)", event: &pleg.PodLifecycleEvent{ID:"10d79830-998d-4664-a95f-3d417d5d4a38", Type:"ContainerStarted", Data:"357eb5a243edf86b91f41b7e84765b07e0e3e402c7f58e2f4f4d8f30eb9a6131"}
E1020 23:25:28.857884    5052 handler.go:309] Error writing response: http2: stream closed
time="2020-10-20T23:25:40Z" level=error msg="failed to waitForNotification: callbackNumber 12470 does not exist in callbackMap"

Once the pods are scheduled, the node remains at 100% CPU and I no longer see errors, but kubectl top nodes is not accurate:

 kubectl top nodes
NAME                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%  
3654k8s00000000                      1200m        60%    1704Mi          27%      

On the node (run after the pod deployment stabilized):

Get-Counter -Counter "\Processor(_Total)\% Processor Time" -SampleInterval 2 -MaxSamples 3

Timestamp                 CounterSamples
---------                 --------------
10/21/2020 3:43:10 PM     \\3654k8s00000000\processor(_total)\% processor time :
                          100


10/21/2020 3:43:46 PM     \\3654k8s00000000\processor(_total)\% processor time :
                          99.9556389707573


10/21/2020 3:43:54 PM     \\3654k8s00000000\processor(_total)\% processor time :
                          100