kubernetes: Windows Kubernetes - Kube-Proxy panics/exits
What happened?
Kube-proxy version: Kubernetes v1.24.10-eks-7abdb2b (AWS EKS 1.24)
runtime.fatalpanic (panic.go:1141) runtime
runtime.gopanic (panic.go:987) runtime
runtime.panicmem (panic.go:260) runtime
runtime.sigpanic (signal_windows.go:257) runtime
winkernel.(*Proxier).syncProxyRules (proxier.go:1495) k8s.io/kubernetes/pkg/proxy/winkernel
winkernel.(*Proxier).OnServiceSynced (proxier.go:1029) k8s.io/kubernetes/pkg/proxy/winkernel
config.(*ServiceConfig).Run (config.go:325) k8s.io/kubernetes/pkg/proxy/config
app.(*ProxyServer).Run.func3 (server.go:758) k8s.io/kubernetes/cmd/kube-proxy/app
runtime.goexit (asm_amd64.s:1598) runtime
app.(*ProxyServer).Run (server.go:758) k8s.io/kubernetes/cmd/kube-proxy/app
I pulled the release-1.24 branch locally and built it with debug symbols so I could run it on the same instance.
The panic appears to come from (*Proxier).syncProxyRules (winkernel/proxier.go, line 1495):
the hns.getLoadBalancer call is made, but the returned load balancer is nil.
getLoadBalancer doesn't return an error and instead returns nil, so the field access in the next assignment (svcInfo.hnsID = hnsLoadBalancer.hnsID) dereferences a nil pointer and blows up.
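For illustration, here's a minimal, self-contained sketch of that failure pattern. This is not the upstream winkernel code verbatim; the loadBalancerInfo type and the getLoadBalancer stub are stand-ins I've named just for this example:

package main

import "fmt"

// Hypothetical stand-in for the HNS load balancer type; the name is an
// assumption made for this sketch, not the real kube-proxy identifier.
type loadBalancerInfo struct{ hnsID string }

// getLoadBalancer mimics the behaviour observed on the node: it reports
// no error, but also returns no load balancer.
func getLoadBalancer() (*loadBalancerInfo, error) {
    return nil, nil
}

func main() {
    lb, err := getLoadBalancer()
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    // This is the guard that appears to be missing around proxier.go:1495:
    // checking err alone isn't enough, because a nil result with a nil error
    // means the very next field access (lb.hnsID) panics, which is what takes
    // the kube-proxy service down.
    if lb == nil {
        fmt.Println("getLoadBalancer returned nil with no error; skipping this service")
        return
    }
    fmt.Println("hnsID:", lb.hnsID)
}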

Running an older Docker-based AMI is fine (kube-proxy Kubernetes v1.23.16-eks-7abdb2b), in that it doesn't panic and quit. To isolate it a bit more, I copied the kube-proxy binary off the EKS 1.23 image onto the EKS 1.24 node, and that works fine too (ruling out the Windows machines themselves being the issue).
Attachments: kubelet-config-json.txt, delv_log.txt, kubeconfig.txt, goroutines.txt, Kube-Proxy Event Logs, EKS23-working-kube-proxy.log, EKS24-failing-kube-proxy.log, EKS24-running-kube-proxy23.log
What did you expect to happen?
Kube-proxy should never panic, as far as I'm aware. The service panics this way repeatedly, every time without fail.
How can we reproduce it (as minimally and precisely as possible)?
Not sure, to be honest; but if you require more logs etc., I have the logs collected by the Microsoft SDN debug scripts for the instance (from https://github.com/microsoft/SDN/blob/master/Kubernetes/windows/debug/Debug.md) that I can send on request (they're verbose as heck).
It's an EKS 1.24 cluster in Frankfurt. The Datadog install is a standard Helm affair running on the Linux nodes alone.
helm repo add datadog-direct https://helm.datadoghq.com
helm repo update
helm upgrade --install datadog-linux datadog-direct/datadog --namespace datadog -f datadog_agent_nix.yaml
Contents of datadog_agent_nix.yaml:
registry: public.ecr.aws/datadog
targetSystem: "linux"
datadog:
  apiKey: xx
  appKey: xx
  clusterName: aws-eks-xx
  logLevel: error
  nonLocalTraffic: false
  site: datadoghq.eu
  kubeStateMetricsCore:
    enabled: false
  kubeStateMetricsEnabled: false
  orchestratorExplorer:
    enabled: false
  clusterChecksRunner:
    enabled: false
  dogstatsd:
    port: 8125
    nonLocalTraffic: true
    useHostPort: true
    useSocketVolume: true
  ignoreAutoConfig: true
clusterAgent:
  replicas: 2
  createPodDisruptionBudget: true
agents:
  enabled: true
  useConfigMap: false
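To sanity-check that the Datadog pods really do land on the Linux nodes alone (the datadog namespace comes from the helm command above; pod names will vary per install), something like this is enough:
kubectl get pods -n datadog -o wide
The NODE column should only list Linux nodes; the Windows nodes where kube-proxy is panicking have no Datadog pods scheduled on them.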
Anything else we need to know?
No response
Kubernetes version
v1.24.10-eks-7abdb2b (EKS 1.24)
Cloud provider
AWS EKS
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, …) and versions (if applicable)
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 18 (11 by maintainers)
OK so… sounds like we need to push this cherry-pick to 1.25.x as well?
I'm going to close this: for folks wanting the up-to-date kube-proxy Windows code without this bug, make sure to upgrade to 1.25.8/1.24.12!
OK, I see; these commits will be in 1.25.8 and 1.24.12.
So pull the latest kube-proxy from those releases and you'll have these fixes!
^ we're going to assume 1.24.12 fixes this for now… let us know, thanks @ChrisMcKee!