kubernetes: Windows Kubernetes worker node throws BSoD

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug

/kind feature

What happened: Windows worker nodes crash with a BSoD two or three times a week or more, although the frequency is not constant. The stop code shown on the BSoD is UNEXPECTED_KERNEL_MODE_TRAP, and the module it names is NDIS.sys.

What you expected to happen: Windows worker nodes should run without crashing. No kernel panic occurs when I configure and run multiple Linux Kubernetes worker nodes.

How to reproduce it (as minimally and precisely as possible): We used KOPS to build the Linux nodes, kubenet for the existing Linux cluster, and a Flannel for Windows + L2Bridge configuration for the newly built Windows node.

Anything else we need to know?: The same problem occurred with WinCNI and occurs with Flannel + L2Bridge. I suspect the crash is triggered when an incorrect configuration request is sent to HNS.

Environment:

  • Kubernetes version (use kubectl version): Existing Linux worker nodes are v1.9.4, and Windows worker nodes are v1.10.4.
  • Cloud provider or hardware configuration: AWS EC2
  • OS (e.g. from /etc/os-release): Existing Linux worker nodes are ‘Debian GNU/Linux 9.3 (stretch)’, and Windows worker nodes are ‘Windows Server 1803’.
  • Kernel (e.g. uname -a): Existing Linux worker nodes are ‘Linux ip-x-y-z 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04) x86_64 GNU/Linux’, and Windows worker nodes are ‘10.0.17134.137’.
  • Install tools: Existing Linux worker nodes were built with KOPS, and Windows nodes were installed manually.
  • Others: I have attached the BSoD screenshot. After restarting the instance, I will collect the memory dump and try to analyze it with WinDbg.

(Screenshot: unexpected_kernel_mode_trap_ndis_sys)
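
To put a number on "two or three times a week" while the dump analysis is pending, the crash dumps on a node could be tallied per week. The following is only a minimal sketch: it assumes the node is configured to write minidumps to the default `C:\Windows\Minidump` directory, and `dumps_per_week` is a hypothetical helper name, not part of any tool mentioned in this issue.

```python
from collections import Counter
from datetime import datetime
from pathlib import Path


def dumps_per_week(dump_dir):
    """Count *.dmp files in dump_dir, grouped by the ISO (year, week) of their mtime."""
    counts = Counter()
    for dump in Path(dump_dir).glob("*.dmp"):
        # Use the file's last-modified time as an approximation of the crash time.
        modified = datetime.fromtimestamp(dump.stat().st_mtime)
        iso = modified.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(counts)


if __name__ == "__main__":
    # Assumed default minidump location; adjust if the node is configured differently.
    for (year, week), n in sorted(dumps_per_week(r"C:\Windows\Minidump").items()):
        print(f"{year}-W{week:02d}: {n} crash dump(s)")
```

Grouping by ISO week keeps the output directly comparable to the weekly frequency reported above. File modification time is only a stand-in for the actual crash time.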

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 20 (9 by maintainers)

Most upvoted comments

@rkttu The official KB for this issue was delayed internally, as a complete fix requires changes in other critical components (VFP) plus a subsequent HNS patch. However, if you have a Microsoft support engineer and a business justification, we should be able to give you a private hotfix for Windows Server 1803 earlier than October 16th.

This issue will also require a patch on Windows Server 2019, which we are preparing. Windows Server 2019 contains only one mitigation patch, covering the most common scenario of this issue; to eliminate it in all cases you need another patch, which will be out shortly after release.