Flatcar: Flatcar Container Linux fails and reboots: "kernel BUG at net/core/skbuff.c"

Description

On AWS EC2 instances using Flatcar Container Linux versions 2765.1.0 and 2801.1.0 from the Beta channel, with a kOps-provisioned Kubernetes installation on top, we encounter a kernel bug that causes the machines to stop and reboot immediately.

The log entries in journalctl appear as follows:

Apr 06 17:59:41 ip-10-2-1-63.eu-west-1.compute.internal kernel: ------------[ cut here ]------------
Apr 06 17:59:41 ip-10-2-1-63.eu-west-1.compute.internal kernel: kernel BUG at net/core/skbuff.c:4008!
-- Boot 8314fb086d5b4ed0a9e80895ab0c4f0b --
Apr 06 17:59:59 localhost kernel: Linux version 5.10.25-flatcar (build@pony-truck.infra.kinvolk.io) (x86_64-cros-linux-gnu-gcc (Gentoo Hardened 9.3.0-r1 p3) 9.3.0, GNU ld (Gentoo 2.35 p1) 2.35.0) #1 SMP Wed Mar 24 14:51:21 ->

Sometimes the line number in file net/core/skbuff.c is 3996 instead of 4008. Usually we see 3996 cited first; after the machine reboots, we see 4008 from then on, suggesting that the reboot swapped some updated files into place.
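
To see how often each line number shows up, the journal output can be tallied. The following is a minimal sketch, not something we ran as part of this report; the function name count_skbuff_bugs and the script file name are hypothetical, and the pattern assumes the message format shown in the log excerpt above.

```python
import re
import sys
from collections import Counter

# Matches the kernel's BUG banner for skbuff.c and captures the source line number.
BUG_PATTERN = re.compile(r"kernel BUG at net/core/skbuff\.c:(\d+)!")


def count_skbuff_bugs(journal_text: str) -> Counter:
    """Count 'kernel BUG' hits per skbuff.c line number in journal output."""
    return Counter(int(m.group(1)) for m in BUG_PATTERN.finditer(journal_text))


if __name__ == "__main__":
    # Reads journal text from stdin and prints one summary line per line number.
    for line_no, hits in sorted(count_skbuff_bugs(sys.stdin.read()).items()):
        print(f"net/core/skbuff.c:{line_no}: {hits} occurrence(s)")
```

One way to feed it would be: journalctl --no-pager --grep=skbuff | python3 count_skbuff_bugs.py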

Note that we have locksmithd disabled, but update-engine is enabled, so we’re downloading updates but not putting them into use eagerly.

Impact

Our fleet of Kubernetes cluster machines reboot periodically, causing the containers running on them to exit without warning and be replaced (in most cases) by the kubelet after a short delay.

Environment and steps to reproduce

  1. Set-up:
  • AWS EC2 in the “eu-west-1” region, though we’ve seen a few of these failures in the “us-east-2” region as well.
  • Instance types we’ve seen fail:
    • m5.xlarge
    • m5.2xlarge
    • m5.4xlarge
    • m5a.2xlarge
    • c5.xlarge
  • Cluster provisioned by kOps version 1.19.1
  • Kubernetes versions 1.19.8 and 1.19.9
  • Cluster CNI: Calico version 3.17.3 and 3.18.1
  2. Task:
  • The machine is running Kubernetes in either a control plane or worker node role.
  • We have not seen this failure occur on bastion machines (instance type t3.micro) that don’t run any Kubernetes components.
  3. Action(s):
    a. Launch an EC2 instance using Flatcar Container Linux, perhaps via a supervising ASG.
    b. Allow various Kubernetes components to start (e.g. kubelet, CNI daemons).
    c. Periodically check the machine’s last boot time.
    d. Inspect system logs with a command like journalctl --grep=skbuff.
  4. Error:
    The machine will hum along normally, downloading updates occasionally, and running containers for Kubernetes workloads. With no warning, the machine will reboot. Subsequent inspection of the log via journalctl shows a message like this:
kernel: kernel BUG at net/core/skbuff.c:3996!

One variation:

kernel: kernel BUG at net/core/skbuff.c:4008!
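
The check in the steps above (spot a reboot, then look for the BUG line that preceded it) can be sketched as a small parser over journalctl output. This is an illustrative sketch under stated assumptions: it relies on the "-- Boot <id> --" marker format seen in the log excerpt earlier, and the function name boots_after_bug is hypothetical.

```python
import re

# Boot separator as it appears in the journalctl excerpt in this report.
BOOT_MARKER = re.compile(r"^-- Boot ([0-9a-f]+) --$")
# Generic kernel BUG banner: captures the source file and the line number.
BUG_LINE = re.compile(r"kernel BUG at (\S+):(\d+)!")


def boots_after_bug(journal_text):
    """Return (boot_id, bug_location) for each boot whose preceding boot logged a kernel BUG."""
    results = []
    pending = None  # most recent BUG location seen since the last boot marker
    for line in journal_text.splitlines():
        boot = BOOT_MARKER.match(line.strip())
        if boot:
            if pending is not None:
                results.append((boot.group(1), pending))
            pending = None
            continue
        bug = BUG_LINE.search(line)
        if bug:
            pending = f"{bug.group(1)}:{bug.group(2)}"
    return results
```

Run against the excerpt at the top of this report, it would pair boot 8314fb086d5b4ed0a9e80895ab0c4f0b with net/core/skbuff.c:4008. Boots that follow a clean shutdown produce no entry.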

After the machine boots, the /sys/fs/pstore directory mentioned here exists, but is empty. The “pstore” mount entry is as follows:

pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime,seclabel)

Perhaps our hardware does not support pstore, per the following uname -a output:

Linux ip-10-2-1-63.eu-west-1.compute.internal 5.10.25-flatcar #1 SMP Wed Mar 24 14:51:21 -00 2021 x86_64 Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz GenuineIntel GNU/Linux
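
To tell apart "no pstore backend registered" from "a backend is registered but nothing was saved", a quick inspection like the following may help. It is a sketch only: the function name pstore_status is hypothetical, and reading /sys/module/pstore/parameters/backend is an assumption about where the kernel exposes the active backend name, which we have not verified on these instances.

```python
import os


def pstore_status(pstore_dir="/sys/fs/pstore",
                  backend_param="/sys/module/pstore/parameters/backend"):
    """Summarise whether pstore reports a backend and whether any records were saved."""
    backend = None
    try:
        with open(backend_param) as f:
            value = f.read().strip()
        # An unset string parameter is assumed to read back as "(null)" or empty.
        if value not in ("", "(null)"):
            backend = value
    except OSError:
        pass  # parameter file missing or unreadable: treat as no backend
    try:
        records = sorted(os.listdir(pstore_dir))
    except OSError:
        records = []  # directory missing or unreadable
    return {"backend": backend, "records": records}


if __name__ == "__main__":
    print(pstore_status())
```

An empty "records" list together with a missing backend would be consistent with the guess above that nothing on these machines is able to persist the crash log, though it does not prove it.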

Expected behavior

The machine should continue running normally without encountering errors that cause it to reboot without warning.

Additional information

We run similar Kubernetes clusters in several other AWS regions:

  • ap-northeast-1
  • ap-southeast-1
  • us-west-2

We have not seen this failure occur in those regions. We see it predominantly in “eu-west-1” and occasionally in “us-east-2.” That could be due to more intense workload in the clusters in the former region.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 45 (13 by maintainers)

Most upvoted comments

The patch is queued up in netdev/next - as soon as it lands in Linus’ tree it can be submitted to stable. https://lore.kernel.org/netdev/166753501670.4086.1819802414418539212.git-patchwork-notify@kernel.org/#t

We’ve been using this fix for about six weeks now without noticing any of these failures occurring. I consider this problem to be fixed. Thank you for all of your help with this one. It was quite a journey.

This patch is in 5.15.79, which is in beta as of yesterday (3417.1.0).
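
For anyone verifying a node, a rough check of the running kernel against that version could look like the sketch below. The helper names are hypothetical, and the check only covers the 5.15.y series mentioned here; it makes no claim about which releases of other kernel series carry the patch.

```python
import platform
import re

# First 5.15.y release reported in this thread to contain the patch.
FIXED_IN_5_15 = (5, 15, 79)


def kernel_version(release: str) -> tuple:
    """Parse the leading 'major.minor.patch' from a release string such as '5.10.25-flatcar'."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in m.groups())


def has_5_15_fix(release: str) -> bool:
    """True only if the release is on the 5.15.y series at or beyond 5.15.79."""
    version = kernel_version(release)
    return version[:2] == (5, 15) and version >= FIXED_IN_5_15


if __name__ == "__main__":
    release = platform.release()  # same string that uname -r prints
    print(release, "has 5.15.y fix:", has_5_15_fix(release))
```

The function deliberately returns False for kernels outside 5.15.y rather than guessing, so a False result on another series means "unknown", not "unpatched".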

@seh, want to verify and then we’ll close this issue at last?

@jepio has built patched images: https://bincache.flatcar-linux.net/images/amd64/3346.1.99+issue-378-fix/ and @seh is testing them. This may also be interesting for others following along.

Just to make sure we’re following along on this side, did you all see Jiri’s candidate patch that he mentioned in https://github.com/projectcalico/calico/issues/6865#issuecomment-1286936333?

Same issue: kernel BUG at net/core/skbuff.c:4082 on Red Hat Enterprise Linux release 8.6 (Ootpa) with 4.18.0-372.26.1.el8_6.x86_64

Calico’s eBPF data plane enabled. I agree with you @seh

Thank you for the suggestion. Yes, we’ve been testing Calico version 3.23.3 over the last couple of days together with Flatcar’s beta version 3346.1.0. So far, we haven’t been hitting this kernel bug. I’ll have more confidence after another day or two of testing.