amazon-eks-ami: Containers fail to create and probe exec errors related to seccomp on recent kernel-5.10 versions

What happened:

After upgrading EKS nodes from v20230203 to v20230217 on our 1.24 EKS clusters, within a few days a number of the nodes had containers stuck in the ContainerCreating state or liveness/readiness probes reporting the following error:

Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "4a11039f730203ffc003b7e64d5e682113437c8c07b8301771e53c710a6ca6ee": OCI runtime exec failed: exec failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524: unknown

This issue is very similar to https://github.com/awslabs/amazon-eks-ami/issues/1179. However, we had not seen this issue on previous AMIs; it only started to occur on v20230217 (following the upgrade from kernel 5.4 to 5.10), with no other changes to the underlying cluster or workloads.

We tried the suggestion from that issue (sysctl net.core.bpf_jit_limit=452534528), which immediately allowed containers to be created and probes to execute, but after approximately a day the issue returned, and the value reported by cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}' was steadily increasing.
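
For reference, the workaround we applied on an affected node looked roughly like this (a sketch; the limit value is simply the one suggested in issue #1179, not an official recommendation):

# Temporary workaround sketch: raise the BPF JIT limit, then watch the leak grow.
sudo sysctl -w net.core.bpf_jit_limit=452534528

# Total memory currently allocated to BPF JIT (climbs steadily on affected kernels):
sudo grep bpf_jit /proc/vmallocinfo | awk '{s+=$2} END {print s " bytes"}'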

What you expected to happen:

  • Containers to launch successfully and become Ready
  • Liveness and readiness probes to execute successfully

How to reproduce it (as minimally and precisely as possible):

I don’t currently have a reproduction I can share, as the one I have relies on some internal code (I can hopefully produce a more generic one if required when I get a chance).

As a starting point, we only noticed this happening on nodes running pods that have an exec liveness & readiness probe, executed every 10 seconds, which performs a health check against a gRPC service using grpcurl. In addition, we also have a default Pod Security Policy (yes, we know they are deprecated 😄) with the annotation seccomp.security.alpha.kubernetes.io/defaultProfileName: docker/default.

These two conditions seem to be enough to trigger the issue: the value reported by cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}' steadily increases over time until containers can no longer be created on the node.
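
A rough approximation of those conditions, in case it helps anyone investigating (a sketch only; the deployment name, port, and the assumption that grpcurl is available in the container stand in for our internal setup):

# Sketch: mimic a 10-second exec probe doing a grpcurl health check against a pod
# that has the docker/default seccomp profile applied. "grpc-service" and port 8080
# are placeholders.
while true; do
  kubectl exec deploy/grpc-service -- grpcurl -plaintext localhost:8080 grpc.health.v1.Health/Check
  sleep 10
done &

# On the node hosting the pod, sample the bpf_jit allocation total periodically;
# it climbs steadily on affected kernels:
while true; do
  sudo grep bpf_jit /proc/vmallocinfo | awk '{s+=$2} END {print s " bytes"}'
  sleep 60
done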

Anything else we need to know?:

Environment:

  • AWS Region: Multiple
  • Instance Type(s): Mix of x86_64 and arm64 instances of varying sizes
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): "eks.4"
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): "1.24"
  • AMI Version: v20230217
  • Kernel (e.g. uname -a): 5.10.165-143.735.amzn2.x86_64 #1 SMP Wed Jan 25 03:13:54 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-09bffa74b1e396075"
BUILD_TIME="Fri Feb 17 21:59:10 UTC 2023"
BUILD_KERNEL="5.10.165-143.735.amzn2.x86_64"
ARCH="x86_64"

Official Guidance

Kubernetes pods using SECCOMP filtering on EKS optimized AMIs based on Linux Kernel version 5.10.x may get stuck in ContainerCreating state or their liveness/readiness probes fail with the following error:

unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524

When a process with SECCOMP filters creates a child process, the same filters are inherited and applied to the new process. The Amazon Linux 5.10.x kernel versions are affected by a memory leak that occurs when a parent process is terminated while creating a child process. When the total amount of memory allocated for SECCOMP filters exceeds the limit, a process cannot create a new SECCOMP filter. As a result, the parent process fails to create a new child process and the above error message is logged.

This issue is more likely to be encountered with kernel versions kernel-5.10.176-157.645.amzn2 and kernel-5.10.177-158.645.amzn2 where the rate of the memory leak is higher.

Amazon Linux will be releasing the fixed kernel by May 1st, 2023. We will be releasing a new set of EKS AMIs with the updated kernel no later than May 3rd, 2023.
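
One way to see how close a node is to the failure condition described above is to compare the BPF JIT limit with the amount currently allocated (a minimal sketch, to be run on the node):

# The kernel's BPF JIT memory limit, in bytes:
cat /proc/sys/net/core/bpf_jit_limit

# Total BPF JIT memory currently allocated; errno 524 errors start once this approaches the limit:
sudo grep bpf_jit /proc/vmallocinfo | awk '{s+=$2} END {print s}'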

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 37
  • Comments: 54 (19 by maintainers)

Most upvoted comments

The v20230501 release has started now, and it includes 5.10.178-162.673.amzn2.x86_64 for all AMIs that use 5.10 kernels. We have tested the kernel and expect it to resolve this issue for customers. New AMIs should be available in all regions late tonight (PDT).
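
As a quick sanity check after rolling to the new AMIs (a sketch using kubectl; expect 5.10.178-162.673.amzn2 or newer on 5.10-based AMIs):

# Sketch: list the kernel version reported by each node
kubectl get nodes -o custom-columns='NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion'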

We created a new EKS cluster on version 1.24. After that, the error below started to appear while containers were starting up.

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524: unknown

Any plans to revert to the last stable version until AWS finds a fix?

This ☝️

Why is a broken AMI still the default for Amazon’s managed node groups?

Can’t that be backed out or the release pulled?

Yes, it’s available. Folks that manage custom AMIs can start using the kernel and we’re preparing AMIs for release on Wednesday that will include the latest kernel.

Hey guys, appreciate you’re all subbing but think of the people that are already subbed getting all these pointless messages.

If you’re not gonna add any information that’s relevant to the resolution of the issue please refrain from sending another message and just click the subscribe button.

We’re following up on this with our kernel folks; we believe we’ve identified the necessary patches. I’ll update here once we’ve verified and have a kernel build in the pipeline.

FWIW, with eksctl it is possible to pin a previous version with:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: k8s
managedNodeGroups:
  - name: nodegroup
    releaseVersion: 1.24.11-20230406  # or any other from https://github.com/awslabs/amazon-eks-ami/releases
    ...

Of course, this is a very unfortunate bug that renders our nodes unusable within a day or two, even with an increased bpf_jit_limit, and we’re hoping for a quick fix.
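
A hedged sketch of applying a config like the one above (the file name cluster.yaml is an assumption):

# Sketch: create the pinned nodegroup from the ClusterConfig above
eksctl create nodegroup --config-file=cluster.yaml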

@dougbaber thanks! I’ll make sure the ECS team is aware of this issue; any users of recent 5.10 kernel builds would be impacted.

After backporting a1140cb215fa (“seccomp: Move copy_seccomp() to no failure path.”) to our 5.10 kernel, I didn’t see any memleak with @essh 's repro. I will release a new kernel with the commit and post the backport patch to the upstream 5.10 tree as well.

v20230501 is available in all regions now! Update to the latest EKS Optimized AMIs and this issue should be resolved.

The same problem occurs with this setup:

Kernel version: 5.10.176-157.645.amzn2.x86_64
Kubelet version: v1.24.11-eks-a59e1f0

On:

Kernel version: 5.4.226-129.415.amzn2.x86_64
Kubelet version: v1.24.7-eks-fb459a0

works great.

It’s non-trivial to downgrade the kernel downstream when building an AMI based on this upstream EKS node AMI, which is on kernel 5.10.

@stevo-f3 This should do it:

yum versionlock delete kernel           # remove the lock pinning the current kernel package
amazon-linux-extras disable kernel-5.10 # drop the 5.10 extras topic
amazon-linux-extras enable kernel-5.4   # switch back to the 5.4 kernel topic
yum install -y kernel                   # installs 5.4; takes effect after a reboot / AMI rebuild

At present, we have more users needing 5.10 who are not experiencing this leak than those who are; downgrading the official build to 5.4 would be a last resort if we can’t put a fix together.

@borkmann ACK on behalf of @cartermckinnon. Please give us some time to do things…

The 5.4 kernel would not be affected, as it does not seem to have the offending commit 3a15fb6ed92c (“seccomp: release filter after task is fully dead”) which a1140cb215fa (“seccomp: Move copy_seccomp() to no failure path.”) fixes.

Looks like a potentially missing kernel commit in seccomp is causing this issue: a1140cb215fa (“seccomp: Move copy_seccomp() to no failure path.”) (via https://lore.kernel.org/bpf/20230321170925.74358-1-kuniyu@amazon.com/)

Is the kernel fix actually fixing the bug for good, or is it just bumping the default BPF JIT memory limit? Can you provide a link to the patch?

The kernel fix seems to have been released now, as 5.10.178-162.673.amzn2.x86_64.
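
If you need the fix on existing nodes before replacing them with the new AMIs, something along these lines should work, since the EKS AMI version-locks the kernel package (a sketch; rolling to the v20230501 AMIs is the cleaner option):

# Sketch: update an existing AL2 node to the fixed kernel; the node must be rebooted afterwards
sudo yum versionlock delete kernel
sudo yum update -y kernel
sudo reboot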

I’ve switched to Bottlerocket