kubernetes: Pod garbage collector fails to clean up Pods on nodes that no longer exist

What happened?

This started happening after upgrading from Kubernetes 1.24 to 1.26.

The sequence of events we observed:

  • Our workloads scale up via HorizontalPodAutoscaler (HPA) as traffic increases
  • Cluster Autoscaler provisions new nodes to accommodate the new replicas
  • Traffic goes down
  • The HPA scales the replicas back down
  • Cluster Autoscaler notices nodes are below the usage threshold and that their Pods can be accommodated on other nodes
  • The nodes are drained and scaled down
  • Kubernetes still reports Pods in a “Running” or “Terminating” state on a node that no longer exists
  • The Kubernetes control plane reports “Orphan pods” in its audit logs

What did you expect to happen?

We expected the Pod garbage collector to clean up these Pods once the node is gone.

How can we reproduce it (as minimally and precisely as possible)?

Deploy the two Deployment manifests below into a Kubernetes 1.26 cluster and terminate one of the nodes running their Pods (a scripted sketch of the termination step follows the manifests).

(Duplicated port key “containerPort + protocol”)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app.kubernetes.io/name: nginx
    app.kubernetes.io/instance: nginx
    app.kubernetes.io/component: deployment
spec:
  replicas: 100
  selector:
    matchLabels:
      app.kubernetes.io/name: nginx
      app.kubernetes.io/instance: nginx
      app.kubernetes.io/component: deployment
  template:
    metadata:
      labels:
        app.kubernetes.io/name: nginx
        app.kubernetes.io/instance: nginx
        app.kubernetes.io/component: deployment
    spec:
      restartPolicy: Always
      containers:
        - name: nginx
          image: "nginx:latest"
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
            - containerPort: 8080
              name: metrics
              protocol: TCP

(Duplicated environment variable. Set twice by mistake)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-2
  labels:
    app.kubernetes.io/name: nginx-2
    app.kubernetes.io/instance: nginx-2
    app.kubernetes.io/component: deployment
spec:
  replicas: 100
  selector:
    matchLabels:
      app.kubernetes.io/name: nginx-2
      app.kubernetes.io/instance: nginx-2
      app.kubernetes.io/component: deployment
  template:
    metadata:
      labels:
        app.kubernetes.io/name: nginx-2
        app.kubernetes.io/instance: nginx-2
        app.kubernetes.io/component: deployment
    spec:
      restartPolicy: Always
      containers:
        - name: nginx-2
          image: "nginx:latest"
          imagePullPolicy: IfNotPresent
          env:
            - name: MY_VAR
              value: value-1
            - name: MY_VAR
              value: value-2
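
One way to script the termination step (a sketch, not from the original report; it assumes cluster-admin access, the official Python client, and a real node name substituted for the placeholder): either terminate the instance out of band, or delete the Node object as a rough simulation, then watch which Pods stay behind.

# Sketch only: simulate "the node is gone" by deleting the Node object, then
# list the Pods still bound to it. NODE_NAME is a placeholder.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

NODE_NAME = "ip-10-0-1-23.eu-west-1.compute.internal"  # placeholder, pick a real node

# Removing the Node object mimics the instance disappearing; if the instance is
# actually still running, its kubelet may re-register the node, so terminating
# the instance itself is the faithful reproduction.
v1.delete_node(NODE_NAME)

# With the duplicate-key specs above, PodGC's status PATCH fails and these Pods
# keep showing up as "Running"/"Terminating" on the missing node.
for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    if pod.spec.node_name == NODE_NAME:
        print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)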

Anything else we need to know?

(Duplicated port error)

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "RequestResponse",
  "auditID": "9d3c7dbf-f599-422b-866c-84d52f3b1a22",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/namespaces/app-b/pods/app-b-5894548cb-7tssd/status?fieldManager=PodGC\u0026force=true",
  "verb": "patch",
  "user":
    {
      "username": "system:serviceaccount:kube-system:pod-garbage-collector",
      "uid": "f099fed7-6a3d-4a3b-bc1b-49c668276d76",
      "groups":
        [
          "system:serviceaccounts",
          "system:serviceaccounts:kube-system",
          "system:authenticated",
        ],
    },
  "sourceIPs": ["172.16.38.214"],
  "userAgent": "kube-controller-manager/v1.26.4 (linux/amd64) kubernetes/4a34796/system:serviceaccount:kube-system:pod-garbage-collector",
  "objectRef":
    {
      "resource": "pods",
      "namespace": "app-b",
      "name": "app-b-5894548cb-7tssd",
      "apiVersion": "v1",
      "subresource": "status",
    },
  "responseStatus":
    {
      "metadata": {},
      "status": "Failure",
      "message": 'failed to create manager for existing fields: failed to convert new object (app-b/app-b-5894548cb-7tssd; /v1, Kind=Pod) to smd typed: .spec.containers[name="app-b"].ports: duplicate entries for key [containerPort=8082,protocol="TCP"]',
      "code": 500,
    },
  "requestObject":
    {
      "kind": "Pod",
      "apiVersion": "v1",
      "metadata":
        {
          "name": "app-b-5894548cb-7tssd",
          "namespace": "app-b",
        },
      "status":
        {
          "phase": "Failed",
          "conditions":
            [
              {
                "type": "DisruptionTarget",
                "status": "True",
                "lastTransitionTime": "2023-05-23T17:00:55Z",
                "reason": "DeletionByPodGC",
                "message": "PodGC: node no longer exists",
              },
            ],
        },
    },
  "responseObject":
    {
      "kind": "Status",
      "apiVersion": "v1",
      "metadata": {},
      "status": "Failure",
      "message": 'failed to create manager for existing fields: failed to convert new object (app-b/app-b-5894548cb-7tssd; /v1, Kind=Pod) to smd typed: .spec.containers[name="app-b"].ports: duplicate entries for key [containerPort=8082,protocol="TCP"]',
      "code": 500,
    },
  "requestReceivedTimestamp": "2023-05-23T17:00:55.648887Z",
  "stageTimestamp": "2023-05-23T17:00:55.652513Z",
  "annotations":
    {
      "authorization.k8s.io/decision": "allow",
      "authorization.k8s.io/reason": 'RBAC: allowed by ClusterRoleBinding "system:controller:pod-garbage-collector" of ClusterRole "system:controller:pod-garbage-collector" to ServiceAccount "pod-garbage-collector/kube-system"',
    },
}

(Duplicated env var error)

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "RequestResponse",
  "auditID": "9ffc9212-3d74-4b86-98bb-5e6f0c5395b1",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/namespaces/app-a/pods/app-a-7b7ddc5874-c85hq/status?fieldManager=PodGC\u0026force=true",
  "verb": "patch",
  "user":
    {
      "username": "system:serviceaccount:kube-system:pod-garbage-collector",
      "uid": "f099fed7-6a3d-4a3b-bc1b-49c668276d76",
      "groups":
        [
          "system:serviceaccounts",
          "system:serviceaccounts:kube-system",
          "system:authenticated",
        ],
    },
  "sourceIPs": ["172.16.38.214"],
  "userAgent": "kube-controller-manager/v1.26.4 (linux/amd64) kubernetes/4a34796/system:serviceaccount:kube-system:pod-garbage-collector",
  "objectRef":
    {
      "resource": "pods",
      "namespace": "app-a",
      "name": "app-a-7b7ddc5874-c85hq",
      "apiVersion": "v1",
      "subresource": "status",
    },
  "responseStatus":
    {
      "metadata": {},
      "status": "Failure",
      "message": "failed to create manager for existing fields: failed to convert new object (app-a/app-a-7b7ddc5874-c85hq; /v1, Kind=Pod) to smd typed: errors:\n .spec.containers[name=\"app-a\"].env: duplicate entries for key [name=\"RABBITMQ_HOST\"]\n .spec.containers[name=\"app-a\"].env: duplicate entries for key [name=\"RABBITMQ_PORT\"]\n .spec.initContainers[name=\"db-migration\"].env: duplicate entries for key [name=\"RABBITMQ_HOST\"]\n .spec.initContainers[name=\"db-migration\"].env: duplicate entries for key [name=\"RABBITMQ_PORT\"]",
      "code": 500,
    },
  "requestObject":
    {
      "kind": "Pod",
      "apiVersion": "v1",
      "metadata":
        {
          "name": "app-a-7b7ddc5874-c85hq",
          "namespace": "app-a",
        },
      "status":
        {
          "phase": "Failed",
          "conditions":
            [
              {
                "type": "DisruptionTarget",
                "status": "True",
                "lastTransitionTime": "2023-05-23T17:00:55Z",
                "reason": "DeletionByPodGC",
                "message": "PodGC: node no longer exists",
              },
            ],
        },
    },
  "responseObject":
    {
      "kind": "Status",
      "apiVersion": "v1",
      "metadata": {},
      "status": "Failure",
      "message": "failed to create manager for existing fields: failed to convert new object (app-a/app-a-7b7ddc5874-c85hq; /v1, Kind=Pod) to smd typed: errors:\n .spec.containers[name=\"app-a\"].env: duplicate entries for key [name=\"RABBITMQ_HOST\"]\n .spec.containers[name=\"app-a\"].env: duplicate entries for key [name=\"RABBITMQ_PORT\"]\n .spec.initContainers[name=\"db-migration\"].env: duplicate entries for key [name=\"RABBITMQ_HOST\"]\n .spec.initContainers[name=\"db-migration\"].env: duplicate entries for key [name=\"RABBITMQ_PORT\"]",
      "code": 500,
    },
  "requestReceivedTimestamp": "2023-05-23T17:00:55.632119Z",
  "stageTimestamp": "2023-05-23T17:00:55.637338Z",
  "annotations":
    {
      "authorization.k8s.io/decision": "allow",
      "authorization.k8s.io/reason": 'RBAC: allowed by ClusterRoleBinding "system:controller:pod-garbage-collector" of ClusterRole "system:controller:pod-garbage-collector" to ServiceAccount "pod-garbage-collector/kube-system"',
    },
}

Kubernetes version

$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.1", GitCommit:"e4d4e1ab7cf1bf15273ef97303551b279f0920a9", GitTreeState:"clean", BuildDate:"2022-09-14T19:49:27Z", GoVersion:"go1.19.1", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26+", GitVersion:"v1.26.4-eks-0a21954", GitCommit:"4a3479673cb6d9b63f1c69a67b57de30a4d9b781", GitTreeState:"clean", BuildDate:"2023-04-15T00:33:09Z", GoVersion:"go1.19.8", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

AWS EKS

OS version

# On Linux:
$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"

$ uname -a
Linux ip-x-x-x-x.region.compute.internal 5.15.108 #1 SMP Tue May 9 23:54:26 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Install tools

Cluster Autoscaler

Container runtime (CRI) and version (if applicable)

containerd

Related plugins (CNI, CSI, …) and versions (if applicable)

CNI: Cilium 1.11
CSI: AWS EBS CSI

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 39
  • Comments: 59 (49 by maintainers)

Most upvoted comments

I’m working on this right now; the change is full of subtle nuances that I’m trying to work through. I should be done soon, and it will definitely be merged for the next release, hopefully with a lot of time left for you to address the bugs it’s causing.

I have opened a PR that fixes this issue for PodGC by dropping SSA from the controller: https://github.com/kubernetes/kubernetes/pull/121103

Fixes have been cherry-picked and will be in 1.28.4 / 1.27.8 / 1.26.11.

We recently ran into this issue with a pod that had a duplicate environment variable on one of our first 1.27 test servers (our Helm chart has various ways to “calculate” env vars or let users customize them). We shut down the nodes overnight to reduce cost, and in the morning we found those pods stuck in Terminating state. The API server logs indicated the root cause and brought us here. Aside from the actual bug, I think the handling of environment variables should be reconsidered. From an API perspective they are a list (I guess there was a reason for this when the PodSpec was designed), but in the end they become a map, so there are basically two options: always raise an error when env vars (or ports, or any other “this could have been a map” list entry) are duplicated, forcing users to resolve the issue rather than ignore it like we did (we didn’t even do so consciously; GitOps tools don’t care about warnings), or at least define a well-known order of precedence (e.g. “last specified value wins”).
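
Until the validation or SSA behavior changes, duplicates like these can at least be caught before they reach the cluster. Below is a minimal client-side check (a sketch, not part of Kubernetes or any Helm tooling; the file name and helper are made up, and it assumes PyYAML) that could run in CI against rendered manifests.

# duplicate_key_check.py (hypothetical helper): scan rendered manifests for the
# duplicated env-var names and container ports that break server-side apply.
import sys
from collections import Counter

import yaml


def duplicate_keys(pod_spec):
    """Return human-readable descriptions of duplicated env/port keys."""
    problems = []
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    for c in containers:
        env_counts = Counter(e["name"] for e in c.get("env", []))
        for name, n in env_counts.items():
            if n > 1:
                problems.append(f'container "{c["name"]}": env var "{name}" set {n} times')
        port_counts = Counter(
            (p["containerPort"], p.get("protocol", "TCP")) for p in c.get("ports", [])
        )
        for (port, proto), n in port_counts.items():
            if n > 1:
                problems.append(f'container "{c["name"]}": port {port}/{proto} listed {n} times')
    return problems


for path in sys.argv[1:]:
    with open(path) as f:
        for doc in yaml.safe_load_all(f):
            if not doc:
                continue
            # Handles bare Pods as well as workload objects with a pod template.
            spec = doc.get("spec", {}).get("template", {}).get("spec") or doc.get("spec", {})
            for problem in duplicate_keys(spec):
                print(f"{path}: {problem}")

Run against the rendered output, a check like this would flag both the duplicated MY_VAR env var and the duplicated 8080/TCP port in the reproduction manifests above.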

Any idea when a fix will be available for 1.26?

We’ve created a CronJob that deletes pods stuck in Terminating (such pods have deletion_timestamp set in metadata) to keep the cluster clean. But the issue that impacts us is pods stuck in a Running state after the node has been removed; in that case the Deployment is not able to spin up a new replica. Any recommendations on how to discover and deal with such pods?

For now, we are mitigating this with a CronJob that runs every 10 minutes with the following Python code.

requirements.txt

kubernetes==26.1.0 ; python_version >= "3.11" and python_version < "4.0"
loguru==0.6.0 ; python_version >= "3.11" and python_version < "4.0"

garbage_collector/__init__.py (empty file)

garbage_collector/__main__.py

from garbage_collector.kubernetes_client import get_core_v1_api, get_orphan_pods, kill_pods

if __name__ == "__main__":
    v1 = get_core_v1_api()
    pods = get_orphan_pods(v1)
    kill_pods(v1, pods)

garbage_collector/kubernetes_client.py

from typing import List

from kubernetes import client, config
from kubernetes.client.models import V1NodeList, V1Pod, V1PodList
from kubernetes.client.rest import ApiException
from loguru import logger


def get_core_v1_api() -> client.CoreV1Api:
    config.load_incluster_config()
    return client.CoreV1Api()


def get_orphan_pods(v1_client: client.CoreV1Api) -> List[V1Pod]:
    """
    Pods that seem to be running on nodes that don't exist anymore
    """
    nodes: V1NodeList = v1_client.list_node(watch=False)
    all_node_names = [no.metadata.name for no in nodes.items]
    logger.info(f"Nodes found: {all_node_names}")
    pods: V1PodList = v1_client.list_pod_for_all_namespaces(watch=False)
    # Skip unscheduled pods (node_name is None) so Pending pods aren't treated as orphans.
    return [
        po
        for po in pods.items
        if po.spec.node_name and po.spec.node_name not in all_node_names
    ]


def kill_pods(v1_client: client.CoreV1Api, pods: List[V1Pod]) -> None:
    logger.info(f"Orphan PODs found: {len(pods)}")
for pod_to_kill in pods:
        name = pod_to_kill.metadata.name
        namespace = pod_to_kill.metadata.namespace

        logger.info(f"Killing {namespace}/{name}")
        try:
            v1_client.delete_namespaced_pod(name, namespace, grace_period_seconds=0)
            logger.info(f"Killed {namespace}/{name}")
        except ApiException as k8s_exception:
            logger.warning(
                f"Exception when calling CoreV1Api->delete_namespaced_pod: {k8s_exception}",
            )
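
For what it’s worth, the same selection logic can be dry-run from a workstation before enabling deletion (a sketch; it assumes the package above is importable and swaps the in-cluster config for a local kubeconfig):

# dry_run.py (hypothetical): print what the CronJob would delete, without deleting.
from kubernetes import client, config

from garbage_collector.kubernetes_client import get_orphan_pods

config.load_kube_config()  # local kubeconfig instead of load_incluster_config()
v1 = client.CoreV1Api()
for pod in get_orphan_pods(v1):
    print(f"would delete {pod.metadata.namespace}/{pod.metadata.name}")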

In pod failure policy (and disruption conditions for that matter) it was a deliberate decision to not proceed with DELETE if the PATCH fails. It could be considered a bug if the condition isn’t added, given the KEP Design Details, and the user-facing docs. Not adding the condition would also undermine the usefulness of the pod failure policy.

When the PATCH request fails, the controller (such as PodGC) will retry. If the reason for the failure is transient, it would eventually succeed (of course, transient failures could also happen to the DELETE request itself).

I think the issue is that the PATCH fails permanently due to an inconsistency between validation (which allows duplicated env vars) and what SSA can handle (it cannot process requests with duplicated entries). Thus, IIUC, the bug is in SSA or in the validation code. IMO this is a validation issue, because duplicated entries are probably a user mistake that is simply not picked up by validation. Still, maybe SSA could support them nevertheless.

If this issue cannot be resolved at the SSA level or validation, then I see three options:

  1. Accept it as a known issue (add it to known issues) and improve documentation (add a warning about using PodDisruptionConditions)
  2. Replace SSA with regular patches; while this goes against the long-term plan of migrating Kubernetes to SSA, it would solve this scenario for the time being (a sketch of what such a patch looks like follows below)
  3. Introduce a knob (either in the control plane or per pod) to proceed with the delete if the patch fails (potentially after a couple of retries, so that the condition isn't dropped because of a single dropped PATCH request), but this may result in worse throughput and other user complaints if the root cause isn't fixed

/cc @alculquicondor
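
To make option 2 concrete, here is a rough illustration using the Python client and the pod from the audit log above (the real controller is written in Go; this is only a sketch of the idea, not the actual fix): add the DisruptionTarget condition with a regular patch on the status subresource instead of a server-side apply, so the request avoids the apply path that rejects duplicate list-map keys.

# Illustration only: roughly what "regular patches instead of SSA" means.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

condition_patch = {
    "status": {
        "phase": "Failed",
        "conditions": [
            {
                "type": "DisruptionTarget",
                "status": "True",
                "reason": "DeletionByPodGC",
                "message": "PodGC: node no longer exists",
            }
        ],
    }
}

# A plain (non-apply) patch on /status; pod and namespace names are taken from
# the audit log above.
v1.patch_namespaced_pod_status("app-b-5894548cb-7tssd", "app-b", condition_patch)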

What you suggest is tracked in #113482, but the solution is not trivial. Your proposal by itself is not backwards compatible.

@liggitt any recommendation how to proceed?

Sorry, I just caught up here.

The SSA fix for this is likely to be complex / subtle as @apelisse noted, and is not something I think we should backport to 1.26/1.27/1.28 (which is where this issue is surfacing).

At first glance, I would do one of the following for 1.26/1.27/1.28:

  • disable the PodDisruptionConditions feature gate by default
  • make the things adding the condition proceed with deletion even if they can’t add the condition using SSA
  • switch the things adding the condition to use another method (update or jsonpatch or something) instead of SSA

Bumping the thread by cross-referencing another issue report on this: ~https://github.com/kubernetes/kubernetes/issues/118261~ https://github.com/kubernetes/kubernetes/issues/118741.

@kubernetes/sig-api-machinery-members we would like to get input on the preferred way of fixing this that could be cherry-picked down to 1.26.

As a potential band-aid that might be feasible to backport, could we cordon off the problem case further and limit failures to only requests that are actually conflicting in a way we can’t handle without major changes?

I’m imagining this being a rule like: Allow apply operations to listType=maps with duplicates when (a) the apply configuration keys do not intersect with any duplicated keys and (b) the field manager does not have field ownership of the duplicated keys.

This doesn’t fix the problem, but it dramatically narrows down which server side apply requests will be impacted by this.

Note that it is impossible to create duplicate key entries in a listType=map with server-side apply, so in practice I think this would work really well. Only field managers that make both update and apply requests, or that make requests with keys that conflict with other field managers, are still at risk of hitting the problem once this band-aid is applied.

We should fix all k8s controllers that use SSA, for the purpose of this issue.

Looking at the current validation code (as of v1.27), you shouldn’t be able to create these Pods. I’m trying to check whether this validation was added in v1.25 or v1.26, but it looks like it was there before.
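
One quick way to check what validation actually accepts on a given cluster is a server-side dry-run create of a Pod with a duplicated env var (a sketch; the pod name and namespace are illustrative):

# Sketch: does this cluster's validation reject a duplicated env var name?
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
v1 = client.CoreV1Api()

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "dup-env-check"},
    "spec": {
        "containers": [
            {
                "name": "nginx",
                "image": "nginx:latest",
                "env": [
                    {"name": "MY_VAR", "value": "value-1"},
                    {"name": "MY_VAR", "value": "value-2"},  # duplicate on purpose
                ],
            }
        ],
    },
}

try:
    # dry_run="All" asks the API server to run admission and validation without persisting.
    v1.create_namespaced_pod("default", pod, dry_run="All")
    print("accepted: duplicated env var names pass validation on this server")
except ApiException as exc:
    print(f"rejected: {exc.status} {exc.reason}")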