kubernetes: Pod garbage collector fails to clean up Pods from nodes that no longer exist
What happened?
This started after upgrading from Kubernetes 1.24 to 1.26.
The succession of events we observed:
- Our workloads use a HorizontalPodAutoscaler to scale up when traffic increases
- Cluster Autoscaler provisions new nodes to accommodate the new replicas
- Traffic goes down
- The HPA scales the replicas back down
- Cluster Autoscaler notices nodes below the usage threshold whose Pods can be accommodated on other nodes
- Those nodes are drained and removed
- Kubernetes still reports Pods in a "Running" or "Terminating" state on the node that no longer exists
- The Kubernetes control plane reports "orphan pods" in its audit logs
What did you expect to happen?
The Pod garbage collector to clean up these Pods after the node is gone.
How can we reproduce it (as minimally and precisely as possible)?
Deploy the following Deployment manifests into a Kubernetes 1.26 cluster and terminate one of the nodes.
(Duplicated port key: the same "containerPort + protocol" pair is declared twice)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app.kubernetes.io/name: nginx
    app.kubernetes.io/instance: nginx
    app.kubernetes.io/component: deployment
spec:
  replicas: 100
  selector:
    matchLabels:
      app.kubernetes.io/name: nginx
      app.kubernetes.io/instance: nginx
      app.kubernetes.io/component: deployment
  template:
    metadata:
      labels:
        app.kubernetes.io/name: nginx
        app.kubernetes.io/instance: nginx
        app.kubernetes.io/component: deployment
    spec:
      restartPolicy: Always
      containers:
        - name: nginx
          image: "nginx:latest"
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
            - containerPort: 8080
              name: metrics
              protocol: TCP
(Duplicated environment variable. Set twice by mistake)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-2
  labels:
    app.kubernetes.io/name: nginx-2
    app.kubernetes.io/instance: nginx-2
    app.kubernetes.io/component: deployment
spec:
  replicas: 100
  selector:
    matchLabels:
      app.kubernetes.io/name: nginx-2
      app.kubernetes.io/instance: nginx-2
      app.kubernetes.io/component: deployment
  template:
    metadata:
      labels:
        app.kubernetes.io/name: nginx-2
        app.kubernetes.io/instance: nginx-2
        app.kubernetes.io/component: deployment
    spec:
      restartPolicy: Always
      containers:
        - name: nginx-2
          image: "nginx:latest"
          imagePullPolicy: IfNotPresent
          env:
            - name: MY_VAR
              value: value-1
            - name: MY_VAR
              value: value-2
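Until the fix is rolled out, it can help to know which workloads carry these duplicates. Below is a hypothetical helper (not part of the original report) that scans pod specs for duplicated env-var names and containerPort/protocol pairs, assuming the official kubernetes Python client and a working kubeconfig:

from collections import Counter
from kubernetes import client, config

# Flag every container whose env vars or ports contain the duplicate
# list-map keys that make PodGC's server-side apply request fail.
config.load_kube_config()
v1 = client.CoreV1Api()
for pod in v1.list_pod_for_all_namespaces().items:
    containers = (pod.spec.containers or []) + (pod.spec.init_containers or [])
    for c in containers:
        env_counts = Counter(e.name for e in (c.env or []))
        port_counts = Counter((p.container_port, p.protocol) for p in (c.ports or []))
        dups = [k for k, n in list(env_counts.items()) + list(port_counts.items()) if n > 1]
        if dups:
            print(pod.metadata.namespace, pod.metadata.name, c.name, dups)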
Anything else we need to know?
(Duplicated port error)
{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "RequestResponse",
  "auditID": "9d3c7dbf-f599-422b-866c-84d52f3b1a22",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/namespaces/app-b/pods/app-b-5894548cb-7tssd/status?fieldManager=PodGC\u0026force=true",
  "verb": "patch",
  "user": {
    "username": "system:serviceaccount:kube-system:pod-garbage-collector",
    "uid": "f099fed7-6a3d-4a3b-bc1b-49c668276d76",
    "groups": [
      "system:serviceaccounts",
      "system:serviceaccounts:kube-system",
      "system:authenticated"
    ]
  },
  "sourceIPs": ["172.16.38.214"],
  "userAgent": "kube-controller-manager/v1.26.4 (linux/amd64) kubernetes/4a34796/system:serviceaccount:kube-system:pod-garbage-collector",
  "objectRef": {
    "resource": "pods",
    "namespace": "app-b",
    "name": "app-b-5894548cb-7tssd",
    "apiVersion": "v1",
    "subresource": "status"
  },
  "responseStatus": {
    "metadata": {},
    "status": "Failure",
    "message": "failed to create manager for existing fields: failed to convert new object (app-b/app-b-5894548cb-7tssd; /v1, Kind=Pod) to smd typed: .spec.containers[name=\"app-b\"].ports: duplicate entries for key [containerPort=8082,protocol=\"TCP\"]",
    "code": 500
  },
  "requestObject": {
    "kind": "Pod",
    "apiVersion": "v1",
    "metadata": {
      "name": "app-b-5894548cb-7tssd",
      "namespace": "app-b"
    },
    "status": {
      "phase": "Failed",
      "conditions": [
        {
          "type": "DisruptionTarget",
          "status": "True",
          "lastTransitionTime": "2023-05-23T17:00:55Z",
          "reason": "DeletionByPodGC",
          "message": "PodGC: node no longer exists"
        }
      ]
    }
  },
  "responseObject": {
    "kind": "Status",
    "apiVersion": "v1",
    "metadata": {},
    "status": "Failure",
    "message": "failed to create manager for existing fields: failed to convert new object (app-b/app-b-5894548cb-7tssd; /v1, Kind=Pod) to smd typed: .spec.containers[name=\"app-b\"].ports: duplicate entries for key [containerPort=8082,protocol=\"TCP\"]",
    "code": 500
  },
  "requestReceivedTimestamp": "2023-05-23T17:00:55.648887Z",
  "stageTimestamp": "2023-05-23T17:00:55.652513Z",
  "annotations": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"system:controller:pod-garbage-collector\" of ClusterRole \"system:controller:pod-garbage-collector\" to ServiceAccount \"pod-garbage-collector/kube-system\""
  }
}
(Duplicated env var error)
{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "RequestResponse",
  "auditID": "9ffc9212-3d74-4b86-98bb-5e6f0c5395b1",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/namespaces/app-a/pods/app-a-7b7ddc5874-c85hq/status?fieldManager=PodGC\u0026force=true",
  "verb": "patch",
  "user": {
    "username": "system:serviceaccount:kube-system:pod-garbage-collector",
    "uid": "f099fed7-6a3d-4a3b-bc1b-49c668276d76",
    "groups": [
      "system:serviceaccounts",
      "system:serviceaccounts:kube-system",
      "system:authenticated"
    ]
  },
  "sourceIPs": ["172.16.38.214"],
  "userAgent": "kube-controller-manager/v1.26.4 (linux/amd64) kubernetes/4a34796/system:serviceaccount:kube-system:pod-garbage-collector",
  "objectRef": {
    "resource": "pods",
    "namespace": "app-a",
    "name": "app-a-7b7ddc5874-c85hq",
    "apiVersion": "v1",
    "subresource": "status"
  },
  "responseStatus": {
    "metadata": {},
    "status": "Failure",
    "message": "failed to create manager for existing fields: failed to convert new object (app-a/app-a-7b7ddc5874-c85hq; /v1, Kind=Pod) to smd typed: errors:\n .spec.containers[name=\"app-a\"].env: duplicate entries for key [name=\"RABBITMQ_HOST\"]\n .spec.containers[name=\"app-a\"].env: duplicate entries for key [name=\"RABBITMQ_PORT\"]\n .spec.initContainers[name=\"db-migration\"].env: duplicate entries for key [name=\"RABBITMQ_HOST\"]\n .spec.initContainers[name=\"db-migration\"].env: duplicate entries for key [name=\"RABBITMQ_PORT\"]",
    "code": 500
  },
  "requestObject": {
    "kind": "Pod",
    "apiVersion": "v1",
    "metadata": {
      "name": "app-a-7b7ddc5874-c85hq",
      "namespace": "app-a"
    },
    "status": {
      "phase": "Failed",
      "conditions": [
        {
          "type": "DisruptionTarget",
          "status": "True",
          "lastTransitionTime": "2023-05-23T17:00:55Z",
          "reason": "DeletionByPodGC",
          "message": "PodGC: node no longer exists"
        }
      ]
    }
  },
  "responseObject": {
    "kind": "Status",
    "apiVersion": "v1",
    "metadata": {},
    "status": "Failure",
    "message": "failed to create manager for existing fields: failed to convert new object (app-a/app-a-7b7ddc5874-c85hq; /v1, Kind=Pod) to smd typed: errors:\n .spec.containers[name=\"app-a\"].env: duplicate entries for key [name=\"RABBITMQ_HOST\"]\n .spec.containers[name=\"app-a\"].env: duplicate entries for key [name=\"RABBITMQ_PORT\"]\n .spec.initContainers[name=\"db-migration\"].env: duplicate entries for key [name=\"RABBITMQ_HOST\"]\n .spec.initContainers[name=\"db-migration\"].env: duplicate entries for key [name=\"RABBITMQ_PORT\"]",
    "code": 500
  },
  "requestReceivedTimestamp": "2023-05-23T17:00:55.632119Z",
  "stageTimestamp": "2023-05-23T17:00:55.637338Z",
  "annotations": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"system:controller:pod-garbage-collector\" of ClusterRole \"system:controller:pod-garbage-collector\" to ServiceAccount \"pod-garbage-collector/kube-system\""
  }
}
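To enumerate every pod affected on a given cluster, the API server audit log can be filtered for failed PodGC status patches like the two events above. A minimal sketch, assuming JSON-lines audit output (hypothetical helper, not part of the report):

import json
import sys

def podgc_failures(path):
    # Yield (namespace, name) for pods whose PodGC status patch returned 500.
    with open(path) as f:
        for line in f:
            ev = json.loads(line)
            if ("fieldManager=PodGC" in ev.get("requestURI", "")
                    and ev.get("responseStatus", {}).get("code") == 500):
                ref = ev.get("objectRef", {})
                yield ref.get("namespace"), ref.get("name")

if __name__ == "__main__":
    for ns, name in podgc_failures(sys.argv[1]):
        print(ns, name)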
Kubernetes version
$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.1", GitCommit:"e4d4e1ab7cf1bf15273ef97303551b279f0920a9", GitTreeState:"clean", BuildDate:"2022-09-14T19:49:27Z", GoVersion:"go1.19.1", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26+", GitVersion:"v1.26.4-eks-0a21954", GitCommit:"4a3479673cb6d9b63f1c69a67b57de30a4d9b781", GitTreeState:"clean", BuildDate:"2023-04-15T00:33:09Z", GoVersion:"go1.19.8", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider
AWS EKS
OS version
# On Linux:
$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
$ uname -a
Linux ip-x-x-x-x.region.compute.internal 5.15.108 #1 SMP Tue May 9 23:54:26 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Install tools
Cluster Autoscaler
Container runtime (CRI) and version (if applicable)
containerd
Related plugins (CNI, CSI, …) and versions (if applicable)
CNI: Cilium 1.11
CSI: AWS EBS CSI
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 39
- Comments: 59 (49 by maintainers)
Commits related to this issue
- Don't use SSA in gcp-controller manager PodGC This change in analogous to: https://github.com/kubernetes/kubernetes/pull/121103. Required because of https://github.com/kubernetes/kubernetes/issues/1... — committed to mimowo/cloud-provider-gcp by mimowo 9 months ago
I’m working on this right now; the change is full of subtle nuances that I’m trying to deal with. I think I should be done soon, and it will definitely be merged for the next release, hopefully with a lot of time left for you to address the bugs it’s causing.
I have opened a PR to fix this issue for PodGC: https://github.com/kubernetes/kubernetes/pull/121103 by dropping SSA from the controller.
Fixes are cherry-picked and will be in 1.28.4 / 1.27.8 / 1.26.11.
We recently ran into this issue with a pod that has a duplicate environment variable on one of our first 1.27 test servers (we have various ways to “calculate” env vars or let users customize them in our Helm chart). We shut down the nodes overnight to reduce cost, and in the morning we found those pods stuck in the Terminating state. The API server logs indicated the root cause and brought us here.
Aside from the actual bug, I think the handling of environment variables should be reconsidered. From an API perspective they are a list (I guess there was a reason for this when the PodSpec was designed), but in the end they become a map, so there are basically two options to fix this: always raise an error when env vars (or ports, or any other “this could have been a map” list entries) are duplicated, forcing users to resolve the issue rather than ignore it like we did (and we didn’t even do so consciously; GitOps tools don’t care about warnings), OR at least define a well-defined order of precedence (e.g., “last specified value wins”).
Any idea when a fix will be available on 1.26?
we’ve created a cronjob that deletes pods that are stuck in Terminating (such pods have deletion_timestamp set in metadata) to keep the cluster cleaner. But the issue that impacts us is pods stuck in the Running state while the node has been removed. In this case, the Deployment is not able to spin up a new replica. Any recommendations on how to deal with/discover such pods?
xref https://github.com/kubernetes-sigs/structured-merge-diff/issues/234
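One way to discover them (a minimal sketch, assuming the official kubernetes Python client and a kubeconfig; not an official recommendation) is to list pods whose spec.nodeName no longer matches any Node object:

from kubernetes import client, config

# List pods bound to nodes the API server no longer knows about; these are
# the orphans, whether they show as Running or Terminating.
config.load_kube_config()
v1 = client.CoreV1Api()
live_nodes = {node.metadata.name for node in v1.list_node().items}
for pod in v1.list_pod_for_all_namespaces().items:
    if pod.spec.node_name and pod.spec.node_name not in live_nodes:
        print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)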
For now, we are mitigating this with a CronJob that runs every 10 minutes with this Python code:
- requirements.txt
- garbage_collector/__init__.py (empty file)
- garbage_collector/__main__.py
- garbage_collector/kubernetes_client.py
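The scripts themselves are not reproduced here; a minimal sketch of what such a cleanup job could look like (hypothetical, assuming the official kubernetes Python client running in-cluster with a ServiceAccount allowed to list nodes/pods and delete pods):

from kubernetes import client, config

def clean_orphaned_pods():
    config.load_incluster_config()  # the CronJob runs inside the cluster
    v1 = client.CoreV1Api()
    live_nodes = {n.metadata.name for n in v1.list_node().items}
    for pod in v1.list_pod_for_all_namespaces().items:
        stuck_terminating = pod.metadata.deletion_timestamp is not None
        node_gone = pod.spec.node_name and pod.spec.node_name not in live_nodes
        if node_gone and (stuck_terminating or pod.status.phase == "Running"):
            # Force-delete: skip the grace period, since the kubelet that
            # would normally confirm termination no longer exists.
            v1.delete_namespaced_pod(
                pod.metadata.name, pod.metadata.namespace,
                grace_period_seconds=0,
                propagation_policy="Background")

if __name__ == "__main__":
    clean_orphaned_pods()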
In pod failure policy (and disruption conditions for that matter) it was a deliberate decision to not proceed with DELETE if the PATCH fails. It could be considered a bug if the condition isn’t added, given the KEP Design Details, and the user-facing docs. Not adding the condition would also undermine the usefulness of the pod failure policy.
When the PATCH request fails, the controller, such as PodGC, will retry. If the reason for the failure is transient it would succeed eventually (ofc transient failures could also happen to the DELETE request itself).
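As an illustration of that ordering (a pseudo-Python sketch only; the real controller is written in Go, and the names below are not its actual API):

from kubernetes import client, config

def gc_orphaned_pod(namespace, name):
    config.load_kube_config()
    v1 = client.CoreV1Api()
    condition = {"status": {"phase": "Failed", "conditions": [{
        "type": "DisruptionTarget",
        "status": "True",
        "reason": "DeletionByPodGC",
        "message": "PodGC: node no longer exists",
    }]}}
    try:
        # Step 1: PATCH the DisruptionTarget condition. In v1.26 PodGC issues
        # this as server-side apply, which is what fails permanently when the
        # pod spec has duplicate list-map keys.
        v1.patch_namespaced_pod_status(name, namespace, condition)
    except client.ApiException:
        return  # deliberate: no DELETE if the PATCH fails; retried next sync
    # Step 2: DELETE only once the condition has been recorded.
    v1.delete_namespaced_pod(name, namespace)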
I think the issue is that the PATCH fails permanently due to an inconsistency between validation (which allows duplicated env vars) and what SSA can handle (it cannot process requests with duplicated entries). Thus, IIUC, the bug is in SSA or in the validation code. IMO this is a validation issue, because duplicated entries are probably a user mistake that validation simply does not pick up. Still, maybe SSA could support them nevertheless.
If this issue cannot be resolved at the SSA or validation level, then I see three options:
/cc @alculquicondor
What you suggest is tracked in #113482, but the solution is not trivial. Your proposal by itself is not backwards compatible.
Sorry, I just caught up here.
The SSA fix for this is likely to be complex / subtle as @apelisse noted, and is not something I think we should backport to 1.26/1.27/1.28 (which is where this issue is surfacing).
At first glance, I would do one of the following for 1.26/1.27/1.28:
Bumping the thread to cross-reference another report of this issue: ~https://github.com/kubernetes/kubernetes/issues/118261~ https://github.com/kubernetes/kubernetes/issues/118741.
@kubernetes/sig-api-machinery-members we would like to get input on the preferred way of fixing this that could be cherry-picked down to 1.26.
As a potential band-aid that might be feasible to backport, could we cordon off the problem case further and limit failures to only requests that are actually conflicting in a way we can’t handle without major changes?
I’m imagining this being a rule like: Allow apply operations to listType=maps with duplicates when (a) the apply configuration keys do not intersect with any duplicated keys and (b) the field manager does not have field ownership of the duplicated keys.
This doesn’t fix the problem, but it dramatically narrows down which server side apply requests will be impacted by this.
Note that it is impossible to create duplicate key entries in a listType=map with server-side apply, so in practice I think this would work really well. Only field managers that make both update and apply requests, or that make requests with keys conflicting with other field managers, are still at risk of hitting the problem once this band-aid is applied.
We should fix all k8s controllers that use SSA, for the purpose of this issue.
Looking at the current validation code (as of v1.27), you shouldn’t be able to create these Pods. I’m trying to check whether this validation was added in v1.25 or v1.26, but it looks like it was there before.
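A quick way to check what a given API server actually admits (hypothetical probe, assuming kubectl on PATH) is to server-side dry-run a pod with a duplicated env var, mirroring the repro manifests above:

import subprocess
import sys

# A pod spec with MY_VAR set twice, mirroring the repro above.
MANIFEST = """\
apiVersion: v1
kind: Pod
metadata:
  name: dup-env-probe
spec:
  restartPolicy: Never
  containers:
    - name: probe
      image: busybox
      command: ["true"]
      env:
        - name: MY_VAR
          value: value-1
        - name: MY_VAR
          value: value-2
"""

result = subprocess.run(
    ["kubectl", "apply", "--dry-run=server", "-f", "-"],
    input=MANIFEST, text=True, capture_output=True)
# A zero exit code means validation admitted the duplicate.
print(result.stdout or result.stderr)
sys.exit(result.returncode)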