agones: GameServer stuck on state Scheduled when Pod failed with reason OutOfpods
What happened:
Agones didn’t create a new Pod when a Pod failed with reason OutOfpods, and the GameServer remained stuck in state Scheduled.
What you expected to happen:
Agones is expected to create a new Pod if the GameServer's Pod fails with reason OutOfpods.
How to reproduce it (as minimally and precisely as possible):
- Put the following manifest in `/etc/kubernetes/manifests/static-pod.manifest` on the testing node:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: kube-system
  labels:
    component: nginx
    tier: node
spec:
  hostNetwork: true
  containers:
  - name: nginx
    image: nginx:1.14.2
    imagePullPolicy: IfNotPresent
    ports:
    - containerPort: 80
    resources:
      requests:
        cpu: 100m
  priorityClassName: system-node-critical
  priority: 2000001000
  tolerations:
  - effect: NoExecute
    operator: Exists
  - effect: NoSchedule
    operator: Exists
```
- Set the Fleet replicas to the pod capacity of the node.
- Confirm that some of the gameserver Pods are stuck in state Pending.
- Forcibly delete the static Pod created in step (1) from kube-system:

```
kubectl delete pod --force --grace-period=0 <static-pod-name> -n kube-system
```

All gameserver Pods that were stuck in state Pending become Failed with reason OutOfpods.
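To confirm the resulting state, something like the following should work (assuming Agones' standard `agones.dev/role=gameserver` Pod label and the default namespace):

```shell
# Failed gameserver Pods, with their failure reason
kubectl get pods -l agones.dev/role=gameserver \
  --field-selector=status.phase=Failed \
  -o custom-columns=NAME:.metadata.name,REASON:.status.reason

# GameServers still reported as Scheduled despite the Pod failure
kubectl get gameservers
```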
Anything else we need to know?:
Here is the Pod status that I reproduced:
```yaml
status:
  message: 'Pod Node didn''t have enough resource: pods, requested: 1, used: 32, capacity: 32'
  phase: Failed
  reason: OutOfpods
```
I created the Fleet from the official documentation.
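For reference, a Fleet along the lines of the documentation's quickstart example looks roughly like this (name, image tag, and port here are illustrative, not a verbatim copy):

```yaml
apiVersion: agones.dev/v1
kind: Fleet
metadata:
  name: simple-game-server
spec:
  replicas: 2
  template:
    spec:
      ports:
      - name: default
        containerPort: 7654
      template:
        spec:
          containers:
          - name: simple-game-server
            image: gcr.io/agones-images/simple-game-server:0.12
```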
Environment:
- Agones version: 1.20.0
- Kubernetes version (use `kubectl version`): Client Version: v1.21.0, Server Version: v1.21.12-gke.1500
- Cloud provider or hardware configuration: GKE
- Install method (yaml/helm): yaml
- Troubleshooting guide log(s):
- Others:
About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 49 (37 by maintainers)
I would like to chime in here, as it seems like it's the same issue.

There is a relatively fresh Kubernetes feature - https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown - and it seems it can lead to Pods being transitioned into the Failed state too.

I think the controlling GameServer should indeed be moved to Unhealthy here as the correct reaction. I don't have a concrete way to reproduce it, unfortunately, as we encountered the issue in production on a loaded cluster. But I can guess that this will happen if the node's `shutdownGracePeriod` and `shutdownGracePeriodCriticalPods` are not 0 (which enables the feature) but are not long enough to actually terminate the containers inside the Pod gracefully, because the containers have a bigger `terminationGracePeriodSeconds` and actually use it up.
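For illustration, a kubelet configuration sketch with graceful node shutdown enabled but with a budget that a long-draining container could exceed (the two shutdown fields are the real KubeletConfiguration settings; the values are made up):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Total time the node delays shutdown to drain Pods.
shutdownGracePeriod: 30s
# Portion of that time reserved for critical Pods.
shutdownGracePeriodCriticalPods: 10s
```

A Pod whose `terminationGracePeriodSeconds` is, say, 60 would be killed before it finishes terminating, which matches the Failed transitions described above.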