agones: GameServer stuck in state Scheduled when Pod fails with reason OutOfpods
What happened:
Agones didn’t create a new Pod when the previous Pod failed with reason OutOfpods, and the GameServer got stuck in state Scheduled.
What you expected to happen:
The GameServer is expected to get a new Pod if its Pod fails with reason OutOfpods.
How to reproduce it (as minimally and precisely as possible):
- Put the following manifest in /etc/kubernetes/manifests/static-pod.manifest on the test node:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: kube-system
  labels:
    component: nginx
    tier: node
spec:
  hostNetwork: true
  containers:
  - name: nginx
    image: nginx:1.14.2
    imagePullPolicy: IfNotPresent
    ports:
    - containerPort: 80
    resources:
      requests:
        cpu: 100m
  priorityClassName: system-node-critical
  priority: 2000001000
  tolerations:
  - effect: NoExecute
    operator: Exists
  - effect: NoSchedule
    operator: Exists
- Set the Fleet replicas to the pod capacity of the node (see the verification commands after these steps).
- Confirm that some of the GameServer Pods are stuck in state Pending.
- Forcibly delete the static Pod created in step (1) from kube-system:
kubectl delete pod --force --grace-period=0 <static-pod-name> -n kube-system
 
All GameServer Pods that were stuck in state Pending then fail with reason OutOfpods. (Presumably the scheduler now sees free capacity on the node, but the kubelet is still running the static Pod, so its admission check rejects the newly scheduled Pods.)
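
To verify the setup, these commands can be used (a sketch; <node-name> is a placeholder for the test node):

# Pod capacity of the test node -- the value to use for the Fleet replicas
kubectl get node <node-name> -o jsonpath='{.status.capacity.pods}'

# After the force delete: GameServer Pods that ended up in phase Failed
kubectl get pods --field-selector status.phase=Failed

# The affected GameServers remain in state Scheduled
kubectl get gameservers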
Anything else we need to know?:
Here is the Pod status that I reproduced:
status:
  message: 'Pod Node didn''t have enough resource: pods, requested: 1, used: 32, capacity:
    32'
  phase: Failed
  reason: OutOfpods
I created the Fleet from the official documentation.
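
For reference, a Fleet along the lines of the quickstart in the official docs (a sketch: the name, image tag, and resource requests are assumptions; replicas is set to the node's pod capacity per step (2) above):

apiVersion: agones.dev/v1
kind: Fleet
metadata:
  name: simple-game-server
spec:
  replicas: 32  # node pod capacity from the repro
  template:
    spec:
      ports:
      - name: default
        containerPort: 7654
      template:
        spec:
          containers:
          - name: simple-game-server
            image: gcr.io/agones-images/simple-game-server:0.12  # tag is an assumption
            resources:
              requests:
                cpu: 20m
                memory: 64Mi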
Environment:
- Agones version: 1.20.0
- Kubernetes version (use kubectl version): Client Version: v1.21.0, Server Version: v1.21.12-gke.1500
- Cloud provider or hardware configuration: GKE
 - Install method (yaml/helm): yaml
 - Troubleshooting guide log(s):
 - Others:
 
I would like to chime in here as it seems like it’s the same issue.
There is a relatively fresh Kubernetes feature, graceful node shutdown (https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown), and it seems it can lead to Pods being transitioned into the Failed state too.
I think the controlling GameServer should indeed be moved to Unhealthy here as the correct reaction. Unfortunately I don't have a concrete way to reproduce it, as we encountered the issue in production on a loaded cluster. But my guess is that it happens when the node's shutdownGracePeriod and shutdownGracePeriodCriticalPods are non-zero (enabling the feature) but not long enough to actually terminate the containers inside the Pod gracefully, because the containers have a bigger terminationGracePeriodSeconds and actually use it up.
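
For illustration, graceful node shutdown is enabled via the kubelet configuration; a minimal sketch of the mismatch described above (the durations are assumptions):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: 30s              # total delay before the node shuts down
shutdownGracePeriodCriticalPods: 10s  # of which this much is reserved for critical pods

With these values a regular Pod gets at most 20s to terminate, so a gameserver Pod with terminationGracePeriodSeconds of, say, 60 that actually uses its grace period would be killed mid-shutdown and can end up in the Failed phase.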