agones: GameServer stuck on state Scheduled when Pod failed with reason OutOfpods
What happened:
Agones didn’t create a new Pod when a Pod failed with reason OutOfpods, and the GameServer remained stuck in state Scheduled.
What you expected to happen:
Agones is expected to create a new Pod if the GameServer's Pod fails with reason OutOfpods.
How to reproduce it (as minimally and precisely as possible):
- Put the following manifest in `/etc/kubernetes/manifests/static-pod.manifest` on the testing node:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: kube-system
  labels:
    component: nginx
    tier: node
spec:
  hostNetwork: true
  containers:
  - name: nginx
    image: nginx:1.14.2
    imagePullPolicy: IfNotPresent
    ports:
    - containerPort: 80
    resources:
      requests:
        cpu: 100m
  priorityClassName: system-node-critical
  priority: 2000001000
  tolerations:
  - effect: NoExecute
    operator: Exists
  - effect: NoSchedule
    operator: Exists
```
- Set the Fleet replicas to the pod capacity of the node.
- Confirm that some of the gameserver Pods are stuck in state Pending.
- Forcibly delete the static Pod created in step (1) from kube-system:

```
kubectl delete pod --force --grace-period=0 <static-pod-name> -n kube-system
```

All gameserver Pods that were stuck in state Pending become Failed with reason OutOfpods.
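To confirm the resulting state, something like the following should work (assuming Agones' standard `agones.dev/role=gameserver` Pod label and the default namespace):

```shell
# Failed gameserver Pods, with their failure reason
kubectl get pods -l agones.dev/role=gameserver \
  --field-selector=status.phase=Failed \
  -o custom-columns=NAME:.metadata.name,REASON:.status.reason

# GameServers still reported as Scheduled despite the Pod failure
kubectl get gameservers
```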
Anything else we need to know?:
Here is the Pod status that I reproduced:
```yaml
status:
  message: 'Pod Node didn''t have enough resource: pods, requested: 1, used: 32, capacity: 32'
  phase: Failed
  reason: OutOfpods
```
I created the Fleet from the official documentation.
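For reference, a Fleet along the lines of the documentation's quickstart example looks roughly like this (name, image tag, and port here are illustrative, not a verbatim copy):

```yaml
apiVersion: agones.dev/v1
kind: Fleet
metadata:
  name: simple-game-server
spec:
  replicas: 2
  template:
    spec:
      ports:
      - name: default
        containerPort: 7654
      template:
        spec:
          containers:
          - name: simple-game-server
            image: gcr.io/agones-images/simple-game-server:0.12
```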
Environment:
- Agones version: 1.20.0
- Kubernetes version (use `kubectl version`): Client Version: v1.21.0, Server Version: v1.21.12-gke.1500
- Cloud provider or hardware configuration: GKE
- Install method (yaml/helm): yaml
- Troubleshooting guide log(s):
- Others:
About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 49 (37 by maintainers)
I would like to chime in here, as it seems like it's the same issue.

There is a relatively fresh Kubernetes feature - https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown - and it seems it can lead to Pods being transitioned into the Failed state too.

I think the controlling GameServer should indeed be moved to Unhealthy here as the correct reaction. I don't have a concrete way to reproduce it, unfortunately, as we encountered the issue in production on a loaded cluster. But I can guess that this will happen if the node's `shutdownGracePeriod` and `shutdownGracePeriodCriticalPods` are not 0 (which enables the feature) but are not long enough to actually terminate the containers inside the Pod gracefully, because the containers have a bigger `terminationGracePeriodSeconds` and actually use it up.
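For illustration, a kubelet configuration sketch with graceful node shutdown enabled but with a budget that a long-draining container could exceed (the two shutdown fields are the real KubeletConfiguration settings; the values are made up):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Total time the node delays shutdown to drain Pods.
shutdownGracePeriod: 30s
# Portion of that time reserved for critical Pods.
shutdownGracePeriodCriticalPods: 10s
```

A Pod whose `terminationGracePeriodSeconds` is, say, 60 would be killed before it finishes terminating, which matches the Failed transitions described above.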