kubernetes: Memory manager UnexpectedAdmissionError
What happened?
Dual-socket server with 96 threads total (2 sockets × 24 cores × 2 threads), ~192G of RAM, CPU and memory manager static policy, topologyManagerPolicy best-effort, 10Gi of RAM reserved on NUMA node 0, 1 core (2 threads) reserved on NUMA node 0.
kubeadm config:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
...
cpuManagerPolicy: static
reservedSystemCPUs: 0,48
memoryManagerPolicy: Static
reservedMemory:
- numaNode: 0
limits:
# systemReserved memory + evictionHard memory.available
memory: 10340Mi
systemReserved:
memory: 10240Mi
topologyManagerPolicy: best-effort
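For reference, the reservedMemory limit above is the systemReserved memory plus the evictionHard memory.available threshold; assuming the kubelet default of 100Mi for the latter (it is not set explicitly in the snippet above), the arithmetic works out as:
# 10240Mi (systemReserved) + 100Mi (default evictionHard memory.available) = 10340Mi
evictionHard:
  memory.available: "100Mi"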
If I try to allocate two Guaranteed pods with 85Gi of RAM each, one pod is admitted and the second one fails with UnexpectedAdmissionError, even though it would fit using memory from both NUMA nodes.
Since I'm using Deployments, a new pod is recreated right away, and you end up with a large number of failed pods in the UnexpectedAdmissionError state.
The error in the kubelet logs is:
E0918 12:16:32.162481 2865761 memory_manager.go:249] "Allocate error" err="[memorymanager] failed to find NUMA nodes to extend the current topology hint"
What did you expect to happen?
Either the pod should become Pending, or memory from both NUMA nodes should be used.
How can we reproduce it (as minimally and precisely as possible)?
Reserve some memory on NUMA node 0, then launch two identical Guaranteed pods with memory limits close to the per-NUMA-node maximum, so that one pod fits on NUMA node 1 but the second does not fit on NUMA node 0 (even though the sum fits on the server). A sketch of such a workload is shown below.
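A minimal sketch of such a Deployment, assuming the same 46 CPU / 85Gi figures as above (the Deployment name, labels, and image are placeholders, not taken from the original report):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: memhog                 # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: memhog
  template:
    metadata:
      labels:
        app: memhog
    spec:
      containers:
      - name: mycontainer
        image: registry.k8s.io/pause:3.9   # placeholder image
        resources:
          # requests == limits -> Guaranteed QoS, so the static CPU and
          # memory managers pin the container to exclusive CPUs and NUMA memory
          requests:
            cpu: "46"
            memory: 85Gi
          limits:
            cpu: "46"
            memory: 85Gi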
Anything else we need to know?
Here are the *_manager_state files.
If I launch one pod with 46 CPUs / 170Gi of RAM, it gets CPUs from one NUMA node and memory from both NUMA nodes:
# jq . <cpu_manager_state
{
"policyName": "static",
"defaultCpuSet": "0,24-48,72-95",
"entries": {
"d6c70f6d-adbe-4901-8b82-9505171b1368": {
"mycontainer": "1-23,49-71"
}
},
"checksum": 2697499474
}
# jq . <memory_manager_state
{
"policyName": "Static",
"machineState": {
"0": {
"numberOfAssignments": 1,
"memoryMap": {
"hugepages-1Gi": {
"total": 0,
"systemReserved": 0,
"allocatable": 0,
"reserved": 0,
"free": 0
},
"hugepages-2Mi": {
"total": 0,
"systemReserved": 0,
"allocatable": 0,
"reserved": 0,
"free": 0
},
"memory": {
"total": 99434803200,
"systemReserved": 10842275840,
"allocatable": 88592527360,
"reserved": 88592527360,
"free": 0
}
},
"cells": [
0,
1
]
},
"1": {
"numberOfAssignments": 1,
"memoryMap": {
"hugepages-1Gi": {
"total": 0,
"systemReserved": 0,
"allocatable": 0,
"reserved": 0,
"free": 0
},
"hugepages-2Mi": {
"total": 0,
"systemReserved": 0,
"allocatable": 0,
"reserved": 0,
"free": 0
},
"memory": {
"total": 101409165312,
"systemReserved": 0,
"allocatable": 101409165312,
"reserved": 93943582720,
"free": 7465582592
}
},
"cells": [
0,
1
]
}
},
"entries": {
"d6c70f6d-adbe-4901-8b82-9505171b1368": {
"mycontainer": [
{
"numaAffinity": [
0,
1
],
"type": "memory",
"size": 182536110080
}
]
}
},
"checksum": 382005683
}
If I launch a pod with 46 CPUs / 85Gi of RAM, it is properly placed on NUMA node 1:
# jq . <cpu_manager_state
{
"policyName": "static",
"defaultCpuSet": "0-23,47-71,95",
"entries": {
"62f6c733-dae0-4c97-a5a9-bbb5c4757a07": {
"mycontainer": "24-46,72-94"
}
},
"checksum": 3342740372
}
# jq . <memory_manager_state
{
"policyName": "Static",
"machineState": {
"0": {
"numberOfAssignments": 0,
"memoryMap": {
"hugepages-1Gi": {
"total": 0,
"systemReserved": 0,
"allocatable": 0,
"reserved": 0,
"free": 0
},
"hugepages-2Mi": {
"total": 0,
"systemReserved": 0,
"allocatable": 0,
"reserved": 0,
"free": 0
},
"memory": {
"total": 99434803200,
"systemReserved": 10842275840,
"allocatable": 88592527360,
"reserved": 0,
"free": 88592527360
}
},
"cells": [
0
]
},
"1": {
"numberOfAssignments": 1,
"memoryMap": {
"hugepages-1Gi": {
"total": 0,
"systemReserved": 0,
"allocatable": 0,
"reserved": 0,
"free": 0
},
"hugepages-2Mi": {
"total": 0,
"systemReserved": 0,
"allocatable": 0,
"reserved": 0,
"free": 0
},
"memory": {
"total": 101409165312,
"systemReserved": 0,
"allocatable": 101409165312,
"reserved": 91268055040,
"free": 10141110272
}
},
"cells": [
1
]
}
},
"entries": {
"62f6c733-dae0-4c97-a5a9-bbb5c4757a07": {
"mycontainer": [
{
"numaAffinity": [
1
],
"type": "memory",
"size": 91268055040
}
]
}
},
"checksum": 3166358723
}
If I try to launch two pods with 46 CPUs / 85Gi of RAM, one fails with:
E0918 12:16:32.162481 2865761 memory_manager.go:249] "Allocate error" err="[memorymanager] failed to find NUMA nodes to extend the current topology hint"
If I try to launch two pods with 46 CPUs / 80Gi of RAM, everything works (consistent with the state above: NUMA node 0 only has 88592527360 bytes, about 82.5Gi, allocatable after the reservation, so a second 85Gi pod cannot fit on a single NUMA node, while an 80Gi one can).
Kubernetes version
1.26.8
Cloud provider
NONE / bare-metal
OS version
Alma 8.8 base + rpm-ostree
Install tools
kubeadm
Container runtime (CRI) and version (if applicable)
containerd 1.7.5
Related plugins (CNI, CSI, …) and versions (if applicable)
No response
About this issue
- Original URL
- State: open
- Created 9 months ago
- Comments: 17 (17 by maintainers)
NP. I won’t be here next week either, so no rush
/assign @ffromani