kube-state-metrics: kube_pod_container_resource_requests does not adequately reflect reality in Kube
kube_pod_container_resource_requests is somewhat useful, but not on its own. It is the most naive possible view of a pod, and it misses a number of important nuances of the pod resource model (wearing my kube architecture hat right now):
In general, there are three major phases in a pod's lifetime.
- Pre-scheduling (no nodeName set) - resources the scheduler is trying to place (roughly the Pending phase)
- Post-scheduling running phase - resources the node has to set aside for the pod (roughly the Pending->Running phases)
- Terminal pods - pods that have run to completion and will never run again - the kubelet and scheduler ignore these pods. Succeeded, Failed, and Evicted pods fall into this category.
Within the running phase there are two sub-phases - pod initialization and container execution. The effective request for a pod resource is max(max(init_containers), sum(containers)), and the scheduler is required to schedule that value. We also need to leave room to deal with ephemeral containers or sidecars in the future.
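To make that calculation concrete with invented numbers: a pod with one init container requesting 1 CPU and two app containers requesting 300m each has an effective request of max(max(1000m), 300m + 300m) = max(1000m, 600m) = 1000m. Summing only the app containers - all that kube_pod_container_resource_requests lets you do today, since init containers aren't exposed - reports 600m, understating what the scheduler actually reserved by 400m.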
There are several common queries that I, as a kube administrator, would want to run but cannot today:
- What does the scheduler think the capacity of the node is (pods that are scheduled to a node and non-terminal)
- How much pending capacity do I have (pods that are not yet scheduled)
- What is the running capacity of a (namespace / container / set of containers / pod / set of pods)
- What is the usage profile of a running pod RELATIVE to its phase and requests, taking into account initialization and execution (e.g. are my init requests sized too high?)
To correctly form the queries above, the user has to join up to four sources - scheduled state, pod phase (which is going to be subtly wrong because phase isn't the true boundary), pod terminal state (which is not exposed as a single metric because of the vagaries of the pod lifecycle), and init containers (which are not available at all).
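As a rough sketch of that pain (assuming today's kube-state-metrics series names, which differ across versions), even the closest approximation of "non-terminal requested CPU" needs a phase join, and it is still blind to init containers:

# Closest approximation available today: requests from pods currently
# Pending or Running (still misses init container requests entirely)
sum(
  kube_pod_container_resource_requests{resource="cpu"}
  * on(namespace, pod) group_left()
  max by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Running"})
)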
This is a pretty big capacity planning gap that I've noticed as I've spent a fair amount of time tuning infrastructure and workload resource usage. I would suggest that we expose a new metric that uses the rules defined on the pod API (which are not specific to the kubelet or scheduler but apply to any component) to enable those queries.
The metric might be kube_pod_resource_requests (to imply this is the correct metric to use for a normal human). I would probably suggest the following labels (we could debate the actual slicing for maximum flexibility):
- lifecycle (it's broader than phase; we'd need to establish some official terms in the Kube API, but I can handle that): values Pending, Scheduled, Terminated
- node (it should be trivial to query the resources the scheduler is allocating to a node): an empty value means unscheduled, any other value is a node name
- name: the pod name
- container: if empty, should be the sum of the pod resources according to the calculation above. Should include init containers.
- container_type: always set if container is set; should be init, container, ephemeral, or sidecar (future). If container is not set, the allowed values are empty and current. The current value is always set to the effective max given the pod's current phase (max(init, sum(containers)) while pending, based on the currently running init container while the pod is initializing, the sum of all containers and ephemeral containers during execution, and 0 afterwards)
- type: same values as today.
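To illustrate the label scheme (pod name, node, and values are invented, reusing the calculation example above: one init container at 1 CPU, two app containers at 300m each), the series for a scheduled pod that is still initializing might look like:

# Hypothetical series for one scheduled, still-initializing pod
kube_pod_resource_requests{lifecycle="Scheduled",node="node-1",namespace="ns",name="mypod",container="",type="cpu"} 1
# current tracks the effective value for the pod's phase; it would drop to
# 0.6 once initialization completes and the app containers are executing
kube_pod_resource_requests{lifecycle="Scheduled",node="node-1",namespace="ns",name="mypod",container="",container_type="current",type="cpu"} 1
kube_pod_resource_requests{lifecycle="Scheduled",node="node-1",namespace="ns",name="mypod",container="init-db",container_type="init",type="cpu"} 1
kube_pod_resource_requests{lifecycle="Scheduled",node="node-1",namespace="ns",name="mypod",container="app-1",container_type="container",type="cpu"} 0.3
kube_pod_resource_requests{lifecycle="Scheduled",node="node-1",namespace="ns",name="mypod",container="app-2",container_type="container",type="cpu"} 0.3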
I think this metric should replace kube_pod_container_resource_requests in general, because without the context of lifecycle that metric is not useful.
Example queries:
# How much pending CPU is not yet scheduled
sum(kube_pod_resource_requests{lifecycle="Pending",container="",type="cpu"})
# How much CPU is scheduled on a given node right now (what does the scheduler see)
sum(kube_pod_resource_requests{lifecycle="Scheduled",container="",type="cpu",node="foo"})
# How much CPU is scheduled on all nodes
sum(kube_pod_resource_requests{lifecycle="Scheduled",container="",type="cpu",node!=""})
# Show usage vs capacity requested for a given pod over the full lifecycle, showing you exactly where the over or under provisioning in any given phase is
kube_pod_resource_requests{container="current",lifecycle="Scheduled",type="cpu",name="NAME",namespace="NS"} - max by (...) (rate(container_cpu_usage_seconds_total{name="NAME",namespace="NS",container=""}[5m]))
/kind feature
I agree that if our goal is to reflect scheduling state then the scheduler would be the right place 🙂.