kubernetes: Containers with "m" used as abbreviation in memory limits fail to start on Linux, and consume 1e+9× too much memory on Windows

Background

https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/

Limits and requests for memory are measured in bytes. You can express memory as a plain integer or as a fixed-point integer using one of these suffixes: E, P, T, G, M, K. You can also use the power-of-two equivalents: Ei, Pi, Ti, Gi, Mi, Ki. For example, the following represent roughly the same value:

128974848, 129e6, 129M, 123Mi
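As a sanity check, the decimal and binary suffixes multiply out as follows. This is plain arithmetic, not Kubernetes code; it just shows why those four spellings land on roughly the same byte count:

```go
package main

import "fmt"

func main() {
	// Decimal (SI) suffixes scale by powers of 10; binary suffixes by powers of 2.
	const M = 1_000_000 // "M"  = 10^6
	const Mi = 1 << 20  // "Mi" = 2^20 = 1_048_576

	fmt.Println("129e6 =", 129_000_000) // scientific notation
	fmt.Println("129M  =", 129*M)       // 129_000_000
	fmt.Println("123Mi =", 123*Mi)      // 128_974_848 -- the plain-integer form above
}
```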

Bug

If someone uses a lowercase suffix, for example m, the behavior is inconsistent between Windows and Linux. m is a valid suffix for CPU (millicores) but not for memory.

Note: the docker command line does support m, b, k and g suffixes; see the Docker resource constraints documentation.

On Linux, the m suffix is dropped if it’s included as part of a memory resource. A request such as 300m therefore creates an allocation of 300 bytes, not 300 megabytes or mebibytes. That’s far too small, and container startup will fail. See the full error logs below.

apiVersion: v1
kind: Pod
metadata:
  name: nginx-test2
spec:
  containers:
  - name: nginx
    image: library/nginx
    resources:
      limits:
        cpu: 1
        memory: 800m
    ports:
      - containerPort: 80
  nodeSelector:
    "beta.kubernetes.io/os": linux

On Windows, the m suffix is treated the same as M. This resource allocation will be 800 megabytes, and container startup will succeed.

apiVersion: v1
kind: Pod
metadata:
  name: iis-test2
spec:
  containers:
  - name: iis
    image: mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019
    resources:
      limits:
        cpu: 1
        memory: 800m
    ports:
      - containerPort: 80
  nodeSelector:
    "beta.kubernetes.io/os": windows

This would have the same effect on resources as docker run --memory 800m -d mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019 on Windows.

Proposed Fixes

There are a few different places this could be fixed. They’re not mutually exclusive, so I’m asking SIG-Node and API reviewers to consider the impact of each.

Parsing cpu & memory limits separately is not feasible as the same string parser is shared for all resource types.

Fix 1 - reject memory limits below a certain threshold in the admission controller

This would be the most consistent as the failure would happen immediately, and be consistent on both Windows & Linux. Let’s start with a strawman proposal of 10Mi as the container minimum. Windows actually needs more like 200Mi, but I think 10Mi would be enough to catch the most egregious mistakes such as “10m”.

Source Locations:

https://github.com/kubernetes/kubernetes/blob/f47da8a75d13590322a4e1df502cf6629b2c36fa/plugin/pkg/admission/limitranger/admission.go#L213
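A minimal sketch of what the Fix 1 check could look like. The function and constant names here (validateMemoryLimit, minContainerMemoryBytes) are made up for illustration; the real check would live in the limitranger admission plugin linked above and operate on resource.Quantity values rather than raw byte counts:

```go
package main

import "fmt"

// Strawman floor of 10Mi, as proposed above. Windows realistically needs
// ~200Mi, but 10Mi is enough to catch mistakes like "10m".
const minContainerMemoryBytes = 10 * 1024 * 1024

// validateMemoryLimit is a hypothetical admission-time check (Fix 1):
// reject any container memory limit below the floor, which catches
// milli-suffixed quantities since they parse to sub-byte values.
func validateMemoryLimit(limitBytes int64) error {
	if limitBytes < minContainerMemoryBytes {
		return fmt.Errorf("memory limit %d bytes is below the %d-byte minimum; did you mean Mi instead of m?",
			limitBytes, minContainerMemoryBytes)
	}
	return nil
}

func main() {
	fmt.Println(validateMemoryLimit(1))               // "800m" rounds to ~1 byte -> rejected
	fmt.Println(validateMemoryLimit(800 * 1024 * 1024)) // 800Mi -> accepted (nil)
}
```

The advantage of this placement is that the pod is rejected synchronously at creation time with a clear message, instead of being scheduled and then crash-looping in the sandbox.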

Fix 2 - reject memory limits with m in kubelet

This would fail later, but wouldn’t involve changes to the apiserver, common APIs or admission controller. The error messages can still be made consistent between Linux and Windows, but different minimums can be enforced on each side. This could be implemented in the code that creates the container configs before passing them to the CRI.

Source locations:

Linux: https://github.com/kubernetes/kubernetes/blob/33081c1f07be128a89441a39c467b7ea2221b23d/pkg/kubelet/kuberuntime/kuberuntime_container_linux.go#L38
Windows: https://github.com/kubernetes/kubernetes/blob/b39d8f4777b3a01359046d05b1de17298b818f2f/pkg/kubelet/kuberuntime/kuberuntime_container_windows.go#L34
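A sketch of what the Fix 2 guard could look like in the kubelet. The helper name (rejectMilliMemory) is made up, and a real implementation would inspect the parsed resource.Quantity (its suffix/scale) rather than the raw string as done here:

```go
package main

import (
	"fmt"
	"strings"
)

// rejectMilliMemory is a hypothetical kubelet-side guard (Fix 2): before the
// container config is handed to the CRI, refuse memory quantities written
// with the milli suffix. Note "Mi" and "M" do not end in a bare "m", so only
// milli-suffixed strings like "800m" are caught by this simplified check.
func rejectMilliMemory(memory string) error {
	if strings.HasSuffix(memory, "m") {
		return fmt.Errorf("memory quantity %q uses the milli (m) suffix, which is valid for CPU but not memory; did you mean Mi?", memory)
	}
	return nil
}

func main() {
	for _, s := range []string{"800m", "800Mi", "800M"} {
		fmt.Printf("%-5s -> %v\n", s, rejectMilliMemory(s))
	}
}
```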

Full error logs

Here’s what the status looks like for the failed pod:

$ kubectl get pod
NAME                                       READY   STATUS              RESTARTS   AGE
iis-test1                                  1/1     Running             0          10m
iis-test2                                  1/1     Running             0          10m
nginx-test1                                1/1     Running             0          10m
nginx-test2                                0/1     ContainerCreating   0          10m

patrick@planglx1:~/gist/a92766f23d39b442e5c8d92cb4ed05f2$ kubectl describe pod nginx-test2
Name:               nginx-test2
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               aks-nodepool1-20293662-vmss000000/10.240.0.4
Start Time:         Tue, 09 Jul 2019 23:44:09 +0000
Labels:             <none>
Annotations:        <none>
Status:             Pending
IP:                 
Containers:
  nginx:
    Container ID:   
    Image:          library/nginx
    Image ID:       
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  800m
    Requests:
      cpu:        100m
      memory:     300m
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-jrvwx (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-jrvwx:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-jrvwx
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                    From                                        Message
  ----     ------                  ----                   ----                                        -------
  Normal   Scheduled               10m                    default-scheduler                           Successfully assigned default/nginx-test2 to aks-nodepool1-20293662-vmss000000
  Warning  FailedCreatePodSandBox  10m                    kubelet, aks-nodepool1-20293662-vmss000000  Failed create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "nginx-test2": Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:293: copying bootstrap data to pipe caused \"write init-p: broken pipe\"": unknown
  Warning  FailedCreatePodSandBox  10m                    kubelet, aks-nodepool1-20293662-vmss000000  Failed create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "nginx-test2": Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:297: getting the final child's pid from pipe caused \"read init-p: connection reset by peer\"": unknown
  Warning  FailedCreatePodSandBox  5m52s (x230 over 10m)  kubelet, aks-nodepool1-20293662-vmss000000  Failed create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "nginx-test2": Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:297: getting the final child's pid from pipe caused \"EOF\"": unknown
  Normal   SandboxChanged          52s (x538 over 10m)    kubelet, aks-nodepool1-20293662-vmss000000  Pod sandbox changed, it will be killed and re-created.

Environment:

  • Kubernetes version (use kubectl version): v1.14.0
  • Cloud provider or hardware configuration: AKS

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 2
  • Comments: 30 (27 by maintainers)

Most upvoted comments

Woops, never mind. It’s in the docker and OCI runtime specs. Min supported is 4MB; ignore my comment.

Are you sure it’s turned into 800 bytes on linux? I’d expect this to be rounded to either 1 byte or 0 bytes if it’s handled correctly.

I think it rounds to the nearest byte, and below a certain threshold (like 4MB), containers will just not start

/remove-sig api-machinery

cc @kubernetes/api-reviewers

See previous discussion in https://github.com/kubernetes/kubernetes/issues/49442

Tightening pod API validation is not generally feasible, since that can cause existing persisted data to be treated as invalid, and in some cases, not be able to be updated in order to be deleted. See https://github.com/kubernetes/kubernetes/issues/64841

If the windows CRI cannot allocate sub-megabyte memory values, I would expect an attempt to do so to either get as close as possible (e.g. 1MB) or error.