kubernetes: Pet sets stuck in Init state though the pods are running

Hit an issue where pet set pods switch from the Running state back to the Init state. Initially I thought the pods weren't running, but I later figured out that the containers are actually running; the container state is just not consistent with the pod state reported by the Kubernetes API.

Issue

Container State: Running
Pod State: Pending
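
The mismatch is visible directly in the pod's status object: the phase stays Pending while the container status reports running and ready. A minimal check, assuming the pod name zoo-0 from the output below (standard v1 Pod status fields queried via kubectl jsonpath):

kubectl get pod zoo-0 -o jsonpath='{.status.phase}'
kubectl get pod zoo-0 -o jsonpath='{.status.containerStatuses[0].state}'
kubectl get pod zoo-0 -o jsonpath='{.status.containerStatuses[0].ready}'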

Kubernetes Cluster Version

kubectl version
Client Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.3", GitCommit:"c6411395e09da356c608896d3d9725acab821418", GitTreeState:"clean", BuildDate:"2016-07-22T20:29:38Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.3", GitCommit:"c6411395e09da356c608896d3d9725acab821418", GitTreeState:"clean", BuildDate:"2016-07-22T20:22:25Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}

Kubernetes Get Pod State

kubectl get po -l app=zk
NAME      READY     STATUS     RESTARTS   AGE
zoo-0     0/1       Init:0/2   0          13h
zoo-1     0/1       Init:0/2   0          13h
zoo-2     0/1       Init:0/2   0          13h

Kubernetes Describe Pod State

kubectl describe po zoo-0
Name:       zoo-0
Namespace:  default
Node:       ip-1-2-3-4.region.compute.internal/10.1.2.3
Start Time: Tue, 02 Aug 2016 20:33:34 +0400
Labels:     app=zk
        name=zoo
Status:     Pending
IP:     10.24.1.2
Controllers:    PetSet/zoo
Init Containers:
  install:
    Container ID:   
    Image:      gcr.io/google_containers/zookeeper-install:0.1
    Image ID:       
    Port:       
    Args:
      --version=3.5.0-alpha
      --install-into=/opt
      --work-dir=/work-dir
    State:          Waiting
      Reason:           PodInitializing
    Ready:          False
    Restart Count:      0
    Environment Variables:  <none>
  bootstrap:
    Container ID:   
    Image:      java:openjdk-8-jre
    Image ID:       
    Port:       
    Command:
      /work-dir/peer-finder
    Args:
      -on-start="/work-dir/on-start.sh"
      -service=zk
    State:      Waiting
      Reason:       PodInitializing
    Ready:      False
    Restart Count:  0
    Environment Variables:
      POD_NAMESPACE:    default (v1:metadata.namespace)
Containers:
  zk:
    Container ID:   docker://4a2e193fa6c86559c9387a0ee473596be59cff2fb7dcd10ca9bb08c9918e6d13
    Image:      java:openjdk-8-jre
    Image ID:       docker://sha256:372859dd1c695759fe765be375346390ddd393f76fa84319630d1d64b85b9806
    Ports:      2888/TCP, 3888/TCP
    Command:
      /opt/zookeeper/bin/zkServer.sh
    Args:
      start-foreground
    State:          Running
      Started:          Tue, 02 Aug 2016 20:33:58 +0400
    Ready:          True
    Restart Count:      0
    Readiness:          exec [sh -c /opt/zookeeper/bin/zkCli.sh ls /] delay=15s timeout=5s period=10s #success=1 #failure=3
    Environment Variables:  <none>
Conditions:
  Type      Status
  Initialized   False 
  Ready     True 
  PodScheduled  True 
Volumes:
  datadir:
    Type:   PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  datadir-zoo-0
    ReadOnly:   false
  workdir:
    Type:   EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium: 
  opt:
    Type:   EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium: 
QoS Tier:   BestEffort
No events.

Container State

core@ip-1-2-3-4 ~ $ docker ps | grep zoo-0
4a2e193fa6c8        java:openjdk-8-jre                         "/opt/zookeeper/bin/z"   13 hours ago        Up 13 hours                                                    k8s_zk.484fe5b9_zoo-0_default_db0ce747-58ce-11e6-8261-02548dfad2e5_4424c4ea
1b74ed7351fd        gcr.io/google_containers/pause-amd64:3.0   "/pause"                 13 hours ago        Up 13 hours                                                    k8s_POD.85382e2a_zoo-0_default_db0ce747-58ce-11e6-8261-02548dfad2e5_18dc3c54
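
Only the main container and the pause container are running. To check whether the exited init containers (install and bootstrap) are still present on the node or have been garbage-collected, as discussed in the comments below, docker ps -a can be grepped using the k8s_<container>.<hash>_<pod>_... naming convention visible above (a hedged check, not part of the original report):

# List all containers, including exited ones, that belonged to zoo-0's init
# containers; empty output means they have been removed from the node.
docker ps -a | grep 'k8s_install.*zoo-0'
docker ps -a | grep 'k8s_bootstrap.*zoo-0'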

Node State

System Info:
 Machine ID:            ID
 System UUID:           UUID
 Boot ID:           BOOT_ID
 Kernel Version:        4.6.3-coreos
 OS Image:          CoreOS 1068.8.0 (MoreOS)
 Operating System:      linux
 Architecture:          amd64
 Container Runtime Version: docker://1.10.3
 Kubelet Version:       v1.3.3
 Kube-Proxy Version:        v1.3.3

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 12
  • Comments: 46 (23 by maintainers)

Most upvoted comments

Just did a cluster upgrade to v1.7.2 and this is still happening. I can confirm we have a number of pods created by a DaemonSet whose init containers ran their commands successfully and whose main containers are running, yet kubectl reports them as Init:0/1.

Here’s what I’ve been using to fix up pods in our system, YMMV:

#!/bin/bash
if [[ $# -ne 2 ]]; then
  echo 'Usage: ./fix-pod.sh POD_NAME INIT_CONTAINER_NAME'
  exit 1
fi
POD_NAME=$1
INIT_CONTAINER_NAME=$2
# Take the name of the pod's running container (last column of docker ps) and
# rewrite it to the name the kubelet expects for the init container.
NEW_CONTAINER_NAME=$(docker ps | grep "$POD_NAME" | head -n 1 | tr -s ' ' | sed 's/ /\n/g' | tail -n 1 | sed "s/k8s_[^.]\+\./k8s_${INIT_CONTAINER_NAME}\./")
# Create a dummy container under that name; it exits immediately with status 0,
# which is enough for the kubelet to consider the init container completed.
docker run --entrypoint /bin/bash --name "$NEW_CONTAINER_NAME" ubuntu:16.04

The first arg is the pod name (e.g. web-server-551139053-x7cmb). The second arg is the name of the init container in the pod that is supposed to have started (e.g. web-config-fetch). Seems to have fixed up everything so far; hope that helps someone.
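
For example (the pod and init-container names are the poster's illustrative ones; the sweep over all Init: pods is an added sketch and assumes every affected pod uses the same init-container name):

# Single pod; run on the node hosting it, since the script needs local docker access:
./fix-pod.sh web-server-551139053-x7cmb web-config-fetch

# Sweep every pod currently reported in an Init:* status (assumes they all share
# the init container name "web-config-fetch"; each fix still has to run on the
# node that hosts that pod):
kubectl get pods --no-headers | awk '$3 ~ /^Init:/ {print $1}' | \
  while read pod; do ./fix-pod.sh "$pod" web-config-fetch; done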

Hi,

I’ve got the exact same issue with a Deployment and an initContainer that exits correctly.

NAME                                      READY     STATUS     RESTARTS   AGE
azerty-master-webserver-959416056-dgqqf   0/2       Init:0/1   0          7m

The impact could be cosmetic for some (pods still running and functioning properly) or functional for others 😃 (scripts / automations that trust the reported status).

Regards,

We are having the same problem in one of our clusters. Steps to reproduce (a minimal sketch follows below):

  1. Create a new Deployment with one init container.
  2. Wait for it to reach the Running state (the init container has completed successfully and the main container is up and running).
  3. Delete the init container manually using the docker rm command. (When we first encountered this problem, the init container had been deleted by the Kubernetes garbage collector.)
  4. The main pod will switch to the Init state and stay there indefinitely, while the Docker container keeps running.

We are running kubernetes 1.6.3.
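
A minimal sketch of those reproduction steps, assuming kubectl access to a 1.6 cluster and docker access on the node; the deployment name init-demo, the init container name wait, and the busybox images are illustrative, not from the original report:

#!/bin/bash
# 1. Create a Deployment whose pod has one init container.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: init-demo
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: init-demo
    spec:
      initContainers:
      - name: wait
        image: busybox
        command: ["sh", "-c", "sleep 5"]
      containers:
      - name: main
        image: busybox
        command: ["sh", "-c", "sleep 3600"]
EOF

# 2. Wait for the pod to reach Running (init container completed, main container up).
kubectl get pods -l app=init-demo -w

# 3. On the node, remove the exited init container to mimic the garbage collector:
#      docker ps -a | grep k8s_wait
#      docker rm <container-id>
# 4. The pod now reports Init:0/1 indefinitely even though the main container keeps running.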

What’s the status on this? It seems like a major bug; I’m not sure how it has gone on for so long. When impacted by this bug, scaling, restarting, etc. of pods with higher ordinals does not work on a StatefulSet. That is, if you have 3 pods in your StatefulSet and they go into the Init:0/1 state, and you then lose mypod-2, Kubernetes will never replace it until you manually kill/restart mypod-0 and mypod-1 to clear the Init:0/1 state.
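
A hedged illustration of that manual remediation, using the comment's example pod names; deleting the stuck lower-ordinal pods clears their Init:0/1 state, after which the StatefulSet controller resumes creating mypod-2:

kubectl delete pod mypod-0
kubectl get pod mypod-0 -w    # wait until it is Running and Ready again
kubectl delete pod mypod-1
kubectl get pod mypod-1 -w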

While a fix is being worked on, the above workarounds have worked for us for 1.5 months. Look into anything that might be deleting or garbage-collecting the containers after the init containers have executed. Disable that if possible, and you’ll be unlikely to see this again.
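
If it is the kubelet's own container garbage collection that removes the exited init containers, its thresholds can be relaxed so exited containers survive longer. The flags below are standard kubelet options; the values are an illustrative sketch, not a recommendation from this thread:

# Keep exited containers around longer (example values; adjust to your disk budget):
#   --minimum-container-ttl-duration=24h
#   --maximum-dead-containers-per-container=2
#   --maximum-dead-containers=240
# On CoreOS, add them to the kubelet systemd unit via a drop-in, then restart:
sudo systemctl edit kubelet
sudo systemctl daemon-reload && sudo systemctl restart kubelet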

@jad007 what do you mean by “when the coreOS cluster kubernetes is running on updates itself”? Do you mean an in-place upgrade of the kubelet and docker binaries on the node? I noticed that all of the following failure cases are reported on CoreOS. That might be the cause here.

As an update to my previous comment, the pods start up in the expected state, so they will be in the “Running” phase with “ready=true”. However, when the pod is restarted (for example when the CoreOS cluster that Kubernetes is running on updates itself), the pods come back up in the bad state with the “Pending” phase, even though the container is running and ready.

We have also seen this problem with pods in 1.3.9.

The pod has an init container. It looks like the init container actually completes successfully, and later the main container starts and runs correctly (with the output of the init container). However, the pod state is stuck at Pending.

I will follow up with more information and try to isolate the problem in a way that I can share in this thread.