rancher: Failed to get job complete status for job rke-network-plugin-deploy-job

What kind of request is this (question/bug/enhancement/feature request): Bug

Steps to reproduce (least amount of steps as possible):

1. Update all firmware on nodes to current.
2. Install CentOS 7.6 v1810 - Minimal from DVD install media (written to USB flash media) to local SSDs.
3. Run the bash script included in “Other details…”.
4. Create cluster with specifications included in “Other details…”.
5. Deploy ETCD/Control Plane node via command copied out of Web GUI (included in “Other details…”).
6. Deploy Worker nodes via command copied out of Web GUI (included in “Other details…”).

Result:

This cluster is currently Provisioning; areas that interact directly with it will not be available until the API is ready.

Failed to get job complete status for job rke-network-plugin-deploy-job in namespace kube-system

Other details that may be helpful: Cluster Configuration:

Kubernetes Version: v1.14.1-rancher1-1
Network Provider: Weave
Cloud Provider: None (Baremetal)
Private Registry: None Specified
Authorized Cluster Endpoint: Enabled (None specified)
Advanced Cluster Options: Default except...
Docker version on nodes: Require a supported Docker version
Metrics Server Monitoring: Enabled

ETCD/CP Prep Bash Script:

#!/bin/bash
HTTPPROXY="http://********:3128"
NFSSERVER="ehs-kn-00"
REMOTENFSPATH="/media/nfs"
LOCALNFSPATH="/mnt/nfs"
TCPPORTS=(22 443 2376 2379 2380 6443 6783 9099 10250 10254)
UDPPORTS=(6783 6784 8472)
#SVCPORTSTART="30000"
#SVCPORTSTOP="32767"
FIREWALLZONE="public"
DOCKERVERSIONSTRING="18.06.3.ce-3.el7"
TASKLIST=(GenerateVars ConfigureFirewall SetProxyInfo Config_selinux IUPackages Config_YumCron ConfigureNFS AddCA ConfigureDocker RestartDatCompYo)

###################################################################################
## Don't recommend editing below this line unless you know what you are doing... ##
###################################################################################

function GenerateVars {
  NFSEXPORT=$(echo -e "$NFSSERVER:$REMOTENFSPATH"$'\t'"$LOCALNFSPATH"$'\t'"nfs"$'\t'"defaults"$'\t'"0 0")
  FSTABEXISTS=$(cat /etc/fstab | grep "$NFSEXPORT")
}

function ConfigureFirewall {
  for i in "${TCPPORTS[@]}";
    do
      FIREWALLTCPCMD="${FIREWALLTCPCMD} --add-port=$i/tcp"
  done
  for i in "${UDPPORTS[@]}";
    do
      FIREWALLUDPCMD="${FIREWALLUDPCMD} --add-port=$i/udp"
  done
  TEMP="$FIREWALLTCPCMD $FIREWALLUDPCMD --zone=$FIREWALLZONE --permanent"
  firewall-cmd $TEMP
  firewall-cmd --reload
}

function SetProxyInfo {
  if [ -n "$HTTPPROXY" ]; then
    export http_proxy=$HTTPPROXY
    echo "http_proxy=$HTTPPROXY" >> /etc/environment
  fi  
}

function IUPackages {
  yum install -y \
    yum-cron \
    yum-plugin-versionlock \
    epel-release \
    traceroute \
    nfs-utils \
    nano \
    wget \
    yum-utils \
    device-mapper-persistent-data \
    lvm2
#    open-vm-tools \
#    freeipmi \
#    iscsi-initiator-utils \
#    smartmontools
  yum update -y
  yum install -y \
    glances
  yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
  yum install -y --setopt=obsoletes=0 \
    docker-ce-$DOCKERVERSIONSTRING \
    docker-ce-cli-$DOCKERVERSIONSTRING \
    containerd.io \
    docker-compose
  yum versionlock add docker-ce
}

function Config_YumCron {
  sed -i 's/apply_updates = no/apply_updates = yes/' /etc/yum/yum-cron.conf
  systemctl enable yum-cron
  systemctl start yum-cron
}

function Config_selinux {
  sed -i 's/SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
}

function ConfigureNFS {
  if [ ! -d "$LOCALNFSPATH" ]; then 
    mkdir $LOCALNFSPATH
  fi
  if [ -z "$FSTABEXISTS" ]; then
    echo "$NFSEXPORT" >> /etc/fstab
  fi
  mount "$NFSSERVER:$REMOTENFSPATH"
}

function AddCA {
  cp $LOCALNFSPATH/CA/* /etc/pki/ca-trust/source/anchors/
  update-ca-trust
}

function ConfigureDocker {
  systemctl enable docker
  systemctl start docker
}

 function RestartDatCompYo {
   reboot
 }

for i in ${TASKLIST[@]};
  do
    $i
done

Worker Prep Bash Script:

#!/bin/bash
HTTPPROXY="http://***********:3128"
NFSSERVER="ehs-kn-00"
REMOTENFSPATH="/media/nfs"
LOCALNFSPATH="/mnt/nfs"
TCPPORTS=(22 80 179 443 2376 2379 2380 6443 6783 9099 10250 10251 10252 10254 10256)
UDPPORTS=(4789 6783 6784 8472)
SVCPORTSTART="30000"
SVCPORTSTOP="32767"
FIREWALLZONE="public"
DOCKERVERSIONSTRING="18.06.3.ce-3.el7"
TASKLIST=(GenerateVars ConfigureFirewall SetProxyInfo Config_selinux IUPackages Config_YumCron ConfigureNFS AddCA ConfigureDocker RestartDatCompYo)

###################################################################################
## Don't recommend editing below this line unless you know what you are doing... ##
###################################################################################

function GenerateVars {
  NFSEXPORT=$(echo -e "$NFSSERVER:$REMOTENFSPATH"$'\t'"$LOCALNFSPATH"$'\t'"nfs"$'\t'"defaults"$'\t'"0 0")
  FSTABEXISTS=$(cat /etc/fstab | grep "$NFSEXPORT")
}

function ConfigureFirewall {
  for i in "${TCPPORTS[@]}";
    do
      FIREWALLTCPCMD="${FIREWALLTCPCMD} --add-port=$i/tcp"
  done
  for i in "${UDPPORTS[@]}";
    do
      FIREWALLUDPCMD="${FIREWALLUDPCMD} --add-port=$i/udp"
  done
  TEMP="$FIREWALLTCPCMD $FIREWALLUDPCMD --add-port=$SVCPORTSTART-$SVCPORTSTOP/tcp --add-port=$SVCPORTSTART-$SVCPORTSTOP/udp --zone=$FIREWALLZONE --permanent"
  firewall-cmd $TEMP
  firewall-cmd --reload
}

function SetProxyInfo {
  if [ -n "$HTTPPROXY" ]; then
    export http_proxy=$HTTPPROXY
    echo "http_proxy=$HTTPPROXY" >> /etc/environment
  fi  
}

function IUPackages {
  yum install -y \
    yum-cron \
    yum-plugin-versionlock \
    epel-release \
    traceroute \
    nfs-utils \
    nano \
    wget \
    yum-utils \
    device-mapper-persistent-data \
    lvm2 \
    open-vm-tools \
    freeipmi \
    iscsi-initiator-utils \
    smartmontools
  yum update -y
  yum install -y \
    glances
  yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
  yum install -y --setopt=obsoletes=0 \
    docker-ce-$DOCKERVERSIONSTRING \
    docker-ce-cli-$DOCKERVERSIONSTRING \
    containerd.io \
    docker-compose
  yum versionlock add docker-ce
}

function Config_YumCron {
  sed -i 's/apply_updates = no/apply_updates = yes/' /etc/yum/yum-cron.conf
  systemctl enable yum-cron
  systemctl start yum-cron
}

function Config_selinux {
  sed -i 's/SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
}

function ConfigureNFS {
  if [ ! -d "$LOCALNFSPATH" ]; then 
    mkdir $LOCALNFSPATH
  fi
  if [ -z "$FSTABEXISTS" ]; then
    echo "$NFSEXPORT" >> /etc/fstab
  fi
  mount "$NFSSERVER:$REMOTENFSPATH"
}

function AddCA {
  cp $LOCALNFSPATH/CA/* /etc/pki/ca-trust/source/anchors/
  update-ca-trust
}

function ConfigureDocker {
  systemctl enable docker
  systemctl start docker
}

 function RestartDatCompYo {
   reboot
 }

for i in ${TASKLIST[@]};
  do
    $i
done

ETCD/CP Deploy Command:

sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.2.2 --server https://ehs-kn-10.*********.ad --token ******* --ca-checksum *************** --etcd --controlplane

Worker Deploy Command:

sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.2.2 --server https://ehs-kn-10.***********.ad --token ******* --ca-checksum *************** --worker

Environment information

Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): Rancher v2.2.2
Installation option (single install/HA): Single Install

Cluster information

Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Custom
Machine type (cloud/VM/metal) and specifications (CPU/memory): Metal (Workers: 16 Cores/128GB RAM; ETCD/CP: 4 Cores/16GB RAM)
Kubernetes version (use kubectl version):

Can't access kubectl, this is the configured version for the cluster as per the Web UI.
v1.14.1-rancher1-1

Docker version (use docker version):

[root@EHS-KN-13 ~]# docker version
Client:
 Version:           18.06.3-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        d7080c1
 Built:             Wed Feb 20 02:26:51 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.3-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       d7080c1
  Built:            Wed Feb 20 02:28:17 2019
  OS/Arch:          linux/amd64
  Experimental:     false

[root@EHS-KN-14 ~]# docker version
Client:
 Version:           18.06.3-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        d7080c1
 Built:             Wed Feb 20 02:26:51 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.3-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       d7080c1
  Built:            Wed Feb 20 02:28:17 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Error information

This cluster is currently Provisioning; areas that interact directly with it will not be available until the API is ready.

Failed to get job complete status for job rke-network-plugin-deploy-job in namespace kube-system

About this issue

State: closed
Created 5 years ago
Reactions: 4
Comments: 34 (3 by maintainers)

Most upvoted comments

Same issue. I always need to run it twice to get the cluster created, because the first run always fails with that error.

+31

iahmad-khan on Aug 10, 2019

Issue still exists in RKE 1.0.0

+11

xkidro on Nov 27, 2019

Took a closer look at this on my end - I found errors in kubelet logs which suggested I was missing the rancher/pause container in my environment.

Adding rancher/pause:3.1 to the environment resolved the “Failed to get job complete status for job rke-network-plugin-deploy-job” issue for me and I was at least able to deploy the cluster. I suspect it was failing on the rke-network-plugin-deploy-job since it wasn’t able to wait for or connect to the network deployment.

CJRChang on Jul 5, 2019
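The check described in the comment above can be sketched as a small shell helper. This is a minimal sketch, not part of the original thread: `missing_images` is a hypothetical function name, and it assumes the list of locally available images is supplied on stdin, one `repo:tag` per line. It only reports which required images are absent; it does not pull or load anything.

```shell
#!/bin/bash
# Hypothetical helper: print each required image that does not appear in the
# list of locally present images read from stdin (one "repo:tag" per line).
missing_images() {
  local present image
  present=$(cat)
  for image in "$@"; do
    # -x matches the whole line, -F treats the image name as a fixed string.
    if ! grep -qxF -- "$image" <<<"$present"; then
      echo "$image"
    fi
  done
}

# Assumed usage on a node (requires Docker):
#   docker images --format '{{.Repository}}:{{.Tag}}' | missing_images rancher/pause:3.1
```

If the helper prints `rancher/pause:3.1`, the image would need to be pulled or loaded on that node before retrying the deployment.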

I had the same issue, and these two steps solved my problem

Increase addon_job_timeout
Check node free space (at least 15%)

In my case, one of the nodes had DiskPressure state

AliMD on May 16, 2020
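The free-space check from the comment above could be scripted roughly as follows. This is a sketch under stated assumptions: `low_space_mounts` is a hypothetical name, the 15% threshold is taken from the comment rather than from any RKE documentation, and the parsing assumes POSIX-format `df -P` output with mount points that contain no spaces.

```shell
#!/bin/bash
# Hypothetical helper: read `df -P` output on stdin and print every mount
# point whose free space is below the given percentage (default 15).
low_space_mounts() {
  local min_free="${1:-15}"
  awk -v min="$min_free" '
    NR > 1 {
      used = $5
      sub(/%/, "", used)            # "91%" -> "91"
      if (100 - used < min) print $6
    }'
}

# Assumed usage on a node:
#   df -P | low_space_mounts 15
```

Any mount point it prints is a candidate for the DiskPressure condition mentioned in the comment.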

Just had the same issue as others described. Launched a second rke up with the same settings and it completed successfully.

LaurentDumont on Nov 28, 2019
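The "run it a second time" workaround reported above can be wrapped in a generic retry helper. This is only a sketch: `retry` is a hypothetical function, not an RKE feature, and retrying hides the underlying cause rather than fixing it.

```shell
#!/bin/bash
# Hypothetical helper: run a command up to N times, stopping at the first success.
retry() {
  local attempts="$1"
  shift
  local n
  for ((n = 1; n <= attempts; n++)); do
    if "$@"; then
      return 0
    fi
    echo "attempt $n of $attempts failed" >&2
  done
  return 1
}

# Assumed usage with the RKE CLI:
#   retry 2 rke up --config cluster.yml
```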

I'll share one more possible cause and solution, in case it saves someone time: a missing default route (which can occur in offline install scenarios) can also trigger this.

In that case the errors are the same, but none of the solutions described above help; configuring a default route is enough.

Granted, this is less an RKE issue than a question of machine bootstrap and network configuration (including the DHCP server side), but still.

pa-yourserveradmin-com on Nov 25, 2020
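A quick way to test for the missing default route described above is sketched below. `has_default_route` is a hypothetical helper that assumes the routing table is passed on stdin in the format printed by `ip route`.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the routing table read from stdin
# (as printed by `ip route`) contains a default route.
has_default_route() {
  grep -q '^default '
}

# Assumed usage on a node:
#   ip -4 route show | has_default_route || echo "no default route configured" >&2
```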

The fix for me with the exact same issue @DarkRider1768 was to change my cluster.yml nodes[0].address from 127.0.0.1 or localhost to my computer name when doing an initial deploy on a brand new Ubuntu 18.04. I don’t know if that helps…

This fix worked for me as well. I tried rerunning rke up multiple times but it continued to fail. Because I am testing on a single EC2 node, I had set the node address in the config to the localhost IP, 127.0.0.1. I changed this to the FQDN of the instance, reran rke up (on that instance), and it finished successfully.

balancespring on Dec 10, 2019

The fix for me with the exact same issue @DarkRider1768 was to change my cluster.yml nodes[0].address from 127.0.0.1 or localhost to my computer name when doing an initial deploy on a brand new Ubuntu 18.04. I don’t know if that helps…

CaptEmulation on Jun 29, 2019
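The loopback-address problem described in the last two comments can be spotted before running rke up with a simple scan of the config file. This is a minimal sketch: `loopback_addresses` is a hypothetical helper, and the pattern assumes each node's `address:` key sits on its own line in cluster.yml, as in the usual RKE layout.

```shell
#!/bin/bash
# Hypothetical helper: print the lines (with line numbers) of an RKE
# cluster.yml whose node `address` is a loopback value (127.x.x.x or localhost).
loopback_addresses() {
  local config="${1:-cluster.yml}"
  grep -nE '^[[:space:]]*-?[[:space:]]*address:[[:space:]]*"?(127\.[0-9.]+|localhost)"?[[:space:]]*$' "$config"
}

# Assumed usage:
#   loopback_addresses cluster.yml && echo "replace these with the hostname or FQDN" >&2
```

The helper exits non-zero when no loopback address is found, so it can also be used as a pre-flight check in a wrapper script.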