rancher: Failed to get job complete status for job rke-network-plugin-deploy-job
What kind of request is this (question/bug/enhancement/feature request): Bug
Steps to reproduce (least amount of steps as possible):
1. Update all firmware on nodes to current.
2. Install CentOS 7.6 v1810 - Minimal from DVD install media (written to USB flash media) to local SSDs.
3. Run the bash script included in “Other details…”.
4. Create cluster with specifications included in “Other details…”.
5. Deploy the ETCD/Control Plane node via the command copied out of the Web GUI (included in “Other details…”).
6. Deploy Worker nodes via the command copied out of the Web GUI (included in “Other details…”).
Result:
This cluster is currently Provisioning; areas that interact directly with it will not be available until the API is ready.
Failed to get job complete status for job rke-network-plugin-deploy-job in namespace kube-system
Other details that may be helpful:
Cluster Configuration:
Kubernetes Version: v1.14.1-rancher1-1
Network Provider: Weave
Cloud Provider: None (Baremetal)
Private Registry: None Specified
Authorized Cluster Endpoint: Enabled (None specified)
Advanced Cluster Options: Default except...
Docker version on nodes: Require a supported Docker version
Metrics Server Monitoring: Enabled
ETCD/CP Prep Bash Script:
#!/bin/bash
HTTPPROXY="http://********:3128"
NFSSERVER="ehs-kn-00"
REMOTENFSPATH="/media/nfs"
LOCALNFSPATH="/mnt/nfs"
TCPPORTS=(22 443 2376 2379 2380 6443 6783 9099 10250 10254)
UDPPORTS=(6783 6784 8472)
#SVCPORTSTART="30000"
#SVCPORTSTOP="32767"
FIREWALLZONE="public"
DOCKERVERSIONSTRING="18.06.3.ce-3.el7"
TASKLIST=(GenerateVars ConfigureFirewall SetProxyInfo Config_selinux IUPackages Config_YumCron ConfigureNFS AddCA ConfigureDocker RestartDatCompYo)
###################################################################################
## Don't recommend editing below this line unless you know what you are doing... ##
###################################################################################
function GenerateVars {
NFSEXPORT=$(echo -e "$NFSSERVER:$REMOTENFSPATH"$'\t'"$LOCALNFSPATH"$'\t'"nfs"$'\t'"defaults"$'\t'"0 0")
FSTABEXISTS=$(grep -F "$NFSEXPORT" /etc/fstab)
}
function ConfigureFirewall {
for i in "${TCPPORTS[@]}";
do
FIREWALLTCPCMD="${FIREWALLTCPCMD} --add-port=$i/tcp"
done
for i in "${UDPPORTS[@]}";
do
FIREWALLUDPCMD="${FIREWALLUDPCMD} --add-port=$i/udp"
done
TEMP="$FIREWALLTCPCMD $FIREWALLUDPCMD --zone=$FIREWALLZONE --permanent"
firewall-cmd $TEMP
firewall-cmd --reload
}
function SetProxyInfo {
if [ -n "$HTTPPROXY" ]; then
export http_proxy="$HTTPPROXY"
echo "http_proxy=$HTTPPROXY" >> /etc/environment
fi
}
function IUPackages {
yum install -y \
yum-cron \
yum-plugin-versionlock \
epel-release \
traceroute \
nfs-utils \
nano \
wget \
yum-utils \
device-mapper-persistent-data \
lvm2
# open-vm-tools \
# freeipmi \
# iscsi-initiator-utils \
# smartmontools
yum update -y
yum install -y \
glances
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yum install -y --setopt=obsoletes=0 \
docker-ce-$DOCKERVERSIONSTRING \
docker-ce-cli-$DOCKERVERSIONSTRING \
containerd.io \
docker-compose
yum versionlock add docker-ce
}
function Config_YumCron {
sed -i 's/apply_updates = no/apply_updates = yes/' /etc/yum/yum-cron.conf
systemctl enable yum-cron
systemctl start yum-cron
}
function Config_selinux {
sed -i 's/SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
}
function ConfigureNFS {
if [ ! -d "$LOCALNFSPATH" ]; then
mkdir -p "$LOCALNFSPATH"
fi
if [ -z "$FSTABEXISTS" ]; then
echo "$NFSEXPORT" >> /etc/fstab
fi
mount "$NFSSERVER:$REMOTENFSPATH"
}
function AddCA {
cp "$LOCALNFSPATH"/CA/* /etc/pki/ca-trust/source/anchors/
update-ca-trust
}
function ConfigureDocker {
systemctl enable docker
systemctl start docker
}
function RestartDatCompYo {
reboot
}
for i in "${TASKLIST[@]}";
do
$i
done
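(Aside: the ConfigureFirewall function above builds one large firewall-cmd invocation by string concatenation. A standalone sketch of that flag assembly, using a shortened hypothetical port list, can be run anywhere to inspect the generated arguments without touching firewalld:)

```shell
#!/bin/bash
# Standalone sketch of the flag assembly done by ConfigureFirewall above.
# The port lists here are shortened examples, not the full lists from the script.
TCPPORTS=(22 443 2376)
UDPPORTS=(8472)
FLAGS=""
for p in "${TCPPORTS[@]}"; do
  FLAGS="${FLAGS} --add-port=${p}/tcp"
done
for p in "${UDPPORTS[@]}"; do
  FLAGS="${FLAGS} --add-port=${p}/udp"
done
# The real script passes $FLAGS unquoted to firewall-cmd so it word-splits
# into separate arguments; here we only print the resulting command line.
CMDLINE="firewall-cmd${FLAGS} --zone=public --permanent"
echo "$CMDLINE"
```

Printing the assembled command first is a cheap way to verify the port list before applying it with `--permanent` and reloading.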
Worker Prep Bash Script:
#!/bin/bash
HTTPPROXY="http://***********:3128"
NFSSERVER="ehs-kn-00"
REMOTENFSPATH="/media/nfs"
LOCALNFSPATH="/mnt/nfs"
TCPPORTS=(22 80 179 443 2376 2379 2380 6443 6783 9099 10250 10251 10252 10254 10256)
UDPPORTS=(4789 6783 6784 8472)
SVCPORTSTART="30000"
SVCPORTSTOP="32767"
FIREWALLZONE="public"
DOCKERVERSIONSTRING="18.06.3.ce-3.el7"
TASKLIST=(GenerateVars ConfigureFirewall SetProxyInfo Config_selinux IUPackages Config_YumCron ConfigureNFS AddCA ConfigureDocker RestartDatCompYo)
###################################################################################
## Don't recommend editing below this line unless you know what you are doing... ##
###################################################################################
function GenerateVars {
NFSEXPORT=$(echo -e "$NFSSERVER:$REMOTENFSPATH"$'\t'"$LOCALNFSPATH"$'\t'"nfs"$'\t'"defaults"$'\t'"0 0")
FSTABEXISTS=$(grep -F "$NFSEXPORT" /etc/fstab)
}
function ConfigureFirewall {
for i in "${TCPPORTS[@]}";
do
FIREWALLTCPCMD="${FIREWALLTCPCMD} --add-port=$i/tcp"
done
for i in "${UDPPORTS[@]}";
do
FIREWALLUDPCMD="${FIREWALLUDPCMD} --add-port=$i/udp"
done
TEMP="$FIREWALLTCPCMD $FIREWALLUDPCMD --add-port=$SVCPORTSTART-$SVCPORTSTOP/tcp --add-port=$SVCPORTSTART-$SVCPORTSTOP/udp --zone=$FIREWALLZONE --permanent"
firewall-cmd $TEMP
firewall-cmd --reload
}
function SetProxyInfo {
if [ -n "$HTTPPROXY" ]; then
export http_proxy="$HTTPPROXY"
echo "http_proxy=$HTTPPROXY" >> /etc/environment
fi
}
function IUPackages {
yum install -y \
yum-cron \
yum-plugin-versionlock \
epel-release \
traceroute \
nfs-utils \
nano \
wget \
yum-utils \
device-mapper-persistent-data \
lvm2 \
open-vm-tools \
freeipmi \
iscsi-initiator-utils \
smartmontools
yum update -y
yum install -y \
glances
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yum install -y --setopt=obsoletes=0 \
docker-ce-$DOCKERVERSIONSTRING \
docker-ce-cli-$DOCKERVERSIONSTRING \
containerd.io \
docker-compose
yum versionlock add docker-ce
}
function Config_YumCron {
sed -i 's/apply_updates = no/apply_updates = yes/' /etc/yum/yum-cron.conf
systemctl enable yum-cron
systemctl start yum-cron
}
function Config_selinux {
sed -i 's/SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
}
function ConfigureNFS {
if [ ! -d "$LOCALNFSPATH" ]; then
mkdir -p "$LOCALNFSPATH"
fi
if [ -z "$FSTABEXISTS" ]; then
echo "$NFSEXPORT" >> /etc/fstab
fi
mount "$NFSSERVER:$REMOTENFSPATH"
}
function AddCA {
cp "$LOCALNFSPATH"/CA/* /etc/pki/ca-trust/source/anchors/
update-ca-trust
}
function ConfigureDocker {
systemctl enable docker
systemctl start docker
}
function RestartDatCompYo {
reboot
}
for i in "${TASKLIST[@]}";
do
$i
done
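(Aside: GenerateVars and ConfigureNFS in the scripts above cooperate to append the NFS entry to /etc/fstab only once. That idempotency logic can be exercised in isolation against a temp file, so it is safe to run anywhere:)

```shell
#!/bin/bash
# Standalone sketch of the fstab idempotency logic used by
# GenerateVars/ConfigureNFS above, run against a temp file instead of /etc/fstab.
FSTAB=$(mktemp)
ENTRY=$(printf 'ehs-kn-00:/media/nfs\t/mnt/nfs\tnfs\tdefaults\t0 0')
for run in 1 2; do
  # grep -F matches the entry literally (it contains tabs and slashes)
  if ! grep -qF "$ENTRY" "$FSTAB"; then
    echo "$ENTRY" >> "$FSTAB"
  fi
done
COUNT=$(grep -cF "$ENTRY" "$FSTAB")
echo "entry occurrences after two runs: $COUNT"
rm -f "$FSTAB"
```

Running the logic twice and counting matches confirms that re-running the prep script will not duplicate the fstab line.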
ETCD/CP Deploy Command:
sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.2.2 --server https://ehs-kn-10.*********.ad --token ******* --ca-checksum *************** --etcd --controlplane
Worker Deploy Command:
sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.2.2 --server https://ehs-kn-10.***********.ad --token ******* --ca-checksum *************** --worker
Environment information
- Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): Rancher v2.2.2
- Installation option (single install/HA): Single Install
Cluster information
- Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Custom
- Machine type (cloud/VM/metal) and specifications (CPU/memory): Metal (Workers: 16 Cores/128GB RAM; ETCD/CP: 4 Cores/16GB RAM)
- Kubernetes version (use kubectl version): Can't access kubectl; this is the configured version for the cluster as per the Web UI: v1.14.1-rancher1-1
- Docker version (use docker version):
[root@EHS-KN-13 ~]# docker version
Client:
Version: 18.06.3-ce
API version: 1.38
Go version: go1.10.3
Git commit: d7080c1
Built: Wed Feb 20 02:26:51 2019
OS/Arch: linux/amd64
Experimental: false
Server:
Engine:
Version: 18.06.3-ce
API version: 1.38 (minimum version 1.12)
Go version: go1.10.3
Git commit: d7080c1
Built: Wed Feb 20 02:28:17 2019
OS/Arch: linux/amd64
Experimental: false
[root@EHS-KN-14 ~]# docker version
Client:
Version: 18.06.3-ce
API version: 1.38
Go version: go1.10.3
Git commit: d7080c1
Built: Wed Feb 20 02:26:51 2019
OS/Arch: linux/amd64
Experimental: false
Server:
Engine:
Version: 18.06.3-ce
API version: 1.38 (minimum version 1.12)
Go version: go1.10.3
Git commit: d7080c1
Built: Wed Feb 20 02:28:17 2019
OS/Arch: linux/amd64
Experimental: false
Error information
This cluster is currently Provisioning; areas that interact directly with it will not be available until the API is ready.
Failed to get job complete status for job rke-network-plugin-deploy-job in namespace kube-system
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 4
- Comments: 34 (3 by maintainers)
Same issue; I always need to run it twice to get the cluster created because the first run always fails with that error.
Issue still exists in RKE 1.0.0
Took a closer look at this on my end - I found errors in kubelet logs which suggested I was missing the rancher/pause container in my environment. Adding rancher/pause:3.1 to the environment resolved the “Failed to get job complete status for job rke-network-plugin-deploy-job” issue for me and I was at least able to deploy the cluster. I suspect it was failing on the rke-network-plugin-deploy-job since it wasn’t able to wait for or connect to the network deployment.
I had the same issue; adjusting addon_job_timeout solved my problem.
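(For the missing rancher/pause image mentioned above, a pre-pull helper for each node might look like the following sketch. Echo mode, DRYRUN=1, is the default here so it is safe to run without Docker; setting DRYRUN=0 on a real node would actually pull:)

```shell
#!/bin/bash
# Hypothetical pre-pull helper for nodes that cannot fetch the pause image
# themselves (air-gapped or proxied environments). By default it only prints
# the docker pull commands it would run.
DRYRUN="${DRYRUN:-1}"
IMAGES=(rancher/pause:3.1)
for img in "${IMAGES[@]}"; do
  if [ "$DRYRUN" = "1" ]; then
    echo "docker pull $img"
  else
    docker pull "$img"
  fi
done
```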
In my case, one of the nodes was in a DiskPressure state.
Just had the same issue as others described. Launched a second rke up with the same settings and it completed successfully.
Decided to share one more possible issue and solution, to probably save time for someone else: a missing default route (which can happen in offline install scenarios) can also be the reason.
In that case the errors are the same, but none of the described solutions help; it is enough to configure a default route.
Sure, this is not strictly an RKE issue but rather a question of machine bootstrap and networking configuration (including the DHCP server side), but still.
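(A quick pre-flight check for the missing-default-route case described above, assuming iproute2 is available as on a stock CentOS or Ubuntu node:)

```shell
#!/bin/bash
# Report whether the node has a default route before running rke up.
# If `ip` is unavailable, the check falls through to "missing".
if ip route show default 2>/dev/null | grep -q '^default'; then
  ROUTESTATUS="present"
else
  ROUTESTATUS="missing"
fi
echo "default route: $ROUTESTATUS"
```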
This fix worked for me as well. I tried rerunning rke up multiple times but it continued to fail. Because I am testing on a single EC2 node, I had set the node address in the config to the localhost IP 127.0.0.1. I changed this to the FQDN of the instance, reran rke up (on that instance), and it finished successfully.
The fix for me, with the exact same issue @DarkRider1768 had, was to change my cluster.yml nodes[0].address from 127.0.0.1 or localhost to my computer name when doing an initial deploy on a brand new Ubuntu 18.04. I don’t know if that helps…