openshift-ansible: upgrade.yml playbook fails to upgrade from 3.10.0 to 3.10.z
Description
Upgrading from v3.10.0 to the latest v3.10.z using the upgrade.yml ansible playbook fails at the "openshift_service_catalog : Verify that the catalog api server is running" task.
Version
- Ansible version (output of ansible --version):
ansible 2.6.5
config file = /etc/ansible/ansible.cfg
configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python2.7/site-packages/ansible
executable location = /usr/bin/ansible
python version = 2.7.5 (default, Jul 13 2018, 13:06:57) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]
- Output of git describe (operating from a git clone):
openshift-ansible-3.10.67-1
I am using the release-3.10 git branch here.
- Output of oc version:
oc v3.10.0+0c4577e-1
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://inst1.mydomain.com:8443
openshift v3.10.0+2084755-68
kubernetes v1.10.0+b81c8f8
Steps To Reproduce
- Install 3.10.0
- Upgrade to 3.10.z
Expected Results
I expect the upgrade to succeed, with all ansible tasks completing.
For the upgrade I use the following command:
ansible-playbook playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.yml
Observed Results
After around 25 minutes the upgrade fails with the following detailed output:
TASK [openshift_service_catalog : Verify that the catalog api server is running] ***********************************************************************************************************************************************************************************************
Tuesday 30 October 2018 23:36:42 +0100 (0:00:00.329) 0:14:37.320 *******
FAILED - RETRYING: Verify that the catalog api server is running (60 retries left).
FAILED - RETRYING: Verify that the catalog api server is running (59 retries left).
[... the same retry message repeats as the counter counts down from 58 to 3 ...]
FAILED - RETRYING: Verify that the catalog api server is running (2 retries left).
FAILED - RETRYING: Verify that the catalog api server is running (1 retries left).
fatal: [inst1.mydomain.com]: FAILED! => {"attempts": 60, "changed": false, "cmd": ["curl", "-k", "https://apiserver.kube-service-catalog.svc/healthz"], "delta": "0:00:00.283071", "end": "2018-10-30 23:47:16.335563", "msg": "non-zero return code", "rc": 6, "start": "2018-10-30 23:47:16.052492", "stderr": " % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (6) Could not resolve host: apiserver.kube-service-catalog.svc; Unknown error", "stderr_lines": [" % Total % Received % Xferd Average Speed Time Time Time Current", " Dload Upload Total Spent Left Speed", "", " 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (6) Could not resolve host: apiserver.kube-service-catalog.svc; Unknown error"], "stdout": "", "stdout_lines": []}
...ignoring
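Note that curl fails with rc 6 (could not resolve host), not a connection or TLS error: the master cannot resolve the cluster service name at all, which points at node-level DNS/SDN rather than at the catalog apiserver itself. A hedged triage sketch (standard commands, not part of the playbook; the dnsmasq detail is my assumption about the 3.10 node DNS setup):

# Re-run the exact health check the task performs:
curl -k https://apiserver.kube-service-catalog.svc/healthz
# Check whether cluster service names resolve at all; on 3.10 nodes these
# normally go through the node's local dnsmasq:
dig +short apiserver.kube-service-catalog.svc.cluster.local
# Confirm the service and its endpoints actually exist:
oc get svc,endpoints -n kube-service-catalog --config=/etc/origin/master/admin.kubeconfig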
TASK [openshift_service_catalog : Check status in the kube-service-catalog namespace] ******************************************************************************************************************************************************************************************
Tuesday 30 October 2018 23:47:16 +0100 (0:10:33.843) 0:25:11.163 *******
changed: [inst1.mydomain.com]
TASK [openshift_service_catalog : debug] ***************************************************************************************************************************************************************************************************************************************
Tuesday 30 October 2018 23:47:17 +0100 (0:00:01.062) 0:25:12.225 *******
ok: [inst1.mydomain.com] => {
"msg": [
"In project kube-service-catalog on server https://inst1.mydomain.com:8443",
"",
"https://apiserver-kube-service-catalog.router.default.svc.cluster.local (passthrough) to pod port secure (svc/apiserver)",
" daemonset/apiserver manages docker.io/openshift/origin-service-catalog:v3.10",
" generation #5 running for 6 hours - 0/1 pods growing to 1",
" pod/apiserver-qxw2d runs docker.io/openshift/origin-service-catalog:v3.10",
"",
"svc/controller-manager - 172.30.102.9:443 -> 6443",
" daemonset/controller-manager manages docker.io/openshift/origin-service-catalog:v3.10",
" generation #1 running for 6 hours - 1 pod",
" pod/controller-manager-dw7mg runs docker.io/openshift/origin-service-catalog:v3.10",
"",
"",
"2 infos identified, use 'oc status -v' to see details."
]
}
TASK [openshift_service_catalog : Get pods in the kube-service-catalog namespace] **********************************************************************************************************************************************************************************************
Tuesday 30 October 2018 23:47:17 +0100 (0:00:00.286) 0:25:12.512 *******
changed: [inst1.mydomain.com]
TASK [openshift_service_catalog : debug] ***************************************************************************************************************************************************************************************************************************************
Tuesday 30 October 2018 23:47:18 +0100 (0:00:01.086) 0:25:13.599 *******
ok: [inst1.mydomain.com] => {
"msg": [
"NAME READY STATUS RESTARTS AGE IP NODE",
"apiserver-qxw2d 0/1 ContainerCreating 0 10m <none> inst1",
"controller-manager-dw7mg 1/1 Running 3 6h 10.128.0.33 inst1"
]
}
TASK [openshift_service_catalog : Get events in the kube-service-catalog namespace] ********************************************************************************************************************************************************************************************
Tuesday 30 October 2018 23:47:19 +0100 (0:00:00.286) 0:25:13.886 *******
changed: [inst1.mydomain.com]
TASK [openshift_service_catalog : debug] ***************************************************************************************************************************************************************************************************************************************
Tuesday 30 October 2018 23:47:20 +0100 (0:00:01.010) 0:25:14.896 *******
ok: [inst1.mydomain.com] => {
"msg": [
"LAST SEEN FIRST SEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE",
"50m 50m 1 apiserver-cgr8g.156281b37d153c9d Pod Warning NetworkFailed openshift-sdn, inst1 The pod's network interface has been lost and the pod will be stopped.",
"50m 50m 1 apiserver-cgr8g.156281b3c1cb7ef3 Pod Normal SandboxChanged kubelet, inst1 Pod sandbox changed, it will be killed and re-created.",
"50m 50m 1 apiserver-cgr8g.156281b3cc4700cc Pod spec.containers{apiserver} Normal Killing kubelet, inst1 Killing container with id docker://apiserver:Need to kill Pod",
"50m 50m 1 apiserver-cgr8g.156281b4279ea7e8 Pod spec.containers{apiserver} Normal Pulled kubelet, inst1 Container image \"docker.io/openshift/origin-service-catalog:v3.10\" already present on machine",
"50m 50m 1 apiserver-cgr8g.156281b42b2b4c3c Pod spec.containers{apiserver} Normal Created kubelet, inst1 Created container",
"50m 50m 1 apiserver-cgr8g.156281b44513ec01 Pod spec.containers{apiserver} Normal Started kubelet, inst1 Started container",
"33m 33m 1 apiserver-cgr8g.1562829de08d5bbe Pod spec.containers{apiserver} Normal Killing kubelet, inst1 Killing container with id docker://apiserver:Need to kill Pod",
"33m 33m 1 apiserver-ckgrq.156282a166304ee1 Pod spec.containers{apiserver} Normal Pulled kubelet, inst1 Container image \"docker.io/openshift/origin-service-catalog:v3.10\" already present on machine",
"33m 33m 1 apiserver-ckgrq.156282a169210ad3 Pod spec.containers{apiserver} Normal Created kubelet, inst1 Created container",
"33m 33m 1 apiserver-ckgrq.156282a174869589 Pod spec.containers{apiserver} Normal Started kubelet, inst1 Started container",
"10m 10m 1 apiserver-ckgrq.156283d8f57682d9 Pod spec.containers{apiserver} Normal Killing kubelet, inst1 Killing container with id docker://apiserver:Need to kill Pod",
"10m 10m 3 apiserver-ckgrq.156283d904a6ba22 Pod Warning FailedKillPod kubelet, inst1 error killing pod: failed to \"KillPodSandbox\" for \"1fa3ff30-dc91-11e8-915b-aa0000000001\" with KillPodSandboxError: \"rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \\\"apiserver-ckgrq_kube-service-catalog\\\" network: failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: connection refused\"",
"",
"10m 10m 1 apiserver-qxw2d.156283da2c693435 Pod Warning FailedCreatePodSandBox kubelet, inst1 Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container \"a8bf450caab80a5f7807dff38c58574a5c501b7da5e49921230f2aa27ed9cc49\" network for pod \"apiserver-qxw2d\": NetworkPlugin cni failed to set up pod \"apiserver-qxw2d_kube-service-catalog\" network: failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: connection refused, failed to clean up sandbox container \"a8bf450caab80a5f7807dff38c58574a5c501b7da5e49921230f2aa27ed9cc49\" network for pod \"apiserver-qxw2d\": NetworkPlugin cni failed to teardown pod \"apiserver-qxw2d_kube-service-catalog\" network: failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: connection refused]",
"43s 10m 46 apiserver-qxw2d.156283da4e6dacd8 Pod Normal SandboxChanged kubelet, inst1 Pod sandbox changed, it will be killed and re-created.",
"33m 33m 1 apiserver.1562829dd52d21e1 DaemonSet Normal SuccessfulDelete daemonset-controller Deleted pod: apiserver-cgr8g",
"33m 33m 1 apiserver.156282a0be0459d4 DaemonSet Normal SuccessfulCreate daemonset-controller Created pod: apiserver-ckgrq",
"10m 10m 1 apiserver.156283d8e957480f DaemonSet Normal SuccessfulDelete daemonset-controller Deleted pod: apiserver-ckgrq",
"10m 10m 1 apiserver.156283d9e96976dc DaemonSet Normal SuccessfulCreate daemonset-controller Created pod: apiserver-qxw2d",
"1h 1h 1 controller-manager-dw7mg.15627f66b7cd05a8 Pod spec.containers{controller-manager} Normal Pulled kubelet, inst1 Container image \"docker.io/openshift/origin-service-catalog:v3.10\" already present on machine",
"1h 1h 1 controller-manager-dw7mg.15627f66ba821e51 Pod spec.containers{controller-manager} Normal Created kubelet, inst1 Created container",
"1h 1h 1 controller-manager-dw7mg.15627f66cae24cc0 Pod spec.containers{controller-manager} Normal Started kubelet, inst1 Started container",
"50m 50m 1 controller-manager-dw7mg.156281b389cb2a78 Pod Warning NetworkFailed openshift-sdn, inst1 The pod's network interface has been lost and the pod will be stopped.",
"50m 50m 1 controller-manager-dw7mg.156281b3c637173b Pod Normal SandboxChanged kubelet, inst1 Pod sandbox changed, it will be killed and re-created.",
"50m 50m 1 controller-manager-dw7mg.156281b3ce6a0db6 Pod spec.containers{controller-manager} Normal Killing kubelet, inst1 Killing container with id docker://controller-manager:Need to kill Pod",
"50m 50m 1 controller-manager-dw7mg.156281b45432cb45 Pod spec.containers{controller-manager} Normal Pulled kubelet, inst1 Container image \"docker.io/openshift/origin-service-catalog:v3.10\" already present on machine",
"50m 50m 1 controller-manager-dw7mg.156281b456dd1dc3 Pod spec.containers{controller-manager} Normal Created kubelet, inst1 Created container",
"50m 50m 1 controller-manager-dw7mg.156281b46b91ea63 Pod spec.containers{controller-manager} Normal Started kubelet, inst1 Started container",
"50m 50m 1 service-catalog-controller-manager.156281b488b01722 ConfigMap Normal LeaderElection service-catalog-controller-manager controller-manager-dw7mg-external-service-catalog-controller became leader"
]
}
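The events above contain the real failure: the kubelet cannot reach the SDN CNI server socket (dial unix /var/run/openshift-sdn/cni-server.sock: connection refused), so the new apiserver pod never gets a network sandbox and stays in ContainerCreating. A plausible next check, assuming the 3.10 layout where the SDN runs as a daemonset in the openshift-sdn namespace:

# Is the SDN pod on inst1 healthy?
oc get pods -n openshift-sdn -o wide --config=/etc/origin/master/admin.kubeconfig
# Does the CNI server socket exist on the node?
ls -l /var/run/openshift-sdn/cni-server.sock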
TASK [openshift_service_catalog : Get pod logs] ********************************************************************************************************************************************************************************************************************************
Tuesday 30 October 2018 23:47:20 +0100 (0:00:00.296) 0:25:15.192 *******
fatal: [inst1.mydomain.com]: FAILED! => {"changed": true, "cmd": ["oc", "logs", "daemonset/apiserver", "--tail=200", "--config=/etc/origin/master/admin.kubeconfig", "-n", "kube-service-catalog"], "delta": "0:00:00.420825", "end": "2018-10-30 23:47:21.452590", "msg": "non-zero return code", "rc": 1, "start": "2018-10-30 23:47:21.031765", "stderr": "Error from server (BadRequest): container \"apiserver\" in pod \"apiserver-qxw2d\" is waiting to start: ContainerCreating", "stderr_lines": ["Error from server (BadRequest): container \"apiserver\" in pod \"apiserver-qxw2d\" is waiting to start: ContainerCreating"], "stdout": "", "stdout_lines": []}
...ignoring
TASK [openshift_service_catalog : debug] ***************************************************************************************************************************************************************************************************************************************
Tuesday 30 October 2018 23:47:21 +0100 (0:00:01.091) 0:25:16.284 *******
ok: [inst1.mydomain.com] => {
"msg": []
}
TASK [openshift_service_catalog : Report errors] *******************************************************************************************************************************************************************************************************************************
Tuesday 30 October 2018 23:47:21 +0100 (0:00:00.282) 0:25:16.566 *******
fatal: [inst1.mydomain.com]: FAILED! => {"changed": false, "msg": "Catalog install failed."}
PLAY RECAP *********************************************************************************************************************************************************************************************************************************************************************
inst1.mydomain.com : ok=679 changed=170 unreachable=0 failed=1
inst2.mydomain.com : ok=19 changed=0 unreachable=0 failed=0
inst3.mydomain.com : ok=19 changed=0 unreachable=0 failed=0
localhost : ok=22 changed=0 unreachable=0 failed=0
Tuesday 30 October 2018 23:47:22 +0100 (0:00:00.439) 0:25:17.006 *******
===============================================================================
openshift_service_catalog : Verify that the catalog api server is running --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 633.84s
openshift_web_console : Pause for the web console deployment to start -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 30.28s
openshift_control_plane : verify API server ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 18.32s
openshift_service_catalog : oc_process --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 16.07s
openshift_node : Wait for master API to come back online --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 10.99s
Ensure openshift-ansible installer package deps are installed ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 9.80s
openshift_excluder : Get available excluder version --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 9.19s
openshift_control_plane : Wait for APIs to become available ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 8.86s
openshift_control_plane : Wait for APIs to become available ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 8.75s
openshift_node : update package meta data to speed install later. ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 8.26s
openshift_manageiq : Configure role/user permissions -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 8.21s
openshift_sdn : Copy templates to temp directory ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 6.76s
Upgrade all storage ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.95s
Migrate storage post policy reconciliation ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 5.91s
openshift_node : Update journald setup ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.90s
openshift_node_group : remove templated files --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.73s
openshift_node_group : Copy templates to temp directory ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.18s
Run health checks (upgrade) --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.06s
Run variable sanity checks ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 4.92s
openshift_node : Approve the node --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 4.51s
Failure summary:
1. Hosts: inst1.mydomain.com
Play: Upgrade Service Catalog
Task: Report errors
Message: Catalog install failed.
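Since only the service catalog verification failed, it should not be necessary to re-run the full upgrade once node networking is healthy again. A hedged sketch; the component playbook path below is my assumption about the release-3.10 checkout layout:

# Hypothetical re-run of only the service catalog install/verify steps:
ansible-playbook -i /etc/ansible/hosts playbooks/openshift-service-catalog/config.yml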
Additional Information
- OS: CentOS Linux release 7.5.1804 (Core)
- Inventory file:
[OSEv3:children]
masters
nodes
etcd
new_nodes
# Set variables common for all OSEv3 hosts
[OSEv3:vars]
# SSH user, this user should allow ssh based auth without requiring a password
ansible_ssh_user=root
# If ansible_ssh_user is not root, ansible_become must be set to true
#ansible_become=true
openshift_deployment_type=origin
# uncomment the following to enable htpasswd authentication; defaults to AllowAllPasswordIdentityProvider
#openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider'}]
# Defining htpasswd file
#openshift_master_htpasswd_file=/etc/origin/master/htpasswd
# Disable memory availability check
openshift_disable_check=memory_availability
# host group for masters
[masters]
inst1.mydomain.com
# host group for etcd
[etcd]
inst1.mydomain.com
# host group for nodes, includes region info
[nodes]
inst1.mydomain.com openshift_node_group_name='node-config-master-infra'
inst2.mydomain.com openshift_node_group_name='node-config-compute'
inst3.mydomain.com openshift_node_group_name='node-config-compute'
[new_nodes]
About this issue
- State: closed
- Created 6 years ago
- Comments: 16 (2 by maintainers)
I dug around a bit but couldn't quite figure out why my node-config.yaml wasn't being regenerated during the upgrade. I was able to work around this by stopping and starting the relevant services during the "Verify that the catalog api server is running" play (see the sketch at the end of this thread).

@vrutkovs thanks for confirming

@nagonzalez I tried again to run the upgrade.yml for v3_10 and here is the list of my /etc/origin/node directory before the upgrade: and after the failing upgrade:

As you mentioned, the node-config.yml file is missing and gets deleted during the upgrade. How could this happen? And how did you fix it? I suspect this is a bug in the upgrade.yml playbook…
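The comments above do not name the exact services that were bounced. Given the SDN errors in the events, a plausible sketch of such a workaround on the affected node (service names are assumptions for an Origin 3.10 RPM install) would be:

# Hypothetical workaround while the 'Verify that the catalog api server is
# running' task is still retrying:
systemctl restart origin-node
systemctl status origin-node
# then watch the catalog apiserver pod recover:
oc get pods -n kube-service-catalog -w --config=/etc/origin/master/admin.kubeconfig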