origin: etcd crashing on upgraded 3.11 instance due to failing liveness probe

After starting the master, the API server dies after some time. The issue appears to be related to etcd: the etcd pod crashes continuously until it takes the API server down with it, which usually takes about 10 minutes. The etcd logs show:

2018-12-04 15:27:30.544094 I | pkg/flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=https://X.X.X.X:2379
2018-12-04 15:27:30.544551 I | pkg/flags: recognized and used environment variable ETCD_CERT_FILE=/etc/etcd/server.crt
2018-12-04 15:27:30.544565 I | pkg/flags: recognized and used environment variable ETCD_CLIENT_CERT_AUTH=true
2018-12-04 15:27:30.544581 I | pkg/flags: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd/
2018-12-04 15:27:30.544589 I | pkg/flags: recognized and used environment variable ETCD_DEBUG=False
2018-12-04 15:27:30.544602 I | pkg/flags: recognized and used environment variable ETCD_ELECTION_TIMEOUT=2500
2018-12-04 15:27:30.544620 I | pkg/flags: recognized and used environment variable ETCD_HEARTBEAT_INTERVAL=500
2018-12-04 15:27:30.544633 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=https://10.0.4.57:2380
2018-12-04 15:27:30.544648 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=new
2018-12-04 15:27:30.544667 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-1
2018-12-04 15:27:30.544677 I | pkg/flags: recognized and used environment variable ETCD_KEY_FILE=/etc/etcd/server.key
2018-12-04 15:27:30.544687 I | pkg/flags: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=https://10.0.4.57:2379
2018-12-04 15:27:30.544696 I | pkg/flags: recognized and used environment variable ETCD_LISTEN_PEER_URLS=https://10.0.4.57:2380
2018-12-04 15:27:30.544723 I | pkg/flags: recognized and used environment variable ETCD_NAME=ip-10-0-4-57.ec2.internal
2018-12-04 15:27:30.544734 I | pkg/flags: recognized and used environment variable ETCD_PEER_CERT_FILE=/etc/etcd/peer.crt
2018-12-04 15:27:30.544742 I | pkg/flags: recognized and used environment variable ETCD_PEER_CLIENT_CERT_AUTH=true
2018-12-04 15:27:30.544752 I | pkg/flags: recognized and used environment variable ETCD_PEER_KEY_FILE=/etc/etcd/peer.key
2018-12-04 15:27:30.544761 I | pkg/flags: recognized and used environment variable ETCD_PEER_TRUSTED_CA_FILE=/etc/etcd/ca.crt
2018-12-04 15:27:30.544778 I | pkg/flags: recognized and used environment variable ETCD_QUOTA_BACKEND_BYTES=4294967296
2018-12-04 15:27:30.544794 I | pkg/flags: recognized and used environment variable ETCD_TRUSTED_CA_FILE=/etc/etcd/ca.crt
2018-12-04 15:27:30.544828 W | pkg/flags: unrecognized environment variable ETCD_INITIAL_CLUSTER=
2018-12-04 15:27:30.544868 I | etcdmain: etcd Version: 3.2.22
2018-12-04 15:27:30.544876 I | etcdmain: Git SHA: 1674e682f
2018-12-04 15:27:30.544881 I | etcdmain: Go Version: go1.8.7
2018-12-04 15:27:30.544887 I | etcdmain: Go OS/Arch: linux/amd64
2018-12-04 15:27:30.544892 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2018-12-04 15:27:30.544957 W | etcdmain: found invalid file/dir openshift-backup-post-3.0-20180526200403 under data dir /var/lib/etcd/ (Ignore this if you are upgrading etcd)
2018-12-04 15:27:30.544965 W | etcdmain: found invalid file/dir openshift-backup-post-3.0-20180526200704 under data dir /var/lib/etcd/ (Ignore this if you are upgrading etcd)
2018-12-04 15:27:30.544971 W | etcdmain: found invalid file/dir openshift-backup-post-3.0-20180811182910 under data dir /var/lib/etcd/ (Ignore this if you are upgrading etcd)
2018-12-04 15:27:30.544976 W | etcdmain: found invalid file/dir openshift-backup-post-3.0-20181126013050 under data dir /var/lib/etcd/ (Ignore this if you are upgrading etcd)
2018-12-04 15:27:30.544981 W | etcdmain: found invalid file/dir openshift-backup-pre-upgrade-20180526200357 under data dir /var/lib/etcd/ (Ignore this if you are upgrading etcd)
2018-12-04 15:27:30.544990 W | etcdmain: found invalid file/dir openshift-backup-pre-upgrade-20180526200658 under data dir /var/lib/etcd/ (Ignore this if you are upgrading etcd)
2018-12-04 15:27:30.544996 W | etcdmain: found invalid file/dir openshift-backup-pre-upgrade-20180811182818 under data dir /var/lib/etcd/ (Ignore this if you are upgrading etcd)
2018-12-04 15:27:30.545001 W | etcdmain: found invalid file/dir openshift-backup-pre-upgrade-20181126012952 under data dir /var/lib/etcd/ (Ignore this if you are upgrading etcd)
2018-12-04 15:27:30.545012 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2018-12-04 15:27:30.545043 I | embed: peerTLS: cert = /etc/etcd/peer.crt, key = /etc/etcd/peer.key, ca = , trusted-ca = /etc/etcd/ca.crt, client-cert-auth = true
2018-12-04 15:27:30.546220 I | embed: listening for peers on https://X.X.X.X:2380
2018-12-04 15:27:30.546300 I | embed: listening for client requests on X.X.X.X:2379
2018-12-04 15:27:30.561379 I | etcdserver: recovered store from snapshot at index 64300643
2018-12-04 15:27:30.564654 I | mvcc: restore compact to 57105449
2018-12-04 15:27:30.614314 I | etcdserver: name = ip-X-X-X-X
2018-12-04 15:27:30.614364 I | etcdserver: data dir = /var/lib/etcd/
2018-12-04 15:27:30.614387 I | etcdserver: member dir = /var/lib/etcd/member
2018-12-04 15:27:30.614398 I | etcdserver: heartbeat = 500ms
2018-12-04 15:27:30.614404 I | etcdserver: election = 2500ms
2018-12-04 15:27:30.614410 I | etcdserver: snapshot count = 100000
2018-12-04 15:27:30.614437 I | etcdserver: advertise client URLs = https://X.X.X.X:2379
2018-12-04 15:27:30.769559 I | etcdserver: restarting member ccd4cb5684c5b1d4 in cluster d36b3a36535103f8 at commit index 64316408
2018-12-04 15:27:30.772214 I | raft: ccd4cb5684c5b1d4 became follower at term 109
2018-12-04 15:27:30.772263 I | raft: newRaft ccd4cb5684c5b1d4 [peers: [ccd4cb5684c5b1d4], term: 109, commit: 64316408, applied: 64300643, lastindex: 64316408, lastterm: 109]
2018-12-04 15:27:30.773451 I | etcdserver/api: enabled capabilities for version 3.2
2018-12-04 15:27:30.773487 I | etcdserver/membership: added member ccd4cb5684c5b1d4 [https://10.0.4.57:2380] to cluster d36b3a36535103f8 from store
2018-12-04 15:27:30.773500 I | etcdserver/membership: set the cluster version to 3.2 from store
2018-12-04 15:27:30.793965 I | mvcc: restore compact to 57105449
2018-12-04 15:27:30.827545 W | auth: simple token is not cryptographically signed
2018-12-04 15:27:30.833275 I | etcdserver: starting server... [version: 3.2.22, cluster version: 3.2]
2018-12-04 15:27:30.835797 I | embed: ClientTLS: cert = /etc/etcd/server.crt, key = /etc/etcd/server.key, ca = , trusted-ca = /etc/etcd/ca.crt, client-cert-auth = true
2018-12-04 15:27:30.837262 I | etcdserver: ccd4cb5684c5b1d4 as single-node; fast-forwarding 4 ticks (election ticks 5)
2018-12-04 15:27:31.773948 I | raft: ccd4cb5684c5b1d4 is starting a new election at term 109
2018-12-04 15:27:31.774054 I | raft: ccd4cb5684c5b1d4 became candidate at term 110
2018-12-04 15:27:31.774105 I | raft: ccd4cb5684c5b1d4 received MsgVoteResp from ccd4cb5684c5b1d4 at term 110
2018-12-04 15:27:31.774124 I | raft: ccd4cb5684c5b1d4 became leader at term 110
2018-12-04 15:27:31.774139 I | raft: raft.node: ccd4cb5684c5b1d4 elected leader ccd4cb5684c5b1d4 at term 110
2018-12-04 15:27:31.775580 I | etcdserver: published {Name:ip-X-X-X-X ClientURLs:[https://X.X.X.X:2379]} to cluster d36b3a36535103f8
2018-12-04 15:27:31.775647 I | embed: ready to serve client requests
2018-12-04 15:27:31.776818 I | embed: serving client requests on 10.0.4.57:2379
WARNING: 2018/12/04 15:27:32 Failed to dial X.X.X.X:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.
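
Most of the startup output above is routine; the lines worth a second look are the warnings and the final gRPC `WARNING:` about a bad certificate. A minimal sketch for pulling those out of a saved log follows. It is not part of the original report: the regular expression is inferred only from the lines shown here, and `interesting_lines` is a hypothetical helper name.

```python
import re

# Assumed shape of an etcd 3.2 log line, inferred from the excerpt above:
#   "2018-12-04 15:27:30.544828 W | pkg/flags: message"
ETCD_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<level>[A-Z]) \| (?P<pkg>[^:]+): (?P<msg>.*)$"
)


def interesting_lines(lines, levels=("W", "E", "C")):
    """Yield (level, text) for etcd lines at the given levels.

    Lines that start with "WARNING:" (the gRPC client format seen at the
    end of the excerpt) are reported as level "W" as well.
    """
    for line in lines:
        line = line.rstrip("\n")
        match = ETCD_LINE.match(line)
        if match and match.group("level") in levels:
            yield match.group("level"), f"{match.group('pkg')}: {match.group('msg')}"
        elif line.startswith("WARNING:"):
            yield "W", line[len("WARNING:"):].strip()
```

Run over the excerpt above, this would surface the `unrecognized environment variable ETCD_INITIAL_CLUSTER=` warning, the backup-directory warnings, the `auth: simple token` warning, and the failed TLS dial.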

Here are the logs from the crashed API server:

I1204 15:37:14.749434       1 feature_gate.go:194] feature gates: map[OriginatingIdentity:true]
I1204 15:37:14.751323       1 feature_gate.go:194] feature gates: map[OriginatingIdentity:true NamespacedServiceBroker:true]
I1204 15:37:14.752882       1 hyperkube.go:192] Service Catalog version v3.11.0+58d854a-38;Upstream:v0.1.35 (built 2018-10-19T16:26:14Z)
W1204 15:37:15.953979       1 authentication.go:245] Unable to get configmap/extension-apiserver-authentication in kube-system.  Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
Error: Get https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.30.0.1:443: connect: connection refused
[root@ip-10-0-4-57 ~]# docker logs e88bce8ff562
E1204 15:37:14.571957       1 helpers.go:134] Encountered config error json: unknown field "masterCount" in object *config.MasterConfig, raw JSON:
{"admissionConfig":{"pluginConfig":{"BuildDefaults":{"configuration":{"apiVersion":"v1","env":[],"kind":"BuildDefaultsConfig","resources":{"limits":{},"requests":{}}}},"BuildOverrides":{"configuration":{"apiVersion":"v1","kind":"BuildOverridesConfig"}},"openshift.io/ImagePolicy":{"configuration":{"apiVersion":"v1","executionRules":[{"matchImageAnnotations":[{"key":"images.openshift.io/deny-execution","value":"true"}],"name":"execution-denied","onResources":[{"resource":"pods"},{"resource":"builds"}],"reject":true,"skipOnResolutionFailure":true}],"kind":"ImagePolicyConfig"}}}},"aggregatorConfig":{"proxyClientInfo":{"certFile":"aggregator-front-proxy.crt","keyFile":"aggregator-front-proxy.key"}},"apiLevels":["v1"],"apiVersion":"v1","authConfig":{"requestHeader":{"clientCA":"front-proxy-ca.crt","clientCommonNames":["aggregator-front-proxy"],"extraHeaderPrefixes":["X-Remote-Extra-"],"groupHeaders":["X-Remote-Group"],"usernameHeaders":["X-Remote-User"]}},"controllerConfig":{"election":{"lockName":"openshift-master-controllers"},"serviceServingCert":{"signer":{"certFile":"service-signer.crt","keyFile":"service-signer.key"}}},"controllers":"*","corsAllowedOrigins":["(?i)//127\\.0\\.0\\.1(:|\\z)","(?i)//localhost(:|\\z)","(?i)//10\\.0\\.4\\.57(:|\\z)","(?i)//openshift\\.default\\.svc(:|\\z)","(?i)//os\\-int\\.tremolo\\.local(:|\\z)","(?i)//kubernetes\\.default\\.svc\\.cluster\\.local(:|\\z)","(?i)//kubernetes(:|\\z)","(?i)//openshift\\.default(:|\\z)","(?i)//kubernetes\\.default(:|\\z)","(?i)//172\\.30\\.0\\.1(:|\\z)","(?i)//ip\\-10\\-0\\-4\\-57\\.ec2\\.internal(:|\\z)","(?i)//os\\.tremolo\\.io(:|\\z)","(?i)//openshift\\.default\\.svc\\.cluster\\.local(:|\\z)","(?i)//kubernetes\\.default\\.svc(:|\\z)","(?i)//openshift(:|\\z)"],"dnsConfig":{"bindAddress":"0.0.0.0:8053","bindNetwork":"tcp4"},"etcdClientInfo":{"ca":"master.etcd-ca.crt","certFile":"master.etcd-client.crt","keyFile":"master.etcd-client.key","urls":["https://ip-10-0-4-57.ec2.internal:2379"]},"etcdStorageConfig":{
"kubernetesStoragePrefix":"kubernetes.io","kubernetesStorageVersion":"v1","openShiftStoragePrefix":"openshift.io","openShiftStorageVersion":"v1"},"imageConfig":{"format":"docker.io/openshift/origin-${component}:${version}","latest":false},"imagePolicyConfig":{"internalRegistryHostname":"docker-registry.default.svc:5000"},"kind":"MasterConfig","kubeletClientInfo":{"ca":"ca-bundle.crt","certFile":"master.kubelet-client.crt","keyFile":"master.kubelet-client.key","port":10250},"kubernetesMasterConfig":{"apiServerArguments":{"cloud-config":["/etc/origin/cloudprovider/aws.conf"],"cloud-provider":["aws"],"runtime-config":[],"storage-backend":["etcd3"],"storage-media-type":["application/vnd.kubernetes.protobuf"]},"controllerArguments":{"cloud-config":["/etc/origin/cloudprovider/aws.conf"],"cloud-provider":["aws"],"cluster-signing-cert-file":["/etc/origin/master/ca.crt"],"cluster-signing-key-file":["/etc/origin/master/ca.key"],"pv-recycler-pod-template-filepath-hostpath":["/etc/origin/master/recycler_pod.yaml"],"pv-recycler-pod-template-filepath-nfs":["/etc/origin/master/recycler_pod.yaml"]},"masterCount":1,"masterIP":"10.0.4.57","podEvictionTimeout":null,"proxyClientInfo":{"certFile":"master.proxy-client.crt","keyFile":"master.proxy-client.key"},"schedulerArguments":null,"schedulerConfigFile":"/etc/origin/master/scheduler.json","servicesNodePortRange":"","servicesSubnet":"172.30.0.0/16","staticNodeNames":[]},"masterClients":{"externalKubernetesClientConnectionOverrides":{"acceptContentTypes":"application/vnd.kubernetes.protobuf,application/json","burst":400,"contentType":"application/vnd.kubernetes.protobuf","qps":200},"externalKubernetesKubeConfig":"","openshiftLoopbackClientConnectionOverrides":{"acceptContentTypes":"application/vnd.kubernetes.protobuf,application/json","burst":600,"contentType":"application/vnd.kubernetes.protobuf","qps":300},"openshiftLoopbackKubeConfig":"openshift-master.kubeconfig"},"masterPublicURL":"https://os.tremolo.io","networkConfig":{"clusterNe
tworks":[{"cidr":"10.128.0.0/14","hostSubnetLength":9}],"externalIPNetworkCIDRs":["0.0.0.0/0"],"networkPluginName":"redhat/openshift-ovs-subnet","serviceNetworkCIDR":"172.30.0.0/16"},"oauthConfig":{"assetPublicURL":"https://os.tremolo.io/console/","grantConfig":{"method":"auto"},"identityProviders":[{"challenge":false,"login":true,"mappingMethod":"claim","name":"unison","provider":{"apiVersion":"v1","claims":{"email":["email"],"id":["sub"],"name":["name"],"preferredUsername":["preferred_username"]},"clientID":"openshift","clientSecret":"8ncdvSgw6JUtRoI95P3c4ANcxHPwq7LYbz8s7NNQfnga0KNJIhApVVge0KSaxC4","kind":"OpenIDIdentityProvider","urls":{"authorize":"https://apps.tremolosecurity.com/auth/idp/OpenShiftIdP/auth","token":"https://apps.tremolosecurity.com/auth/idp/OpenShiftIdP/token"}}}],"masterCA":"ca-bundle.crt","masterPublicURL":"https://os.tremolo.io","masterURL":"https://os-int.tremolo.local","sessionConfig":{"sessionMaxAgeSeconds":3600,"sessionName":"ssn","sessionSecretsFile":"/etc/origin/master/session-secrets.yaml"},"tokenConfig":{"accessTokenMaxAgeSeconds":86400,"authorizeTokenMaxAgeSeconds":500}},"pauseControllers":false,"policyConfig":{"bootstrapPolicyFile":"/etc/origin/master/policy.json","openshiftInfrastructureNamespace":"openshift-infra","openshiftSharedResourcesNamespace":"openshift"},"projectConfig":{"defaultNodeSelector":"node-role.kubernetes.io/compute=true","projectRequestMessage":"","projectRequestTemplate":"","securityAllocator":{"mcsAllocatorRange":"s0:/2","mcsLabelsPerProject":5,"uidAllocatorRange":"1000000000-1999999999/10000"}},"routingConfig":{"subdomain":"router.default.svc.cluster.local"},"serviceAccountConfig":{"limitSecretReferences":false,"managedNames":["default","builder","deployer"],"masterCA":"ca-bundle.crt","privateKeyFile":"serviceaccounts.private.key","publicKeyFiles":["serviceaccounts.public.key"]},"servingInfo":{"bindAddress":"0.0.0.0:443","bindNetwork":"tcp4","certFile":"master.server.crt","clientCA":"ca.crt","keyFile":"master
.server.key","maxRequestsInFlight":500,"requestTimeoutSeconds":3600},"volumeConfig":{"dynamicProvisioningEnabled":true}}
I1204 15:37:14.586394       1 plugins.go:84] Registered admission plugin "NamespaceLifecycle"
I1204 15:37:14.586436       1 plugins.go:84] Registered admission plugin "Initializers"
I1204 15:37:14.586448       1 plugins.go:84] Registered admission plugin "ValidatingAdmissionWebhook"
I1204 15:37:14.586504       1 plugins.go:84] Registered admission plugin "MutatingAdmissionWebhook"
I1204 15:37:14.586516       1 plugins.go:84] Registered admission plugin "AlwaysAdmit"
I1204 15:37:14.586525       1 plugins.go:84] Registered admission plugin "AlwaysPullImages"
I1204 15:37:14.586535       1 plugins.go:84] Registered admission plugin "LimitPodHardAntiAffinityTopology"
I1204 15:37:14.586546       1 plugins.go:84] Registered admission plugin "DefaultTolerationSeconds"
I1204 15:37:14.586555       1 plugins.go:84] Registered admission plugin "AlwaysDeny"
I1204 15:37:14.586568       1 plugins.go:84] Registered admission plugin "EventRateLimit"
I1204 15:37:14.586579       1 plugins.go:84] Registered admission plugin "DenyEscalatingExec"
I1204 15:37:14.586587       1 plugins.go:84] Registered admission plugin "DenyExecOnPrivileged"
I1204 15:37:14.586597       1 plugins.go:84] Registered admission plugin "ExtendedResourceToleration"
I1204 15:37:14.586605       1 plugins.go:84] Registered admission plugin "OwnerReferencesPermissionEnforcement"
I1204 15:37:14.586618       1 plugins.go:84] Registered admission plugin "ImagePolicyWebhook"
I1204 15:37:14.586629       1 plugins.go:84] Registered admission plugin "LimitRanger"
I1204 15:37:14.586640       1 plugins.go:84] Registered admission plugin "NamespaceAutoProvision"
I1204 15:37:14.586650       1 plugins.go:84] Registered admission plugin "NamespaceExists"
I1204 15:37:14.586660       1 plugins.go:84] Registered admission plugin "NodeRestriction"
I1204 15:37:14.586671       1 plugins.go:84] Registered admission plugin "PersistentVolumeLabel"
I1204 15:37:14.586681       1 plugins.go:84] Registered admission plugin "PodNodeSelector"
I1204 15:37:14.586691       1 plugins.go:84] Registered admission plugin "PodPreset"
I1204 15:37:14.586701       1 plugins.go:84] Registered admission plugin "PodTolerationRestriction"
I1204 15:37:14.586712       1 plugins.go:84] Registered admission plugin "ResourceQuota"
I1204 15:37:14.586722       1 plugins.go:84] Registered admission plugin "PodSecurityPolicy"
I1204 15:37:14.586730       1 plugins.go:84] Registered admission plugin "Priority"
I1204 15:37:14.586741       1 plugins.go:84] Registered admission plugin "SecurityContextDeny"
I1204 15:37:14.586752       1 plugins.go:84] Registered admission plugin "ServiceAccount"
I1204 15:37:14.586763       1 plugins.go:84] Registered admission plugin "DefaultStorageClass"
I1204 15:37:14.586773       1 plugins.go:84] Registered admission plugin "PersistentVolumeClaimResize"
I1204 15:37:14.586784       1 plugins.go:84] Registered admission plugin "StorageObjectInUseProtection"
F1204 15:37:44.610427       1 start_api.go:68] dial tcp 10.0.4.57:2379: connect: connection refused
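
`connection refused` in the fatal line generally means nothing was accepting connections on 10.0.4.57:2379 at that moment, which fits etcd having exited. It is a different failure from the earlier TLS handshake error, where the port was open. A small sketch for telling the two states apart from the master host, using only the Python standard library (`probe_tcp` is a hypothetical helper, and it checks TCP reachability only, not certificates):

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Attempt a plain TCP connect and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <details>".
    No TLS handshake is performed.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but no process is listening on the port.
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Anything else, for example an unreachable network.
        return f"error: {exc}"
```

Calling `probe_tcp("10.0.4.57", 2379)` while the etcd container is down should return `refused`; `open` only shows that something is listening, not that the TLS handshake will succeed.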

Version
oc version
oc v3.11.0+62803d0-1
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://os-int.tremolo.local:443
openshift v3.11.0+06cfa24-67
kubernetes v1.11.0+d4cacc0

Steps To Reproduce

  1. Deploy OKD 3.10.
  2. Remove and disable the Ansible Service Broker.
  3. Upgrade to 3.11.

Current Result

The API server dies after a few minutes.

Expected Result

Neither etcd nor the API server crashes.

Additional Information
cat /etc/redhat-release 
CentOS Linux release 7.6.1810 (Core)

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 15 (1 by maintainers)

Most upvoted comments

Red Hat’s solution is to downgrade Docker to version 1.13.1-75 (https://access.redhat.com/solutions/3734981), which fixed the etcd crashes for me.
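
As a rough aid for checking a host against the version named in this comment, here is a minimal sketch that compares a Docker version-release string with `1.13.1-75`. It is an illustration only: how the string is obtained (for example from `rpm -q docker`) is left out, the helper names are hypothetical, and a newer release is not proof that a host is affected.

```python
import re

KNOWN_GOOD = "1.13.1-75"  # version-release quoted in the comment above


def parse_version_release(text):
    """Split 'X.Y.Z-R[.suffix]' into ((X, Y, Z), R).

    Only the leading integer of the release part is kept; any
    distribution suffix after it is ignored.
    """
    match = re.match(r"^(\d+(?:\.\d+)*)-(\d+)", text.strip())
    if not match:
        raise ValueError(f"unrecognized version-release string: {text!r}")
    version = tuple(int(part) for part in match.group(1).split("."))
    return version, int(match.group(2))


def is_newer_than_known_good(installed, known_good=KNOWN_GOOD):
    """Return True if `installed` sorts after the known-good version-release."""
    return parse_version_release(installed) > parse_version_release(known_good)
```

If the function returns True for the installed package, the downgrade described in the linked Red Hat solution may be worth trying.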