rancher: Deployed pods missing workloadID label - no active endpoints - 503 error

Rancher versions:
rancher/server or rancher/rancher: 2.1.0
rancher/agent or rancher/rancher-agent: 2.1.0

Infrastructure Stack versions:
healthcheck:
ipsec:
network-services:
scheduler:
kubernetes (if applicable): 1.12.0

Docker version: (docker version, docker info preferred)

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)

Setup details: (single node rancher vs. HA rancher, internal DB vs. external DB)

Single node rancher.

Environment Template: (Cattle/Kubernetes/Swarm/Mesos)

Kubernetes

Steps to Reproduce:

(originally filed as https://github.com/kubernetes/kubernetes/issues/69563 but I now suspect that the missing label may be related to R2)

We deploy a new version of our app by changing the spec.template.spec.containers[0].image attribute of the Deployment YAML, as described in the documentation for Deployment controllers…
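The change we make is a minimal sketch to illustrate: only the image field of the first container is replaced, everything else in the manifest stays the same (the dict below is a trimmed, illustrative version of our Deployment, and the new tag is hypothetical):

```python
import copy

def set_image(deployment: dict, new_image: str) -> dict:
    """Return a copy of the Deployment dict with the first container's image replaced."""
    updated = copy.deepcopy(deployment)
    updated["spec"]["template"]["spec"]["containers"][0]["image"] = new_image
    return updated

# Trimmed-down stand-in for the Deployment manifest shown below.
deployment = {
    "spec": {"template": {"spec": {"containers": [
        {"name": "app", "image": "registry.ourdomain.com:5000/namespace/app:31848589"},
    ]}}}
}

# Hypothetical new build tag; in practice this comes from our CI pipeline.
updated = set_image(deployment, "registry.ourdomain.com:5000/namespace/app:31848590")
```

Applying this changed manifest is what triggers the Deployment controller's rolling update.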

The Deployment YAML looks like this:

apiVersion: apps/v1beta2
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "17"
    field.cattle.io/creatorId: user-5jgmc
  creationTimestamp: 2018-10-02T04:54:33Z
  generation: 45
  labels:
    workload.user.cattle.io/workloadselector: deployment-cms-app
  name: app
  namespace: cms
  resourceVersion: "175227"
  selfLink: /apis/apps/v1beta2/namespaces/cms/deployments/app
  uid: 40eeb300-c5ff-11e8-91dc-001b21dc82ba
spec:
  minReadySeconds: 5
  progressDeadlineSeconds: 60
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      workload.user.cattle.io/workloadselector: deployment-cms-app
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
        workload.cattle.io/state: '{"b3ZoLWJtLTE=":"c-gw6hx:m-190d5b1abdb1","b3ZoLWJtLTM=":"c-gw6hx:m-93ddef52ec17","b3ZoLWRiLTE=":"c-mjbqh:m-cfa61f40f7d7"}'
      creationTimestamp: null
      labels:
        workload.user.cattle.io/workloadselector: deployment-cms-app
    spec:
      affinity: {}
      containers:
      - env:
        - <redacted>
        image: registry.ourdomain.com:5000/namespace/app:31848589
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /robots.txt
            port: 80
            scheme: HTTP
          initialDelaySeconds: 2
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 5
        name: app
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /robots.txt
            port: 80
            scheme: HTTP
          initialDelaySeconds: 2
          periodSeconds: 5
          successThreshold: 2
          timeoutSeconds: 5
        resources: {}
        securityContext:
          allowPrivilegeEscalation: false
          privileged: false
          readOnlyRootFilesystem: false
          runAsNonRoot: false
        stdin: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        tty: true
        volumeMounts:
        - <redacted>
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: registry-secret
      nodeName: ovh-app-1
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - <redacted>
status:
  availableReplicas: 3
  conditions:
  - lastTransitionTime: 2018-10-02T05:18:54Z
    lastUpdateTime: 2018-10-02T05:18:54Z
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: 2018-10-03T03:48:46Z
    lastUpdateTime: 2018-10-03T04:48:25Z
    message: ReplicaSet "app-5bf7dbc69f" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 45
  readyReplicas: 3
  replicas: 3
  updatedReplicas: 3

I can see the Deployments, ReplicaSets and Services as expected.

$ kubectl get deployment -n cms
NAME    DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
app     3         3         3            3           18h
redis   1         1         1            1           19h
$ kubectl get replicaset -n cms
NAME              DESIRED   CURRENT   READY   AGE
app-5bf7dbc69f    3         3         3       9h
app-7dc677d665    0         0         0       17h
app-849cc7c58d    0         0         0       18h
app-dd6cf6698     0         0         0       17h
redis-66985bf6c   1         1         1       19h
$ kubectl get service -n cms 
NAME                                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
app                                        ClusterIP   None            <none>        42/TCP    20h
ingress-e213c2b4c622329de7aa2c0c28dc37e5   ClusterIP   10.43.216.158   <none>        80/TCP    16s
redis                                      ClusterIP   None            <none>        42/TCP    21h

$ kubectl get service -n cms ingress-e213c2b4c622329de7aa2c0c28dc37e5 -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    field.cattle.io/targetWorkloadIds: '["deployment:cms:app"]'
  creationTimestamp: 2018-10-09T08:39:00Z
  labels:
    cattle.io/creator: norman
  name: ingress-e213c2b4c622329de7aa2c0c28dc37e5
  namespace: cms
  ownerReferences:
  - apiVersion: v1beta1/extensions
    controller: true
    kind: Ingress
    name: cms
    uid: 6617bfe4-c63f-11e8-b01c-9e111c023110
  resourceVersion: "1479746"
  selfLink: /api/v1/namespaces/cms/services/ingress-e213c2b4c622329de7aa2c0c28dc37e5
  uid: c4d9594b-cb9e-11e8-a6e1-9e111c023110
spec:
  clusterIP: 10.43.216.158
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    workloadID_ingress-e213c2b4c622329de7aa2c0c28dc37e5: "true"
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

$ kubectl get ingress -n cms cms -o yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    field.cattle.io/creatorId: user-5jgmc
    field.cattle.io/ingressState: '{"Y21zL2Ntcy9kZXYuY21zLmFnZW50ZGVzaWduLmNvLnVrLy8vODA=":"deployment:cms:app","Y21zLWRldi1sZQ==":"cms:cms-dev-le"}'
    field.cattle.io/publicEndpoints: '[{"addresses":["94.237.50.126"],"port":443,"protocol":"HTTPS","serviceName":"cms:ingress-e213c2b4c622329de7aa2c0c28dc37e5","ingressName":"cms:cms","hostname":"dev.cms.agentdesign.co.uk","path":"/","allNodes":true}]'
  creationTimestamp: 2018-10-02T12:33:43Z
  generation: 15
  name: cms
  namespace: cms
  resourceVersion: "1479727"
  selfLink: /apis/extensions/v1beta1/namespaces/cms/ingresses/cms
  uid: 6617bfe4-c63f-11e8-b01c-9e111c023110
spec:
  rules:
  - host: dev.cms.agentdesign.co.uk
    http:
      paths:
      - backend:
          serviceName: ingress-e213c2b4c622329de7aa2c0c28dc37e5
          servicePort: 80
        path: /
  tls:
  - hosts:
    - dev.cms.agentdesign.co.uk
    secretName: cms-dev-le
status:
  loadBalancer:
    ingress:
    - ip: 94.237.50.126
    - ip: 94.237.51.162
    - ip: 94.237.54.20
    - ip: 94.237.54.24
    - ip: 94.237.54.26
$ kubectl get endpoints -n cms ingress-e213c2b4c622329de7aa2c0c28dc37e5
NAME                                       ENDPOINTS                                      AGE
ingress-e213c2b4c622329de7aa2c0c28dc37e5   10.42.0.157:80,10.42.3.146:80,10.42.4.168:80   10m

The Deployment scales down the ‘old’ ReplicaSet and scales up the ‘new’ ReplicaSet. I can see this is happening as expected.

$ kubectl get replicaset -n cms
NAME              DESIRED   CURRENT   READY   AGE
app-5bf7dbc69f    2         2         2       9h
app-7dc677d665    2         2         1       17h
app-849cc7c58d    0         0         0       18h
app-dd6cf6698     0         0         0       17h
redis-66985bf6c   1         1         1       19h

After a few seconds it’s fully scaled…

$ kubectl get replicaset -n cms
NAME              DESIRED   CURRENT   READY   AGE
app-5bf7dbc69f    0         0         0       9h
app-7dc677d665    3         3         3       17h
app-849cc7c58d    0         0         0       18h
app-dd6cf6698     0         0         0       17h
redis-66985bf6c   1         1         1       19h
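For context, the rollout behaviour above follows from our strategy settings (replicas: 3, maxSurge: 1, maxUnavailable: 0). A minimal sketch of the bounds the Deployment controller enforces:

```python
def rolling_update_bounds(replicas: int, max_surge: int, max_unavailable: int):
    """Pod-count bounds during a rollout: (max pods existing at once,
    min pods that must remain available)."""
    max_total = replicas + max_surge
    min_available = replicas - max_unavailable
    return max_total, min_available

# Our spec: replicas=3, maxSurge=1, maxUnavailable=0
max_total, min_available = rolling_update_bounds(3, 1, 0)
# At most 4 pods exist at once, and all 3 must stay available throughout,
# so the Service should never be left without ready Pods.
```

That is, the rollout itself cannot explain a window with zero ready Pods; the endpoints should never drop to none.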

However, at this point the site shows a 503 Service Temporarily Unavailable (nginx/1.13.12), and the nginx-ingress logs show that the ingress backend has been removed from the generated nginx configuration…

W1009 07:01:57.954692       5 controller.go:769] Service "cms/ingress-e213c2b4c622329de7aa2c0c28dc37e5" does not have any active Endpoint.
I1009 07:01:57.955086       5 controller.go:169] Configuration changes detected, backend reload required.
I1009 07:01:57.955154       5 util.go:68] rlimit.max=1048576
I1009 07:01:57.955184       5 nginx.go:519] Maximum number of open file descriptors: 1047552
I1009 07:01:58.051641       5 nginx.go:626] NGINX configuration diff:
--- /etc/nginx/nginx.conf       2018-10-09 07:01:48.210045444 +0000
+++ /tmp/new-nginx-cfg780268064 2018-10-09 07:01:58.046093899 +0000
@@ -211,21 +211,12 @@
                
                keepalive 32;
                
-               server 10.42.3.138:80 max_fails=0 fail_timeout=0;
                server 10.42.0.148:80 max_fails=0 fail_timeout=0;
+               server 10.42.3.138:80 max_fails=0 fail_timeout=0;
                server 10.42.4.157:80 max_fails=0 fail_timeout=0;
                
        }
        
-       upstream cms-ingress-e213c2b4c622329de7aa2c0c28dc37e5-80 {
-               least_conn;
-               
-               keepalive 32;
-               
-               server 10.42.4.167:80 max_fails=0 fail_timeout=0;
-               
-       }
-       
        upstream db-ingress-231cd0bcc1b631a6142a515c3a0858e8-80 {
                least_conn;
                
@@ -657,7 +648,7 @@
                        
                        port_in_redirect off;
                        
-                       set $proxy_upstream_name "cms-ingress-e213c2b4c622329de7aa2c0c28dc37e5-80";
+                       set $proxy_upstream_name "";
                        
                        # enforce ssl on server side
                        if ($redirect_to_https) {
@@ -717,9 +708,8 @@
                        proxy_next_upstream                     error timeout;
                        proxy_next_upstream_tries               3;
                        
-                       proxy_pass http://cms-ingress-e213c2b4c622329de7aa2c0c28dc37e5-80;
-                       
-                       proxy_redirect                          off;
+                       # No endpoints available for the request
+                       return 503;
                        
                }
                
I1009 07:01:58.117524       5 controller.go:179] Backend successfully reloaded.

The reason for the ‘Service … does not have any active Endpoint’ warning is, according to the docs, that the ‘endpoints controller has [not] found the correct Pods for your Service’.

$ kubectl get endpoints -n cms ingress-e213c2b4c622329de7aa2c0c28dc37e5
NAME                                       ENDPOINTS   AGE
ingress-e213c2b4c622329de7aa2c0c28dc37e5   <none>      12m

The advice given is to check whether the spec.selector field of the Service matches the metadata.labels on your Pods. The spec.selector is:

spec:
  selector:
    workloadID_ingress-e213c2b4c622329de7aa2c0c28dc37e5: "true"

The metadata.labels on the Pods are:

metadata:
  labels:
    pod-template-hash: "1693867259"
    workload.user.cattle.io/workloadselector: deployment-cms-app

So the docs are correct: the selector does not match the Pod labels. But I’m still not sure why the workloadID_… label is not being set on the workload’s Pods.

What you expected to happen:

The endpoints controller should find active endpoints for the Service, so the ingress keeps routing traffic across a rolling update.

How to reproduce it (as minimally and precisely as possible):

The same issue occurs on both our production and development clusters and was still present after we rebuilt both as fresh clusters and migrated things across. I suspect it’s reproducible elsewhere.

Anything else we need to know?:

Not that I can think of currently.

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:46:06Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T17:53:03Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
Production - 3x Bare metal hosts with 64GB
Development - 5x Standard VMs with 8GB
  • OS (e.g. from /etc/os-release):
VERSION="18.04.1 LTS (Bionic Beaver)"
  • Kernel (e.g. uname -a):
Linux <hostnameredacted> 4.15.0-36-generic #39-Ubuntu SMP Mon Sep 24 16:19:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools:

Clusters were deployed using Rancher2 (v2.0.8).

  • Others:

N/A

Results:

The deployed Service ends up with no active endpoints, causing nginx-ingress to return 503 Service Temporarily Unavailable.

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 22 (3 by maintainers)

Most upvoted comments

Let’s keep this issue open

Hey guys. I’ve been blocked from releasing and testing an important product for my company, and this is causing a big problem for my team. Could you please give me some advice about this issue?

Thanks

Hi @otreda

Can you provide the YAML file for the Service of the Ingress, please?