rancher: Deployed pods missing workloadID label - no active endpoints - 503 error

Rancher versions:
rancher/server or rancher/rancher: 2.1.0
rancher/agent or rancher/rancher-agent: 2.1.0

Infrastructure Stack versions:
healthcheck:
ipsec:
network-services:
scheduler:
kubernetes (if applicable): 1.12.0

Docker version: (docker version, docker info preferred)

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)

Setup details: (single node rancher vs. HA rancher, internal DB vs. external DB)

Single node rancher.

Environment Template: (Cattle/Kubernetes/Swarm/Mesos)

Kubernetes

Steps to Reproduce:

(originally filed as https://github.com/kubernetes/kubernetes/issues/69563 but I now suspect that the missing label may be related to R2)

We deploy a new version of our app by changing the spec.template.spec.containers[0].image attribute of the Deployment YAML, as described in the documentation for Deployment controllers…
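The change we make is a minimal sketch to illustrate: only the image field of the first container is replaced, everything else in the manifest stays the same (the dict below is a trimmed, illustrative version of our Deployment, and the new tag is hypothetical):

```python
import copy

def set_image(deployment: dict, new_image: str) -> dict:
    """Return a copy of the Deployment dict with the first container's image replaced."""
    updated = copy.deepcopy(deployment)
    updated["spec"]["template"]["spec"]["containers"][0]["image"] = new_image
    return updated

# Trimmed-down stand-in for the Deployment manifest shown below.
deployment = {
    "spec": {"template": {"spec": {"containers": [
        {"name": "app", "image": "registry.ourdomain.com:5000/namespace/app:31848589"},
    ]}}}
}

# Hypothetical new build tag; in practice this comes from our CI pipeline.
updated = set_image(deployment, "registry.ourdomain.com:5000/namespace/app:31848590")
```

Applying this changed manifest is what triggers the Deployment controller's rolling update.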

The Deployment YAML looks like this:

apiVersion: apps/v1beta2
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "17"
    field.cattle.io/creatorId: user-5jgmc
  creationTimestamp: 2018-10-02T04:54:33Z
  generation: 45
  labels:
    workload.user.cattle.io/workloadselector: deployment-cms-app
  name: app
  namespace: cms
  resourceVersion: "175227"
  selfLink: /apis/apps/v1beta2/namespaces/cms/deployments/app
  uid: 40eeb300-c5ff-11e8-91dc-001b21dc82ba
spec:
  minReadySeconds: 5
  progressDeadlineSeconds: 60
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      workload.user.cattle.io/workloadselector: deployment-cms-app
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
        workload.cattle.io/state: '{"b3ZoLWJtLTE=":"c-gw6hx:m-190d5b1abdb1","b3ZoLWJtLTM=":"c-gw6hx:m-93ddef52ec17","b3ZoLWRiLTE=":"c-mjbqh:m-cfa61f40f7d7"}'
      creationTimestamp: null
      labels:
        workload.user.cattle.io/workloadselector: deployment-cms-app
    spec:
      affinity: {}
      containers:
      - env:
        - <redacted>
        image: registry.ourdomain.com:5000/namespace/app:31848589
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /robots.txt
            port: 80
            scheme: HTTP
          initialDelaySeconds: 2
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 5
        name: app
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /robots.txt
            port: 80
            scheme: HTTP
          initialDelaySeconds: 2
          periodSeconds: 5
          successThreshold: 2
          timeoutSeconds: 5
        resources: {}
        securityContext:
          allowPrivilegeEscalation: false
          privileged: false
          readOnlyRootFilesystem: false
          runAsNonRoot: false
        stdin: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        tty: true
        volumeMounts:
        - <redacted>
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: registry-secret
      nodeName: ovh-app-1
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - <redacted>
status:
  availableReplicas: 3
  conditions:
  - lastTransitionTime: 2018-10-02T05:18:54Z
    lastUpdateTime: 2018-10-02T05:18:54Z
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: 2018-10-03T03:48:46Z
    lastUpdateTime: 2018-10-03T04:48:25Z
    message: ReplicaSet "app-5bf7dbc69f" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 45
  readyReplicas: 3
  replicas: 3
  updatedReplicas: 3

I can see the Deployments, ReplicaSets and Services as expected.

$ kubectl get deployment -n cms
NAME    DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
app     3         3         3            3           18h
redis   1         1         1            1           19h
$ kubectl get replicaset -n cms
NAME              DESIRED   CURRENT   READY   AGE
app-5bf7dbc69f    3         3         3       9h
app-7dc677d665    0         0         0       17h
app-849cc7c58d    0         0         0       18h
app-dd6cf6698     0         0         0       17h
redis-66985bf6c   1         1         1       19h
$ kubectl get service -n cms 
NAME                                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
app                                        ClusterIP   None            <none>        42/TCP    20h
ingress-e213c2b4c622329de7aa2c0c28dc37e5   ClusterIP   10.43.216.158   <none>        80/TCP    16s
redis                                      ClusterIP   None            <none>        42/TCP    21h

$ kubectl get service -n cms ingress-e213c2b4c622329de7aa2c0c28dc37e5 -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    field.cattle.io/targetWorkloadIds: '["deployment:cms:app"]'
  creationTimestamp: 2018-10-09T08:39:00Z
  labels:
    cattle.io/creator: norman
  name: ingress-e213c2b4c622329de7aa2c0c28dc37e5
  namespace: cms
  ownerReferences:
  - apiVersion: v1beta1/extensions
    controller: true
    kind: Ingress
    name: cms
    uid: 6617bfe4-c63f-11e8-b01c-9e111c023110
  resourceVersion: "1479746"
  selfLink: /api/v1/namespaces/cms/services/ingress-e213c2b4c622329de7aa2c0c28dc37e5
  uid: c4d9594b-cb9e-11e8-a6e1-9e111c023110
spec:
  clusterIP: 10.43.216.158
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    workloadID_ingress-e213c2b4c622329de7aa2c0c28dc37e5: "true"
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

$ kubectl get ingress -n cms cms -o yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    field.cattle.io/creatorId: user-5jgmc
    field.cattle.io/ingressState: '{"Y21zL2Ntcy9kZXYuY21zLmFnZW50ZGVzaWduLmNvLnVrLy8vODA=":"deployment:cms:app","Y21zLWRldi1sZQ==":"cms:cms-dev-le"}'
    field.cattle.io/publicEndpoints: '[{"addresses":["94.237.50.126"],"port":443,"protocol":"HTTPS","serviceName":"cms:ingress-e213c2b4c622329de7aa2c0c28dc37e5","ingressName":"cms:cms","hostname":"dev.cms.agentdesign.co.uk","path":"/","allNodes":true}]'
  creationTimestamp: 2018-10-02T12:33:43Z
  generation: 15
  name: cms
  namespace: cms
  resourceVersion: "1479727"
  selfLink: /apis/extensions/v1beta1/namespaces/cms/ingresses/cms
  uid: 6617bfe4-c63f-11e8-b01c-9e111c023110
spec:
  rules:
  - host: dev.cms.agentdesign.co.uk
    http:
      paths:
      - backend:
          serviceName: ingress-e213c2b4c622329de7aa2c0c28dc37e5
          servicePort: 80
        path: /
  tls:
  - hosts:
    - dev.cms.agentdesign.co.uk
    secretName: cms-dev-le
status:
  loadBalancer:
    ingress:
    - ip: 94.237.50.126
    - ip: 94.237.51.162
    - ip: 94.237.54.20
    - ip: 94.237.54.24
    - ip: 94.237.54.26
$ kubectl get endpoints -n cms ingress-e213c2b4c622329de7aa2c0c28dc37e5
NAME                                       ENDPOINTS                                      AGE
ingress-e213c2b4c622329de7aa2c0c28dc37e5   10.42.0.157:80,10.42.3.146:80,10.42.4.168:80   10m

The Deployment scales down the ‘old’ ReplicaSet and scales up the ‘new’ ReplicaSet. I can see this is happening as expected.

$ kubectl get replicaset -n cms
NAME              DESIRED   CURRENT   READY   AGE
app-5bf7dbc69f    2         2         2       9h
app-7dc677d665    2         2         1       17h
app-849cc7c58d    0         0         0       18h
app-dd6cf6698     0         0         0       17h
redis-66985bf6c   1         1         1       19h

After a few seconds it’s fully scaled…

$ kubectl get replicaset -n cms
NAME              DESIRED   CURRENT   READY   AGE
app-5bf7dbc69f    0         0         0       9h
app-7dc677d665    3         3         3       17h
app-849cc7c58d    0         0         0       18h
app-dd6cf6698     0         0         0       17h
redis-66985bf6c   1         1         1       19h
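For context, the rollout behaviour above follows from our strategy settings (replicas: 3, maxSurge: 1, maxUnavailable: 0). A minimal sketch of the bounds the Deployment controller enforces:

```python
def rolling_update_bounds(replicas: int, max_surge: int, max_unavailable: int):
    """Pod-count bounds during a rollout: (max pods existing at once,
    min pods that must remain available)."""
    max_total = replicas + max_surge
    min_available = replicas - max_unavailable
    return max_total, min_available

# Our spec: replicas=3, maxSurge=1, maxUnavailable=0
max_total, min_available = rolling_update_bounds(3, 1, 0)
# At most 4 pods exist at once, and all 3 must stay available throughout,
# so the Service should never be left without ready Pods.
```

That is, the rollout itself cannot explain a window with zero ready Pods; the endpoints should never drop to none.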

However, at this point the site shows a 503 Service Temporarily Unavailable (nginx/1.13.12), and the nginx-ingress logs show that the ingress backend has been removed from the generated nginx configuration…

W1009 07:01:57.954692       5 controller.go:769] Service "cms/ingress-e213c2b4c622329de7aa2c0c28dc37e5" does not have any active Endpoint.
I1009 07:01:57.955086       5 controller.go:169] Configuration changes detected, backend reload required.
I1009 07:01:57.955154       5 util.go:68] rlimit.max=1048576
I1009 07:01:57.955184       5 nginx.go:519] Maximum number of open file descriptors: 1047552
I1009 07:01:58.051641       5 nginx.go:626] NGINX configuration diff:
--- /etc/nginx/nginx.conf       2018-10-09 07:01:48.210045444 +0000
+++ /tmp/new-nginx-cfg780268064 2018-10-09 07:01:58.046093899 +0000
@@ -211,21 +211,12 @@
                
                keepalive 32;
                
-               server 10.42.3.138:80 max_fails=0 fail_timeout=0;
                server 10.42.0.148:80 max_fails=0 fail_timeout=0;
+               server 10.42.3.138:80 max_fails=0 fail_timeout=0;
                server 10.42.4.157:80 max_fails=0 fail_timeout=0;
                
        }
        
-       upstream cms-ingress-e213c2b4c622329de7aa2c0c28dc37e5-80 {
-               least_conn;
-               
-               keepalive 32;
-               
-               server 10.42.4.167:80 max_fails=0 fail_timeout=0;
-               
-       }
-       
        upstream db-ingress-231cd0bcc1b631a6142a515c3a0858e8-80 {
                least_conn;
                
@@ -657,7 +648,7 @@
                        
                        port_in_redirect off;
                        
-                       set $proxy_upstream_name "cms-ingress-e213c2b4c622329de7aa2c0c28dc37e5-80";
+                       set $proxy_upstream_name "";
                        
                        # enforce ssl on server side
                        if ($redirect_to_https) {
@@ -717,9 +708,8 @@
                        proxy_next_upstream                     error timeout;
                        proxy_next_upstream_tries               3;
                        
-                       proxy_pass http://cms-ingress-e213c2b4c622329de7aa2c0c28dc37e5-80;
-                       
-                       proxy_redirect                          off;
+                       # No endpoints available for the request
+                       return 503;
                        
                }
                
I1009 07:01:58.117524       5 controller.go:179] Backend successfully reloaded.

The reason for the ‘Service … does not have any active Endpoint’ warning is, according to the docs, that the ‘endpoints controller has [not] found the correct Pods for your Service’.

$ kubectl get endpoints -n cms ingress-e213c2b4c622329de7aa2c0c28dc37e5
NAME                                       ENDPOINTS   AGE
ingress-e213c2b4c622329de7aa2c0c28dc37e5   <none>      12m

The advice given is to check whether the spec.selector field of the Service matches the metadata.labels on your Pods. The spec.selector is:

spec:
  selector:
    workloadID_ingress-e213c2b4c622329de7aa2c0c28dc37e5: "true"

The metadata.labels on the Pods are:

metadata:
  labels:
    pod-template-hash: "1693867259"
    workload.user.cattle.io/workloadselector: deployment-cms-app

So the docs are correct: the selector does not match the Pod labels. But I’m still not sure why the workloadID_… label is not being set on the workload’s Pods.

What you expected to happen:

The endpoints controller should find active endpoints for the Service, so the ingress keeps routing traffic across a rolling update.

How to reproduce it (as minimally and precisely as possible):

The same issue occurs on both our production and development clusters and was still present after we rebuilt both as fresh clusters and migrated things across. I suspect it’s reproducible elsewhere.

Anything else we need to know?:

Not that I can think of currently.

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:46:06Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T17:53:03Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
Production - 3x Bare metal hosts with 64GB
Development - 5x Standard VMs with 8GB
  • OS (e.g. from /etc/os-release):
VERSION="18.04.1 LTS (Bionic Beaver)"
  • Kernel (e.g. uname -a):
Linux <hostnameredacted> 4.15.0-36-generic #39-Ubuntu SMP Mon Sep 24 16:19:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools:

Clusters were deployed using Rancher2 (v2.0.8).

  • Others:

N/A

Results:

The deployed Service ends up with no active endpoints, causing nginx-ingress to return 503 Service Temporarily Unavailable.

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 22 (3 by maintainers)

Most upvoted comments

Let’s keep this issue open

Hey guys. I’ve been blocked from releasing and testing an important product for my company, and this is causing a big problem for my team. Could you please give me some advice about this issue?

Thanks

Hi @otreda

Can you provide the YAML file for the Service of the Ingress, please?