kubernetes: Endpoints unavailable for a long time (almost 2h) for a correct Service

Is this a BUG REPORT or FEATURE REQUEST?: /kind bug

What happened: One of our Endpoints objects went missing for a Service that had 1 matching Pod running. The Endpoints object returned after 1h45m.

What you expected to happen: For the Endpoints object to stay put and remain available.

How to reproduce it (as minimally and precisely as possible): I have no idea; I don't even know why it fixed itself.

Anything else we need to know?:

Let me show some details. We run a service that uses providers to watch for Endpoints changes. There, for one Service, I got the following event: `time="2018-01-05T11:01:33Z" level=info msg="Endpoints delete event: mongo-32-2"`. The Deployment was running just 1 pod, so this had to be the last endpoint.
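
For reference, the same churn can be seen with plain kubectl, independent of our watcher (only a rough sketch, using the namespace and name from this report):

# follow the Endpoints object while it exists
$ kubectl -n placeable-qa get endpoints mongo-32-2 --watch -o wide

# or poll it, to timestamp exactly when it vanishes and when it comes back
$ while true; do date -u +%FT%TZ; kubectl -n placeable-qa get endpoints mongo-32-2 -o wide 2>&1; sleep 10; done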

I ran the following check to see what’s going on:

$ kubectl -n placeable-qa get svc/mongo-32-2 -o yaml
apiVersion: v1
kind: Service
...
  name: mongo-32-2
  namespace: placeable-qa
...
spec:
  clusterIP: 10.111.196.30
  ports:
  - port: 27017
    protocol: TCP
    targetPort: 27017
  selector:
    app: mongo-32-2
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

So the Service is defined and has the selector app=mongo-32-2. Now the Deployment:

$ kubectl -n placeable-qa get deploy/mongo-32-2 -o yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
...
  name: mongo-32-2
  namespace: placeable-qa
  resourceVersion: "5884608"
...
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: mongo-32-2
...
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: 2018-01-05T03:02:44Z
    lastUpdateTime: 2018-01-05T03:02:44Z
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: 2018-01-05T02:59:49Z
    lastUpdateTime: 2018-01-05T03:02:44Z
    message: ReplicaSet "mongo-32-2-855f6f7547" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

and the Pods:

$ kubectl -n placeable-qa get po -l app=mongo-32-2  
NAME                          READY     STATUS    RESTARTS   AGE
mongo-32-2-855f6f7547-sn4ph   1/1       Running   0          9h

yet for the Endpoints:

$ kubectl -n placeable-qa get ep mongo-32-2
Error from server (NotFound): endpoints "mongo-32-2" not found
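
For completeness, the label/readiness side can be cross-checked like this (a sketch; the pod name is the one shown above):

# confirm the running pod really carries the label the Service selects on
$ kubectl -n placeable-qa get pods -l app=mongo-32-2 --show-labels

# confirm the pod is Ready; only Ready pods are listed as endpoint addresses
$ kubectl -n placeable-qa get pod mongo-32-2-855f6f7547-sn4ph -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'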

Surprisingly, the Endpoints object came back after 1h45m! This is the log from our service:

`time="2018-01-05T11:01:33Z" level=info msg="Endpoints delete event: mongo-32-2"`
...
`time="2018-01-05T12:45:10Z" level=info msg="New endpoints created event: mongo-32-2"`

Environment:

  • Kubernetes version (use kubectl version):
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.5", GitCommit:"cce11c6a185279d037023e02ac5249e14daa22bf", GitTreeState:"clean", BuildDate:"2017-12-07T16:16:03Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"8+", GitVersion:"v1.8.5+coreos.0", GitCommit:"b8e596026feda7b97f4337b115d1a9a250afa8ac", GitTreeState:"clean", BuildDate:"2017-12-12T11:01:08Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: AWS x1.32xlarge
  • OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="16.04.3 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.3 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
  • Kernel (e.g. uname -a): Linux central-ansible 4.13.0-21-generic #24~16.04.1-Ubuntu SMP Mon Dec 18 19:39:31 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: kubespray + custom

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 12
  • Comments: 20 (3 by maintainers)

Most upvoted comments

this is a duplicate of #58545, fixed in #58547

delete watch events now have the current etcd index, not the last index of the resource

this is fixed in v1.8.8+, v1.9.3+, and v1.10.0+

/close

Within the logs we also notice some services that were deleted several days ago but that the service_controller and endpoints_controller seem to be obsessing over even though they no longer exist. One such service is prod/private-vehicle-inventory-core-ingress-nginx.

5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/microservice-catalog-vehicles to registry.
I0518 13:28:01.798798       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-crm to registry.
I0518 13:28:01.819749       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 13:28:01.819853       5 service_controller.go:754] Service has been deleted prod/private-vehicle-inventory-core-ingress-nginx
E0518 13:28:01.819870       5 service_controller.go:775] Failed to process service. Not retrying: Service prod/private-vehicle-inventory-core-ingress-nginx not in cache even though the watcher thought it was. Ignoring the deletion.
I0518 13:28:01.845959       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-configuration to registry.
I0518 13:28:01.867423       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-catalog to registry.
I0518 13:28:01.890048       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/backbone-account to registry.
I0518 13:28:01.890086       5 service_controller.go:754] Service has been deleted prod/backbone-socket
I0518 13:28:01.890104       5 service_controller.go:754] Service has been deleted prod/httpd-router
E0518 13:28:01.890132       5 service_controller.go:775] Failed to process service. Not retrying: Service prod/httpd-router not in cache even though the watcher thought it was. Ignoring the deletion.
I0518 13:28:01.890230       5 service_controller.go:754] Service has been deleted prod/webauto-ingress-nginx
E0518 13:28:01.890246       5 service_controller.go:775] Failed to process service. Not retrying: Service prod/webauto-ingress-nginx not in cache even though the watcher thought it was. Ignoring the deletion.
I0518 13:35:43.670507       5 service_controller.go:754] Service has been deleted prod/microservice-catalog-vehicles-deployment
I0518 13:35:43.670783       5 service_controller.go:754] Service has been deleted prod/microservice-catalog-vehicles
I0518 13:35:43.707909       5 endpoints_controller.go:375] Error syncing endpoints for service "prod/core-crm": endpoints "core-crm" already exists
I0518 13:35:43.755312       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/microservice-catalog-vehicles to registry.
I0518 13:35:43.781406       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-crm to registry.
I0518 13:35:43.808421       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 13:35:43.808602       5 service_controller.go:754] Service has been deleted prod/private-vehicle-inventory-core-ingress-nginx
E0518 13:35:43.808623       5 service_controller.go:775] Failed to process service. Not retrying: Service prod/private-vehicle-inventory-core-ingress-nginx not in cache even though the watcher thought it was. Ignoring the deletion.
I0518 13:35:43.836490       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-configuration to registry.
I0518 13:35:43.854520       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-catalog to registry.
I0518 13:35:43.872177       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/backbone-account to registry.
I0518 13:35:43.872208       5 service_controller.go:754] Service has been deleted prod/backbone-socket
I0518 13:35:43.872272       5 service_controller.go:754] Service has been deleted prod/httpd-router
E0518 13:35:43.872286       5 service_controller.go:775] Failed to process service. Not retrying: Service prod/httpd-router not in cache even though the watcher thought it was. Ignoring the deletion.
I0518 13:35:43.872489       5 service_controller.go:754] Service has been deleted prod/webauto-ingress-nginx
E0518 13:35:43.872512       5 service_controller.go:775] Failed to process service. Not retrying: Service prod/webauto-ingress-nginx not in cache even though the watcher thought it was. Ignoring the deletion.

It seems that all the services for which we have this problem generate this message in the logs:

Not persisting unchanged LoadBalancerStatus for service prod/backbone-account to registry.
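
To list every service affected this way, the controller-manager log can be grepped for that pair of messages (a sketch; the log path is environment-specific, ours live under /var/log as shown further down):

# count how often each namespace/service shows up in the repeating sync messages
$ grep -E 'Not persisting unchanged LoadBalancerStatus|Error syncing endpoints' /var/log/kube-controller-manager.log | grep -oE '[a-z0-9-]+/[a-z0-9-]+' | sort | uniq -c | sort -rn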

Since the situation is occurring at this very moment, here is a little more information.

kubectl get ep core-reporting
Error from server (NotFound): endpoints "core-reporting" not found
kubectl get pods | grep core-reporting
core-reporting-599f9584bc-5rrwr                           1/1       Running   0          18h
core-reporting-599f9584bc-755md                           1/1       Running   0          18h
core-reporting-599f9584bc-bphs9                           1/1       Running   0          18h
kubectl describe pod core-reporting-599f9584bc-755md
Name:           core-reporting-599f9584bc-755md
Namespace:      prod
Node:          <redacted valid node>
Start Time:     Thu, 17 May 2018 15:12:18 -0400
Labels:         app=core-reporting
                pod-template-hash=1559514067
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"prod","name":"core-reporting-599f9584bc","uid":"b505667f-5a05-11e8-b25a-0215c9f29...
Status:         Running
IP:             <redacted valid ip>
Controlled By:  ReplicaSet/core-reporting-599f9584bc
Containers:
  core-reporting:
    Container ID:   docker://8053351fb7e0ec4823364b011cf7a277eb750317f0edfc60d2c486ce338eda07
    Image:          <redacted valid image>
    Image ID:       <redacted valid image id>
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Thu, 17 May 2018 15:12:23 -0400
    Ready:          True
    Restart Count:  0
    Limits:

Funny thing: while writing this, the endpoints have now come back up:

kubectl get ep core-reporting
NAME             ENDPOINTS                                        AGE
core-reporting   x.x.x.x:80,x.x.x.x:80,x.x.x.x:80   3m
kubectl get pods | grep core-reporting
core-reporting-599f9584bc-5rrwr                           1/1       Running   0          18h
core-reporting-599f9584bc-755md                           1/1       Running   0          18h
core-reporting-599f9584bc-bphs9                           1/1       Running   0          18h
kubectl describe pod core-reporting-599f9584bc-755md
Name:           core-reporting-599f9584bc-755md
Namespace:      prod
Node:          <redacted>
Start Time:     Thu, 17 May 2018 15:12:18 -0400
Labels:         app=core-reporting
                pod-template-hash=1559514067
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"prod","name":"core-reporting-599f9584bc","uid":"b505667f-5a05-11e8-b25a-0215c9f29...
Status:         Running
IP:             <redacted>
Controlled By:  ReplicaSet/core-reporting-599f9584bc
Containers:
  core-reporting:
    Container ID:   docker://8053351fb7e0ec4823364b011cf7a277eb750317f0edfc60d2c486ce338eda07
    Image:          <redacted>
    Image ID:       <redacted>
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Thu, 17 May 2018 15:12:23 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     600m
      memory:  1536Mi
    Requests:
      cpu:        150m
      memory:     1Gi
    Readiness:    http-get http://:80/health delay=150s timeout=5s period=15s #success=3 #failure=10
    Environment:  <none>
    Mounts:
      <redacted> (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          True
  PodScheduled   True
Volumes:
  default-token-85npg:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  <redacted>
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.alpha.kubernetes.io/notReady:NoExecute for 300s
                 node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

At the moment the endpoints reappeared, this popped into the logs:

root@<redacted>:/var/log# tail -f *.log | grep reporting
I0518 13:28:01.676949       5 wrap.go:42] DELETE /api/v1/namespaces/prod/endpoints/core-reporting: (1.038787ms) 404 [[kube-controller-manager/v1.8.6 (linux/amd64) kubernetes/6260bb0/system:serviceaccount:kube-system:endpoint-controller] 127.0.0.1:58700]
I0518 13:28:01.819749       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.

We have been fighting this problem for the last 3 weeks. We noticed it because some services would stop responding for a while and then suddenly recover after 10-15 minutes. During the outage, running kubectl describe service <service> would show no endpoints.

We are running the following versions on AWS (without EKS):

Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-27T00:13:02Z", GoVersion:"go1.9.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.6", GitCommit:"6260bb08c46c31eea6cb538b34a9ceb3e406689c", GitTreeState:"clean", BuildDate:"2017-12-21T06:23:29Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

In this log excerpt, you can see the k8s service_controller and endpoints_controller obsessing over a core-reporting service that hasn't been touched since yesterday afternoon (the log entries are from this morning).

I0518 09:13:27.658191       5 endpoints_controller.go:375] Error syncing endpoints for service "prod/core-reporting": endpoints "core-reporting" already exists
I0518 09:13:27.707027       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 09:20:02.765175       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 09:25:46.860493       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 09:35:43.871176       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 09:41:30.723506       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 09:51:12.718223       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 09:59:49.729006       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 10:06:09.669459       5 endpoints_controller.go:375] Error syncing endpoints for service "prod/core-reporting": endpoints "core-reporting" already exists
I0518 10:06:09.846275       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 10:15:55.837516       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 10:22:22.727540       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 10:31:47.703310       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 10:38:08.784047       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 10:46:34.720786       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 10:55:02.857563       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 11:02:13.793458       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 11:10:07.722569       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 11:20:06.849455       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 11:27:39.797324       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 11:33:40.724860       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 11:38:53.843699       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 11:47:02.841302       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 11:56:08.802004       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 12:06:07.867587       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 12:14:00.808324       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 12:20:32.693247       5 endpoints_controller.go:375] Error syncing endpoints for service "prod/core-reporting": endpoints "core-reporting" already exists
I0518 12:20:32.802125       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 12:28:37.779667       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.
I0518 12:33:40.746372       5 service_controller.go:323] Not persisting unchanged LoadBalancerStatus for service prod/core-reporting to registry.

For us this is totally unrelated to namespaces, as we created the namespaces at cluster creation time and haven't added or removed any since then. We do, however, use continuous deployment for our services, and services are deployed multiple times a day.

I got a similar issue on 1.9.2 and 1.8.2. I deployed k8s with kubeadm on VMs (Ubuntu 16.04), no cloud provider.

According to the kube-apiserver audit log, the endpoint-controller deleted and recreated endpoints in quick succession, often within the same second.

I suspect the service cache isn't consistent with the endpoints cache at https://github.com/kubernetes/kubernetes/blob/v1.9.2/pkg/controller/endpoint/endpoints_controller.go#L395; maybe commit 2fa93da6d5efd97dbcaad262a9e59073de9c5298 fixed it, but I can't reproduce this issue reliably.

After a “docker restart” of kube-controller-manager, it seems stable now. A “kubectl delete pod” of kube-controller-manager didn't work: the container wasn't recreated, although “kubectl get pod” showed a very young age.
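
For anyone else stuck on an affected version, this is roughly what that restart looks like (a sketch; the container naming depends on how the control plane is deployed, kubeadm static pods in my case):

# find and restart the kube-controller-manager container on the master;
# the kubelet recreates it from the static pod manifest and the fresh controller relists everything
$ docker ps --filter name=kube-controller-manager --format '{{.ID}} {{.Names}}'
$ docker restart <container-id-from-above>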

[screenshot: audit log query output showing paired delete/create events, then a lone delete]

The picture was taken from grep 'Complete.*grafana.*delete' /var/log/kubernetes/kube-audit.log | jq '......'. You can see that “delete & create” often happen within the same second, then suddenly there is a lone successful “delete” with no “create”, and then a failed “delete”.
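
For reference, a hypothetical reconstruction of that kind of query (the exact jq filter is not shown above, and the field names assume the audit.k8s.io/v1beta1 JSON event format):

# pull out delete/create calls against the grafana Endpoints object and when they happened
$ grep 'Complete.*grafana' /var/log/kubernetes/kube-audit.log | jq -c 'select(.objectRef.resource == "endpoints" and .objectRef.name == "grafana" and (.verb == "delete" or .verb == "create")) | {time: .requestReceivedTimestamp, verb: .verb, code: .responseStatus.code}'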