kubernetes: Endpoints object unavailable for a long time (almost 2h) for a correctly defined Service
Is this a BUG REPORT or FEATURE REQUEST?: /kind bug
What happened: An Endpoints object went missing for a Service that had 1 matching Pod running. The Endpoints came back after 1h45m.
What you expected to happen: For the Endpoints object to stay in place as long as the Service and its matching Pod exist.
How to reproduce it (as minimally and precisely as possible): I have no idea; I don't even know why it fixed itself.
Anything else we need to know?:
Let me show some details.
We're running a service that uses providers to watch for Endpoints changes. For one Service, it logged the following event:
time="2018-01-05T11:01:33Z" level=info msg="Endpoints delete event: mongo-32-2"
The Deployment was running just 1 pod, so this had to be the last one.
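For context, the watcher that produced the event above is roughly equivalent to the following client-go sketch (a minimal sketch assuming client-go informers; the handler bodies and log text are illustrative, not our actual service):

```go
// A minimal sketch of an Endpoints watcher built on client-go informers;
// handler bodies and log text are illustrative, not our actual code.
package main

import (
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig the same way kubectl does (in-cluster config would
	// also work; this is just the simplest standalone setup).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Shared informer on Endpoints with a periodic resync.
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	epInformer := factory.Core().V1().Endpoints().Informer()

	epInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if ep, ok := obj.(*corev1.Endpoints); ok {
				log.Printf("New endpoints created event: %s", ep.Name)
			}
		},
		DeleteFunc: func(obj interface{}) {
			// Deletes may arrive as DeletedFinalStateUnknown tombstones,
			// hence the type check.
			if ep, ok := obj.(*corev1.Endpoints); ok {
				log.Printf("Endpoints delete event: %s", ep.Name)
			}
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	select {}
}
```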
I ran the following check to see what’s going on:
```
$ kubectl -n placeable-qa get svc/mongo-32-2 -o yaml
apiVersion: v1
kind: Service
...
  name: mongo-32-2
  namespace: placeable-qa
...
spec:
  clusterIP: 10.111.196.30
  ports:
  - port: 27017
    protocol: TCP
    targetPort: 27017
  selector:
    app: mongo-32-2
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
```
So, the Service is defined and has selector `app=mongo-32-2`. Now, the Deployment:
```
$ kubectl -n placeable-qa get deploy/mongo-32-2 -o yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  ...
  name: mongo-32-2
  namespace: placeable-qa
  resourceVersion: "5884608"
  ...
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: mongo-32-2
  ...
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: 2018-01-05T03:02:44Z
    lastUpdateTime: 2018-01-05T03:02:44Z
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: 2018-01-05T02:59:49Z
    lastUpdateTime: 2018-01-05T03:02:44Z
    message: ReplicaSet "mongo-32-2-855f6f7547" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
```
and the Pods:
```
$ kubectl -n placeable-qa get po -l app=mongo-32-2
NAME                          READY   STATUS    RESTARTS   AGE
mongo-32-2-855f6f7547-sn4ph   1/1     Running   0          9h
```
yet for the Endpoints:
```
$ kubectl -n placeable-qa get ep mongo-32-2
Error from server (NotFound): endpoints "mongo-32-2" not found
```
Surprisingly, the Endpoints object came back after 1h45m! This is the log from our service:
`time="2018-01-05T11:01:33Z" level=info msg="Endpoints delete event: mongo-32-2"`
...
`time="2018-01-05T12:45:10Z" level=info msg="New endpoints created event: mongo-32-2"`
Environment:
- Kubernetes version (use `kubectl version`):
```
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.5", GitCommit:"cce11c6a185279d037023e02ac5249e14daa22bf", GitTreeState:"clean", BuildDate:"2017-12-07T16:16:03Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"8+", GitVersion:"v1.8.5+coreos.0", GitCommit:"b8e596026feda7b97f4337b115d1a9a250afa8ac", GitTreeState:"clean", BuildDate:"2017-12-12T11:01:08Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
```
- Cloud provider or hardware configuration: AWS x1.32xlarge
- OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="16.04.3 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.3 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
- Kernel (e.g. `uname -a`): Linux central-ansible 4.13.0-21-generic #24~16.04.1-Ubuntu SMP Mon Dec 18 19:39:31 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
- Install tools: kubespray + custom
Commits related to this issue
- avoid wrongly replaying old events in kube-controller-manager: the resourceVersion in the event stream doesn't increase monotonically; it's the version of the related resource itself, not a sequence number of the... (committed to Dieken/kubernetes by Dieken 6 years ago)
- Upgrade to 1.8.13 to fix bug https://github.com/kubernetes/kubernetes/issues/57897 (committed to foxchenlei/docker-library by foxchenlei 6 years ago)
this is a duplicate of #58545, fixed in #58547
delete watch events now have the current etcd index, not the last index of the resource
this is fixed in v1.8.8+, v1.9.3+, and v1.10.0+
/close
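To make the mechanics concrete, here is a schematic miniature of that replay problem (toy types and a toy event history, not the real apiserver watch cache or reflector code): before the fix, a delete event was stamped with the object's old resourceVersion rather than the revision of the deletion, so a client resuming a watch from the resourceVersion of the last event it handled could resume too far back and have old events, including the delete, replayed to it.

```go
// Schematic miniature of the pre-fix replay behaviour; toy types and data,
// not the real watch cache or reflector implementation.
package main

import "fmt"

type event struct {
	typ       string // ADDED, MODIFIED, DELETED
	name      string
	revision  int // revision at which the change actually happened
	stampedRV int // resourceVersion carried on the event
}

// history is a toy event log ordered by revision. The DELETED entry models
// the bug: the deletion happened at revision 150 but was stamped with the
// object's last resourceVersion, 100, instead of 150.
var history = []event{
	{"ADDED", "mongo-32-2", 100, 100},
	{"DELETED", "mongo-32-2", 150, 100},
	{"ADDED", "mongo-32-2", 200, 200}, // object recreated later
}

// watchFrom delivers every event that happened after the given revision,
// the way a resumed watch does.
func watchFrom(rev int) []event {
	var out []event
	for _, e := range history {
		if e.revision > rev {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	// Suppose the client's watch breaks right after it handled the
	// mis-stamped DELETED event. It resumes from that event's stamped
	// resourceVersion (100) instead of the true revision (150), so the old
	// delete and everything after it get delivered again.
	lastHandled := history[1]
	for _, e := range watchFrom(lastHandled.stampedRV) {
		fmt.Printf("delivered on resume: %s %s (revision %d)\n", e.typ, e.name, e.revision)
	}
	// With the fix the delete is stamped 150, the resume point is correct,
	// and only genuinely newer events are delivered.
}
```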
Within the logs we also notice some services that were deleted several days ago which the service_controller and endpoint_controller still seem to be obsessing over, although they don't exist anymore. One such service is prod/private-vehicle-inventory-core-ingress-nginx.
It seems that all services for which we have this problem generate this message in the logs:
Since the situation is occurring right now, here is a little more information.
Funny thing: while writing this, the service has now come back up:
At the moment the endpoints reappeared, this popped into the logs:
We have been fighting this problem for the last 3 weeks. We noticed it because some services would stop responding for a while and suddenly come back after 10-15 minutes. During the outage, running `kubectl describe service <service>` would show no endpoints.
We are running the following versions on AWS (without EKS):
In this log excerpt, you can see the k8s service_controller and endpoints_controller obsessing over a core-reporting service that hasn't been touched since yesterday afternoon. (The log entries are from this morning.)
For us this is totally unrelated to the namespaces, as we created the namespaces at cluster creation time and haven't added or removed any since then. We do, however, use continuous deployment for our services, and services are deployed multiple times a day.
I got a similar issue on 1.9.2 and 1.8.2. I deployed k8s with kubeadm on VMs, Ubuntu 16.04, no cloud provider.
According to the kube-apiserver audit log, I found that the endpoint-controller deleted and recreated endpoints in quick succession, often within the same second.
I suspect the service cache isn't consistent with the endpoints cache at https://github.com/kubernetes/kubernetes/blob/v1.9.2/pkg/controller/endpoint/endpoints_controller.go#L395; maybe commit 2fa93da6d5efd97dbcaad262a9e59073de9c5298 fixed it, but I can't reproduce this issue reliably.
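To illustrate the suspected inconsistency, here is a hedged, self-contained sketch of how a stale local service cache can wipe the Endpoints (toy stand-in types, not the actual endpoints_controller.go or client-go code):

```go
// Hedged sketch of the sync behaviour this points at: when the controller's
// local service cache (fed by watch events) says the Service is gone, the
// Endpoints object gets deleted, even if the Service still exists in etcd.
// Types here are stand-ins, not the real controller code.
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("not found")

// serviceCache stands in for the controller's informer-backed service lister.
type serviceCache map[string]bool

func (c serviceCache) Get(name string) error {
	if !c[name] {
		return errNotFound
	}
	return nil
}

// syncService mirrors the shape of the endpoints controller's sync loop:
// a cache miss for the Service leads to deleting its Endpoints.
func syncService(c serviceCache, name string) {
	if err := c.Get(name); errors.Is(err, errNotFound) {
		fmt.Printf("service %q not in local cache: deleting endpoints %q\n", name, name)
		return
	}
	fmt.Printf("service %q found: recomputing endpoints\n", name)
}

func main() {
	c := serviceCache{"mongo-32-2": true}

	// Normal case: the Service is in the cache, endpoints are maintained.
	syncService(c, "mongo-32-2")

	// A stale or replayed delete watch event removes the Service from the
	// local cache only; etcd still has it. The next sync then deletes the
	// Endpoints object, which matches what this issue observed.
	delete(c, "mongo-32-2")
	syncService(c, "mongo-32-2")
}
```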
After a `docker restart` of kube-controller-manager, it seems stable now. `kubectl delete pod` on kube-controller-manager didn't work: the container wasn't recreated, although `kubectl get pod` showed a very young age.
The picture was taken from `grep 'Complete.*grafana.*delete' /var/log/kubernetes/kube-audit.log | jq '......'`. You can see that "delete & create" often happen within the same second, then suddenly a sole successful "delete" with no "create", and then a failed "delete".