kubernetes: endpoints not showing up for service

Is this a request for help? No.

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): service endpoints missing


Is this a BUG REPORT or FEATURE REQUEST? (choose one): Bug Report

Kubernetes version (use kubectl version): 1.6.2

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release):
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1353.8.0
VERSION_ID=1353.8.0
BUILD_ID=2017-05-30-2322
PRETTY_NAME="Container Linux by CoreOS 1353.8.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
  • Kernel (e.g. uname -a): 4.9.24-coreos
  • Install tools: manual
  • Others:

What happened:

A Service in the kube-system namespace shows no available endpoints.

What you expected to happen: The Service to show its available endpoints.

How to reproduce it (as minimally and precisely as possible):

Not sure I can. It appeared after the latest remove/replace of worker nodes; everything was fine before that.

Anything else we need to know:

I have a traefik Ingress controller defined and a service for it:

$ kubectl describe service traefik-ingress-controller -n kube-system
Name:			traefik-ingress-controller
Namespace:		kube-system
Labels:			app=traefik-ingress-controller
Annotations:		kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"app":"traefik-ingress-controller"},"name":"traefik-ingress-controller","nam...
Selector:		name=traefik-ingress-lb
Type:			NodePort
IP:			10.200.177.150
Port:			http	80/TCP
NodePort:		http	30080/TCP
Endpoints:
Port:			https	443/TCP
NodePort:		https	30443/TCP
Endpoints:
Session Affinity:	None
Events:			<none>

$ kubectl -n kube-system get pod --selector=name=traefik-ingress-lb
NAME                               READY     STATUS    RESTARTS   AGE
traefik-ingress-controller-8vlms   1/1       Running   0          1h
traefik-ingress-controller-c81cq   1/1       Running   0          1h
traefik-ingress-controller-s51mq   1/1       Running   0          59m

Selectors match, the pods are running just fine, yet the endpoints do not show up.

I suspect that if I delete and recreate the Service it will probably pick the endpoints up, but I shouldn’t have to, and I’m concerned it might happen again. I’d rather actually debug it.

How can I check what Kubernetes is doing to associate endpoints with this Service? How often does it check? Where?
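For reference, this association is maintained by the endpoints controller inside kube-controller-manager, which watches Services and Pods and writes a matching Endpoints object in response to changes rather than polling on a fixed interval. A few kubectl checks that should narrow down what it is seeing (object names are taken from the output above; the last two commands assume kube-controller-manager runs as a pod in kube-system, which may not hold on a manual install where it could be a systemd unit to inspect via journalctl instead):

# Does the Service selector really agree with the pod labels?
$ kubectl -n kube-system get service traefik-ingress-controller -o jsonpath='{.spec.selector}'
$ kubectl -n kube-system get pods --selector=name=traefik-ingress-lb --show-labels

# What has the endpoints controller actually written for this Service?
$ kubectl -n kube-system get endpoints traefik-ingress-controller -o yaml

# The controller's own view shows up in the kube-controller-manager logs.
$ kubectl -n kube-system get pods | grep controller-manager
$ kubectl -n kube-system logs kube-controller-manager-<node-name>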

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 4
  • Comments: 35 (20 by maintainers)

Most upvoted comments

A quick follow-up to let you know we haven’t forgotten about this issue. From what we can tell, this is caused by an Endpoints informer cache that can be out of date when syncService() runs. Essentially, syncService() is called twice: once for the delete event and once for the create event. There’s a chance that when the second run of syncService() happens, the Endpoints cache has not yet caught up with the changes from the first run. If that happens, syncService() makes no changes, because it sees an Endpoints resource that already has all the right ports and addresses set.

@swetharepakula is working on a fix here that will mirror what we’ve done for the EndpointSlice controller with the EndpointSliceTracker. This will involve watching Endpoints changes and calling syncService() if the change has not already been accounted for by the controller.

As far as I can tell, this should be a relatively rare bug. It should only happen if the cache is out of date, the recreated Service is identical, and the Pods have not changed at all. Given the relative complexity of the fix here, I’m not sure how far we’ll be able to backport this. At the very least, we’re hoping to have a fix in for 1.19.
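For anyone trying to reproduce the scenario described above, a rough sketch of the sequence (the race is timing-dependent, so this may take many attempts or never trigger; traefik-service.yaml is a placeholder for whatever manifest originally defined the Service):

# Delete the Service and immediately recreate it identically, leaving the pods untouched.
$ kubectl -n kube-system delete service traefik-ingress-controller && kubectl -n kube-system apply -f traefik-service.yaml

# If the stale-cache path is hit, the recreated Service's Endpoints stay empty
# even though the matching pods are still Running and unchanged.
$ kubectl -n kube-system get endpoints traefik-ingress-controller
$ kubectl -n kube-system get pods --selector=name=traefik-ingress-lb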

Got it. I realized that it’s not only the selector that has to match the pod labels; the port names in the pod definition and the Service definition have to match as well. The latter was the culprit in my case.
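A quick way to compare the two, reusing the traefik objects from this issue (when the Service’s targetPort is a name rather than a number, that name has to exist as a containerPort name on the selected pods):

# Named targetPorts declared by the Service...
$ kubectl -n kube-system get service traefik-ingress-controller -o jsonpath='{.spec.ports[*].targetPort}'
# ...versus the containerPort names actually exposed by the selected pods.
$ kubectl -n kube-system get pods --selector=name=traefik-ingress-lb -o jsonpath='{.items[*].spec.containers[*].ports[*].name}'

If the first command prints a name the second does not, the endpoints controller cannot resolve that port on the pods and leaves those addresses out of the Endpoints object, which matches the symptom described here.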

@githubvick Your service is using name=nginx-ingress while your pods are using app=nginx-ingress.

I think @swetharepakula’s PR is almost ready; sorry for not catching the lifecycle close earlier.

@sebgl Thanks for the detailed bug report! I’ll try to recreate this scenario with some new tests.

/assign