kubernetes: endpoints not showing up for service
Is this a request for help? no.
What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): service endpoints missing
Is this a BUG REPORT or FEATURE REQUEST? (choose one): Bug Report
Kubernetes version (use kubectl version): 1.6.2
Environment:
- Cloud provider or hardware configuration: AWS
- OS (e.g. from /etc/os-release):
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1353.8.0
VERSION_ID=1353.8.0
BUILD_ID=2017-05-30-2322
PRETTY_NAME="Container Linux by CoreOS 1353.8.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
- Kernel (e.g. uname -a): 4.9.24-coreos
- Install tools: manual
- Others:
What happened:
A Service in the kube-system namespace shows no available endpoints.
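The empty Endpoints object can be confirmed directly (name taken from the describe output further down):
$ kubectl -n kube-system get endpoints traefik-ingress-controller
# expected here: the ENDPOINTS column shows <none> even though matching
# pods are Running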
What you expected to happen: For it to see available endpoints
How to reproduce it (as minimally and precisely as possible):
Not sure I can. The problem appeared after the latest remove/replace of worker nodes, but everything was fine before.
Anything else we need to know:
I have a traefik Ingress controller defined and a service for it:
$ kubectl describe service traefik-ingress-controller -n kube-system
Name:              traefik-ingress-controller
Namespace:         kube-system
Labels:            app=traefik-ingress-controller
Annotations:       kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"app":"traefik-ingress-controller"},"name":"traefik-ingress-controller","nam...
Selector:          name=traefik-ingress-lb
Type:              NodePort
IP:                10.200.177.150
Port:              http 80/TCP
NodePort:          http 30080/TCP
Endpoints:
Port:              https 443/TCP
NodePort:          https 30443/TCP
Endpoints:
Session Affinity:  None
Events:            <none>
$ kubectl -n kube-system get pod --selector=name=traefik-ingress-lb
NAME                               READY     STATUS    RESTARTS   AGE
traefik-ingress-controller-8vlms   1/1       Running   0          1h
traefik-ingress-controller-c81cq   1/1       Running   0          1h
traefik-ingress-controller-s51mq   1/1       Running   0          59m
Selectors match, the pods are running just fine, yet the endpoints do not show up.
I suspect that if I delete and recreate the Service, it will probably pick up the endpoints, but I shouldn’t have to do that, and I’m concerned it might occur again; I’d rather actually debug it.
How can I check what Kubernetes is doing to associate the Service with its endpoints? How often does it check? Where?
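For example, checks along these lines (a sketch; pod names and labels for kube-controller-manager depend on the install):
# The endpoints controller runs inside kube-controller-manager and is
# driven by Service/Pod watch events (plus periodic informer resyncs),
# not by a fixed polling interval.
$ kubectl -n kube-system get pods | grep controller-manager
$ kubectl -n kube-system logs <controller-manager-pod> | grep -i endpoint
# Compare the Service selector with the actual pod labels:
$ kubectl -n kube-system get service traefik-ingress-controller -o jsonpath='{.spec.selector}'
$ kubectl -n kube-system get pods --show-labels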
A quick follow-up to let you know we haven’t forgotten about this issue. From what we can tell, this is caused by an Endpoints informer cache that can be out of date when syncService() runs. Essentially, syncService() gets called twice: once for the delete event and once for the create event. There’s a chance that when the second run of syncService() happens, the Endpoints cache has not yet observed the deletion. If that happens, syncService() makes no changes, because the cached Endpoints resource it sees already has all the right ports and addresses set, even though the real object was deleted along with the Service.
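In other words, a sketch of a sequence that can trigger this (assuming an identical Service manifest in a hypothetical traefik-service.yaml):
$ kubectl -n kube-system delete service traefik-ingress-controller
$ kubectl -n kube-system apply -f traefik-service.yaml   # identical spec, applied immediately
$ kubectl -n kube-system get endpoints traefik-ingress-controller
# If the informer cache still held the pre-delete Endpoints when
# syncService() ran for the create event, the Endpoints object is never
# recreated and the last command reports NotFound.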
@swetharepakula is working on a fix here that will mirror what we’ve done for the EndpointSlice controller with the EndpointSliceTracker. This will involve watching Endpoints changes and calling syncService() if the change has not already been accounted for by the controller.
As far as I can tell, this should be a relatively rare bug. It should only happen if the cache is out of date, the recreated Service is identical, and the Pods have not changed at all. Given the relative complexity of the fix here, I’m not sure how far we’ll be able to backport this. At the very least, we’re hoping to have a fix in for 1.19.
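Until a fix lands, a possible workaround sketch (untested; the annotation key is arbitrary): any update to the Service re-queues syncService(), and by the time it runs again the informer cache has normally caught up, so the Endpoints object gets recreated without deleting the Service.
$ kubectl -n kube-system annotate service traefik-ingress-controller \
      resync-nudge="$(date +%s)" --overwrite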
Got it. I realized that it’s not only the selector that has to match the pod labels; the port names have to match between the pod definition and the Service definition as well. The latter was the culprit in my case.
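A minimal sketch of that constraint (hypothetical pod/Service pair; when targetPort is a name, it must resolve to a named containerPort on the pod, otherwise the Endpoints object gets no addresses for that port):
$ kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: traefik-demo
  labels:
    name: traefik-ingress-lb
spec:
  containers:
  - name: traefik
    image: traefik:1.7
    ports:
    - name: http              # this port name...
      containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: traefik-demo
spec:
  selector:
    name: traefik-ingress-lb
  ports:
  - name: http
    port: 80
    targetPort: http          # ...must match the containerPort name above
EOF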
@githubvick Your service is using name=nginx-ingress while your pods are using app=nginx-ingress.
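A quick way to surface this kind of mismatch (Service name nginx-ingress assumed for illustration):
$ kubectl get service nginx-ingress -o jsonpath='{.spec.selector}'
$ kubectl get pods --show-labels | grep nginx-ingress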
I think @swetharepakula’s PR is almost ready; sorry for not catching the lifecycle close earlier.
@sebgl Thanks for the detailed bug report! I’ll try to recreate this scenario with some new tests.
/assign