istio: pilot does not see all pods in service

Bug description: We have a service with two pods. Sometimes the Istio ingressgateway sees only one of them, even though both pods work fine. Both endpoints are present in the service:

kubectl describe svc -n pfphome-prod pfphome-pfp-service
Name:              pfphome-pfp-service
Namespace:         pfphome-prod
Selector:          app.kubernetes.io/name=pfphome,app.pfp.dev/envaronment=prod,app.pfp.dev/release=pfphome-master
Type:              ClusterIP
IP:                10.103.168.183
Port:              http  80/TCP
TargetPort:        http/TCP
Endpoints:         10.103.68.4:3000,10.103.72.2:3000
Session Affinity:  None
Events:            <none>
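
As a cross-check, the endpoints the gateway's Envoy actually received can be listed with istioctl; a sketch, where the ingressgateway pod name is a placeholder for the real one:

istioctl proxy-config endpoint istio-ingressgateway-<pod-id>.istio-system --cluster "outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local"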

But the Envoy cluster config on the gateway contains only one endpoint:

outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::default_priority::max_connections::4294967295
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::default_priority::max_pending_requests::4294967295
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::default_priority::max_requests::4294967295
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::default_priority::max_retries::3
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::high_priority::max_connections::1024
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::high_priority::max_pending_requests::1024
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::high_priority::max_requests::1024
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::high_priority::max_retries::3
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::added_via_api::true
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::cx_active::1
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::cx_connect_fail::0
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::cx_total::5528
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::rq_active::0
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::rq_error::0
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::rq_success::15124
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::rq_timeout::0
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::rq_total::15137
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::hostname::
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::health_flags::healthy
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::weight::1
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::region::europe-north1
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::zone::europe-north1-a
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::sub_zone::
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::canary::false
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::priority::0
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::success_rate::-1
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::local_origin_success_rate::-1
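
For reference, a dump like the one above can be pulled from the gateway's Envoy admin interface; a sketch, assuming the default admin port 15000, a placeholder pod name, and curl being available in the proxy image:

kubectl exec -n istio-system istio-ingressgateway-<pod-id> -- curl -s localhost:15000/clusters | grep pfphome-pfp-service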

If I recreate the pilot pod:

kubectl delete po -n istio-system istio-pilot-69556cbcd6-vx4sz
pod "istio-pilot-69556cbcd6-vx4sz" deleted
kubectl logs -n istio-system istio-pilot-69556cbcd6-btqmk discovery | grep pfphome-prod
2020-01-10T09:43:58.740662Z	info	Handling event add for pod pfphome-pfp-app-86c8c87458-wvgm7 in namespace pfphome-prod -> 10.103.68.4
2020-01-10T09:43:58.741579Z	info	Handling event add for pod pfphome-pfp-app-86c8c87458-m9vtn in namespace pfphome-prod -> 10.103.72.2
2020-01-10T09:43:58.742078Z	info	Handle EDS endpoint pfphome-pfp-service in namespace pfphome-prod -> [10.103.68.4 10.103.72.2]
2020-01-10T09:43:58.742099Z	info	ads	Full push, new service pfphome-pfp-service.pfphome-prod.svc.cluster.local
2020-01-10T09:43:58.742107Z	info	ads	Endpoint updating service account spiffe://cluster.local/ns/pfphome-prod/sa/default pfphome-pfp-service.pfphome-prod.svc.cluster.local
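
Pilot's own view of the endpoints can also be checked through its debug interface, which helps distinguish "pilot never saw the endpoint" from "pilot saw it but never pushed it"; a sketch, assuming the default debug port 8080 on the discovery container:

kubectl port-forward -n istio-system deploy/istio-pilot 8080:8080 &
curl -s localhost:8080/debug/endpointz | grep pfphome-pfp-service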

After the restart, the ingressgateway receives the correct config:

outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::default_priority::max_connections::4294967295
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::default_priority::max_pending_requests::4294967295
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::default_priority::max_requests::4294967295
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::default_priority::max_retries::3
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::high_priority::max_connections::1024
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::high_priority::max_pending_requests::1024
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::high_priority::max_requests::1024
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::high_priority::max_retries::3
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::added_via_api::true
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::cx_active::2
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::cx_connect_fail::0
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::cx_total::5679
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::rq_active::0
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::rq_error::0
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::rq_success::16796
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::rq_timeout::0
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::rq_total::16814
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::hostname::
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::health_flags::healthy
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::weight::1
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::region::europe-north1
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::zone::europe-north1-a
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::sub_zone::
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::canary::false
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::priority::0
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::success_rate::-1
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.68.4:3000::local_origin_success_rate::-1
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.72.2:3000::cx_active::2
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.72.2:3000::cx_connect_fail::0
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.72.2:3000::cx_total::10
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.72.2:3000::rq_active::1
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.72.2:3000::rq_error::0
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.72.2:3000::rq_success::62
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.72.2:3000::rq_timeout::0
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.72.2:3000::rq_total::63
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.72.2:3000::hostname::
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.72.2:3000::health_flags::healthy
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.72.2:3000::weight::1
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.72.2:3000::region::europe-north1
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.72.2:3000::zone::europe-north1-a
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.72.2:3000::sub_zone::
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.72.2:3000::canary::false
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.72.2:3000::priority::0
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.72.2:3000::success_rate::-1
outbound|80||pfphome-pfp-service.pfphome-prod.svc.cluster.local::10.103.72.2:3000::local_origin_success_rate::-1

and traffic goes to both pods, as the per-endpoint request counters above confirm.

Expected behavior: pilot should push all service endpoints to the ingressgateway without needing a restart.

Steps to reproduce the bug

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)

istioctl version --remote
client version: 1.4.0
control plane version: 1.4.2
data plane version: 1.4.2 (4 proxies)

How was Istio installed?

Environment where bug was observed (cloud vendor, OS, etc): GKE

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 18
  • Comments: 36 (19 by maintainers)

Most upvoted comments

Yes! Release 1.4.5 solved it! There hasn't been a single "Endpoint without pod" entry in the logs for the last 3 days, and pilot is using all endpoints in all services. Thanks!
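
For anyone still on an affected version, the symptom can be spotted by grepping the pilot logs for the message quoted above; a sketch, assuming a default installation with the discovery container:

kubectl logs -n istio-system deploy/istio-pilot -c discovery | grep "Endpoint without pod"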