istio: Missing endpoints when large number of pods is created at once
Bug Description
Hello everyone, I noticed some strange behavior in Istio 1.17.1. It seems that if I start a lot of pods at the same time (in my testing, roughly 100-250), a race condition causes some of those pods not to be counted as valid endpoints by the controller. This greatly affects connectivity, as the proxies cannot reach some workloads and fail with “no healthy upstream”.
So far, I’m able to reproduce the issue on a dual-stack (DS) cluster by following these steps:
- Deploy the service mesh (SM). I use the new dual-stack feature for this.
- Deploy 250 instances of httpbin in quick succession
- Wait a few minutes for all instances to be ready
- Verify that instance number 0 can communicate with the remaining 249. Usually there are no problems right after installation, but in my testing the issue can appear even at this step (although rarely). When it happens, curl requests targeting specific httpbin instances from other instances acting as clients fail because the endpoints are not present.
- If it has not failed in the previous step, I usually delete all httpbin pod instances in one go using "kubectl delete pod -n istio-system httpbin1-… httpbin2-…"
- Wait a few minutes for all the new pods to be ready again
- Verify again that instance number 0 can communicate with the remaining 249. This is usually when the problem shows up: some curl requests to specific httpbin instances fail due to no healthy endpoints. If I check the config dump of the httpbin pod issuing the curl requests, I can see that the endpoints for one or more httpbin instances are completely missing.
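The verification and mass-restart steps above can be sketched as a small shell script. This is only an illustrative sketch, not the script mentioned in this report: the function names, the `RUN_REPRO` guard, the `cluster.local` domain, the `/status/200` path, and the assumption that `curl` is available in the `httpbin` container are all assumptions. The namespace and port are taken from the `kubectl delete` step and the Service definition.

```shell
#!/bin/sh
# Sketch of the verify / mass-restart / verify loop. All names are hypothetical.

NS="${NS:-istio-system}"   # namespace from the kubectl delete step
COUNT="${COUNT:-250}"      # number of httpbin instances
PORT="${PORT:-8000}"       # Service port from the httpbin definition

# Print httpbin0 .. httpbin<count-1>, one name per line.
httpbin_names() {
  i=0
  while [ "$i" -lt "$1" ]; do
    echo "httpbin${i}"
    i=$((i + 1))
  done
}

# Build the in-cluster URL for one instance (assumes the cluster.local domain).
target_url() {
  echo "http://$1.${NS}.svc.cluster.local:${PORT}/status/200"
}

# From httpbin0, curl every other instance and report the ones that fail.
# Assumes curl is available in the httpbin container.
check_from_first() {
  httpbin_names "$COUNT" | tail -n +2 | while read -r name; do
    kubectl exec -n "$NS" deploy/httpbin0 -c httpbin -- \
      curl -sf -o /dev/null --max-time 5 "$(target_url "$name")" </dev/null \
      || echo "FAILED: $name"
  done
}

# Delete all httpbin pods in one go so that their replacements start together.
restart_all() {
  kubectl get pods -n "$NS" -o name | grep '^pod/httpbin' \
    | xargs kubectl delete -n "$NS"
  kubectl wait -n "$NS" --for=condition=Ready pod --all --timeout=600s
}

# Only touch the cluster when explicitly requested.
if [ "${RUN_REPRO:-0}" = "1" ]; then
  check_from_first
  restart_all
  check_from_first
fi
```

Running it with `RUN_REPRO=1` performs one check, one mass restart, and a second check; any `FAILED:` line names an instance that httpbin0 could not reach.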
Some extra info:
- I can easily reproduce the issue on my DS cluster. I cannot reproduce it on my IPv4 k8s cluster (maybe I’m unlucky with timing, or the issue is related to DS itself).
- No discovery selectors are used in my deployment
- I use an httpbin service with spec.ipFamilies: [IPv6]. So far I have not managed to reproduce the issue with spec.ipFamilies: [IPv4].
- Restarting the controller fixes the issue
- Restarting only the pod whose endpoints are missing (the target of the curl request) also fixes the issue, as the other proxies then receive the missing endpoint entries
- I use a custom build of Istio 1.17.1
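To check whether a client proxy actually holds endpoints for a given target, and to apply the controller-restart workaround, something along these lines may help. It is a hedged sketch: the function names are hypothetical, the cluster name follows the usual Istio `outbound|<port>||<host>` pattern with an assumed `cluster.local` domain, and `istiod` is assumed to be the control plane deployment name (a custom build may use a different one).

```shell
#!/bin/sh
# Sketch for inspecting endpoints and restarting the controller. Names are hypothetical.

NS="${NS:-istio-system}"
PORT="${PORT:-8000}"

# Envoy cluster name for a Service: outbound|<port>||<service>.<namespace>.svc.cluster.local
outbound_cluster() {
  echo "outbound|${PORT}||$1.${NS}.svc.cluster.local"
}

# List the endpoints that the proxy of client pod $1 holds for service $2.
# An empty result would match the "no healthy upstream" symptom.
show_endpoints() {
  istioctl proxy-config endpoints "$1" -n "$NS" \
    --cluster "$(outbound_cluster "$2")"
}

# Workaround from the list above: restart the controller.
# Assumes the deployment is named "istiod".
restart_control_plane() {
  kubectl rollout restart deployment/istiod -n "$NS"
  kubectl rollout status deployment/istiod -n "$NS" --timeout=300s
}
```

For example, `show_endpoints <client-pod-name> httpbin98` would list what the client proxy knows about httpbin98, where `<client-pod-name>` is the full name of the pod issuing the curl requests.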
I’m currently working on a bash script to reproduce the issue using istioctl, but I’m having some difficulties at the moment. Let me know if you need debug logs. I can also add some custom log messages if you tell me where you want them, and then reproduce the issue again.
Version
client version: 1.17.1
control plane version: 1.17-dev
data plane version: 1.17.1 (250 proxies)
Additional Information
Here is the output of istioctl bug-report:
- bug-report where httpbin0 (and other httpbin instances too) cannot directly communicate with httpbin98 after installation due to lack of endpoints: https://www.dropbox.com/s/v84y046yuc3c351/bug-report-missing-endpoints.zip?dl=0
- bug-report after controller has been restarted: https://www.dropbox.com/s/5gibejsq3oxr8d1/bug-report-after-reboot.zip?dl=0
Here is the definition of the httpbin deployments ({number} is a number between 0-250) with spec.ipFamilies defined:
apiVersion: v1
kind: Service
metadata:
  name: httpbin{number}
  labels:
    app: httpbin{number}
spec:
  ipFamilies: ["IPv6"]
  ports:
  - name: http
    port: 8000
    targetPort: 80
  selector:
    app: httpbin{number}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: httpbin{number}
spec:
  replicas: 1
  selector:
    matchLabels:
      app: httpbin{number}
      version: v1
  template:
    metadata:
      labels:
        app: httpbin{number}
        version: v1
      annotations:
        sidecar.istio.io/inject: "true"
    spec:
      imagePullSecrets:
      - name: armdocker
      containers:
      - image: docker.io/kong/httpbin
        imagePullPolicy: Always
        name: httpbin
        ports:
        - containerPort: 80
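One way to turn this template into the 250 manifests is to substitute the {number} placeholder with sed. The sketch below assumes the definition above is saved as a file (the name `httpbin-template.yaml` is hypothetical), and the function names are illustrative only.

```shell
#!/bin/sh
# Sketch: render the {number} template once per instance. Names are hypothetical.

# Render one manifest: $1 = template file, $2 = instance number.
render_manifest() {
  sed "s/{number}/$2/g" "$1"
}

# Render instances 0 .. count-1 as one multi-document YAML stream:
# $1 = template file, $2 = instance count.
render_all() {
  i=0
  while [ "$i" -lt "$2" ]; do
    render_manifest "$1" "$i"
    echo "---"
    i=$((i + 1))
  done
}
```

A possible usage is `render_all httpbin-template.yaml 250 | kubectl apply -n istio-system -f -`, which submits all instances in a single call so that the pods start at roughly the same time.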
About this issue
- State: open
- Created a year ago
- Comments: 15 (14 by maintainers)
Commits related to this issue
- fixed issue #44043 — committed to zhlsunshine/istio by zhlsunshine a year ago
Found it! will post details soon