istio: Unexplained telemetry involving passthrough and unknown
We have a demo app called “travel agency” that, when run against Istio 1.6, generates the expected telemetry but also some unexpected telemetry. The initial telemetry looks good and produces the expected Kiali graph. But quickly we see an unexpected TCP edge leading to PassthroughCluster, and then another from Unknown to a destination service. After a few minutes we eventually see several of these additional TCP edges leading to PassthroughCluster and coming from Unknown. It looks something like an intermittent leak of internal traffic. Here is a short video (using Kiali replay) that shows the issue. At the very beginning you see the expected, all-green, all-HTTP traffic. Quickly some of the unexpected (blue) TCP telemetry appears, and as I skip forward and advance the frames the remaining edges show up:
- [ ] Configuration Infrastructure
- [ ] Docs
- [ ] Installation
- [x] Networking
- [ ] Performance and Scalability
- [x] Policies and Telemetry
- [ ] Security
- [ ] Test and Release
- [ ] User Experience
- [ ] Developer Infrastructure
Expected behavior The TCP edges to PassthroughCluster, and from unknown, should not show up, which means that Istio should not generate the underlying Prometheus time-series.
Steps to reproduce the bug The travel-agency app is found here: https://github.com/lucasponce/travel-comparison-demo
There is a script to install the app here: https://github.com/jmazzitelli/test/tree/master/deploy-travel-agency
This will install travel agency on minikube:
$ CLIENT_EXE=minikube bash <(curl -L https://raw.githubusercontent.com/jmazzitelli/test/master/deploy-travel-agency/deploy-travel-agency-demo.sh)
Version (include the output of `istioctl version --remote` and `kubectl version` and `helm version` if you used Helm)
This has been recreated on both 1.6.0 and the 1.6.1 pre-release, using the default (v2) telemetry.
How was Istio installed? istioctl
Environment where bug was observed (cloud vendor, OS, etc) This has been recreated on Minikube and OpenShift, both on bare metal and AWS.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 74 (45 by maintainers)
Commits related to this issue
- Increase protocolDetectionTimeout to solve unneeded PASSTHROUGH telemetries See: https://github.com/istio/istio/issues/24379 — committed to banzaicloud/istio-operator by Laci21 4 years ago
After increasing the outbound proto-sniffing timeout, the unknown edge disappeared. Could you try to set

`--set meshConfig.protocolDetectionTimeout=1s`

during installation and see if it fixes your problem too? We might need to consider increasing the timeout of outbound listener sniffing.

Looks like no fix is coming in 1.7, so if you are affected, part of your telemetry will be reported incorrectly, coming from `unknown` or going to `PassthroughCluster`. To improve the Kiali graph I can only recommend disabling proto-sniffing completely if your app doesn’t need it [1], or hiding the unwanted traffic by entering `node=unknown OR service^=Pass` in the Kiali graph hide.

[1] Disable proto-sniffing by setting `values.pilot.enableProtocolSniffingForInbound=false` and `values.pilot.enableProtocolSniffingForOutbound=false`.
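For reference, the two workarounds above can also be expressed as an `IstioOperator` overlay instead of `--set` flags. This is only a sketch assembled from the settings named in this thread; in practice you would pick one approach or the other, since the timeout is irrelevant once sniffing is disabled:

```yaml
# Sketch: IstioOperator overlay for the workarounds discussed above.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    # Option 1: give protocol detection more time before
    # falling back to raw TCP (PassthroughCluster telemetry).
    protocolDetectionTimeout: 1s
  values:
    pilot:
      # Option 2: disable proto-sniffing entirely, if the app
      # declares its protocols and does not need it.
      enableProtocolSniffingForInbound: "false"
      enableProtocolSniffingForOutbound: "false"
```

Apply it with `istioctl install -f <file>` as you would any operator overlay.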
I’m not sure if @howardjohn has any other recommendation, I suggest pushing on https://github.com/istio/istio/issues/24998 to be fixed ASAP.
@FL3SH, your graph in particular is pretty wild. I don’t think I’ve seen 2 PassthroughCluster nodes before; I’m not sure how that happens.
@FL3SH yes, for TCP connections the graph will be disconnected if MTLS is not enabled. For HTTP request, the graph would still be connected even without MTLS, since we use headers to exchange workload metadata between source and destination.
I was able to clean up my graph quite a bit. ![Screenshot 2020-08-06 at 16 32 14](https://user-images.githubusercontent.com/9084725/89579282-cba8f200-d833-11ea-9b65-d9f35cbe93c3.png)

- `tcp-` or `http-` port-name prefixes for protocol selection
- `version` labels for pods, services, and sts to remove Kiali warnings
- `app` labels for pods, services, and sts for Kiali to group them together

Thanks @naphta, I have not had the time to try that change from 5s to 6s. It seems the timeout approach may not be a sufficient fix, but I’m not familiar with the underlying code/issue. I continue to use the Kiali graph-hide expression `node=unknown OR service^=Pass` to clean up the graph, at the expense of seeing the correct traffic totals.
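The port-name prefixes mentioned above are Istio’s explicit protocol selection convention: naming a Service port `http-...` (or `tcp-...`) tells the sidecar the protocol up front, so no sniffing is needed. A sketch of such a Service follows; the names and port numbers are illustrative, not taken from the demo app:

```yaml
# Sketch: explicit protocol selection via port naming (names are illustrative).
apiVersion: v1
kind: Service
metadata:
  name: travels
  labels:
    app: travels          # `app` label so Kiali can group workloads
    version: v1           # `version` label to avoid Kiali warnings
spec:
  selector:
    app: travels
  ports:
  - name: http-web        # `http-` prefix declares the protocol explicitly
    port: 8000
    targetPort: 8000
```

With every port declared this way, the proxies never have to guess the protocol, which is what makes disabling sniffing safe.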
That is great sleuthing.
@lambdai @PiotrSikora Can this be disabled when we know the other side is HTTP? Or is it now on by default and unchangeable?