istio: Unexplained telemetry involving passthrough and unknown

We have a demo app called “travel agency” that, when run against Istio 1.6, generates the expected telemetry but also some unexpected telemetry. The initial telemetry looks good and produces the expected Kiali graph. But we quickly see an unexpected TCP edge leading to PassthroughCluster, and then another from Unknown to a destination service. After a few minutes we eventually see additional TCP edges leading to PassthroughCluster and from Unknown. It looks like an intermittent leak of internal traffic. Here is a short video (using Kiali replay) that shows the issue. At the very beginning you see the expected, all-green, all-HTTP traffic. Quickly, some of the unexpected (blue) TCP telemetry appears, and as I skip forward and advance the frames the remaining edges show up:

travel-bad-telemetry

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[x] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Expected behavior
The TCP edges to PassthroughCluster, and from unknown, should not show up, which means that Istio should not generate the underlying Prometheus time series.

Steps to reproduce the bug
The travel-agency app is found here: https://github.com/lucasponce/travel-comparison-demo

There is a script to install the app here: https://github.com/jmazzitelli/test/tree/master/deploy-travel-agency

This will install travel agency on minikube:

$ CLIENT_EXE=minikube bash <(curl -L https://raw.githubusercontent.com/jmazzitelli/test/master/deploy-travel-agency/deploy-travel-agency-demo.sh)

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)
This has been reproduced on both 1.6.0 and the 1.6.1 pre-release, using the default (v2) telemetry.

How was Istio installed?
istioctl

Environment where bug was observed (cloud vendor, OS, etc.)
This has been reproduced on Minikube and OpenShift, both on bare metal and on AWS.

cc @jmazzitelli @lucasponce

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 74 (45 by maintainers)

Most upvoted comments

After increasing the outbound protocol sniffing timeout, the unknown edge disappeared. Could you try setting --set meshConfig.protocolDetectionTimeout=1s at installation and see if it fixes your problem too? We might need to consider increasing the timeout of outbound listener sniffing.
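
As a rough sketch (assuming installation via istioctl install with the default profile; adapt this if you drive installation through an IstioOperator resource), the flag could be applied like this:

$ istioctl install --set meshConfig.protocolDetectionTimeout=1s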

Looks like no fix is coming in 1.7, so if you are affected, part of your telemetry will be reported incorrectly, coming from unknown or going to PassthroughCluster. To improve the Kiali graph I can only recommend disabling protocol sniffing completely if your app doesn’t need it [1], or hiding the unwanted traffic by entering node=unknown OR service^=Pass in the Kiali graph hide field.

[1] Disable proto-sniffing by setting values.pilot.enableProtocolSniffingForInbound=false and values.pilot.enableProtocolSniffingForOutbound=false.
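
As a sketch, again assuming installation via istioctl install (adapt if you use an IstioOperator resource or Helm), the two settings from [1] could be passed as:

$ istioctl install \
    --set values.pilot.enableProtocolSniffingForInbound=false \
    --set values.pilot.enableProtocolSniffingForOutbound=false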

I’m not sure if @howardjohn has any other recommendation; I suggest pushing for https://github.com/istio/istio/issues/24998 to be fixed ASAP.

@FL3SH, your graph in particular is pretty wild. I don’t think I’ve seen two PassthroughCluster nodes before, and I’m not sure how that happens.

@FL3SH yes, for TCP connections the graph will be disconnected if mTLS is not enabled. For HTTP requests, the graph will still be connected even without mTLS, since we use headers to exchange workload metadata between source and destination.
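
For anyone who wants to try it, here is a minimal sketch of enabling strict mTLS mesh-wide so that TCP edges stay connected (this assumes a mesh-wide PeerAuthentication in istio-system is acceptable in your cluster):

$ kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT   # require mTLS, so TCP telemetry carries source workload metadata
EOF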

I was able to clean up my graph quite a bit.

Screenshot 2020-08-06 at 16 32 14

  • removed VirtualServices for redis and mongo
  • fixed all port name prefixes - I was missing the tcp- or http- prefix that Istio uses for protocol selection (see the sketch after this list)
  • added version labels to pods, services, and StatefulSets to remove Kiali warnings
  • added missing app labels to pods, services, and StatefulSets so Kiali can group them together
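
A minimal sketch of the port-naming fix for protocol selection (the redis service name and port numbers here are hypothetical, not taken from the demo app):

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: redis           # hypothetical name, for illustration only
  labels:
    app: redis
    version: v1
spec:
  selector:
    app: redis
  ports:
  - name: tcp-redis     # "tcp-" prefix: Istio treats this port as raw TCP, no sniffing
    port: 6379
  - name: http-metrics  # "http-" prefix: Istio treats this port as HTTP
    port: 9121
EOF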

Thanks @naphta, I have not had the time to try that change from 5s to 6s. It seems the timeout approach may not be a sufficient fix but I’m not familiar with the underlying code/issue. I continue to use the Kiali graph-hide expression of “node=unknown OR service^=Pass” to clean up the graph, at the expense of seeing the correct traffic totals.

That is great sleuthing.

@lambdai @PiotrSikora Can this be disabled when we know the other side is HTTP? Or is it now on by default and unchangeable?