istio: Unexplained telemetry involving passthrough and unknown

We have a demo app called “travel agency” that, when run against Istio 1.6, generates the expected telemetry but also some unexpected telemetry. The initial telemetry looks good and produces the expected Kiali graph. But we quickly see an unexpected TCP edge leading to PassthroughCluster, and then another from Unknown to a destination service. After a few minutes we eventually see additional TCP edges leading to PassthroughCluster and from Unknown. It looks like an intermittent leak of internal traffic. Here is a short video (using Kiali replay) that shows the issue. At the very beginning you see the expected, all-green, all-HTTP traffic. Quickly, some of the unexpected (blue) TCP telemetry appears, and as I skip forward and advance the frames the remaining edges show up:

travel-bad-telemetry

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[x] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Expected behavior
The TCP edges to PassthroughCluster, and from unknown, should not show up, which means that Istio should not generate the underlying Prometheus time series.

Steps to reproduce the bug
The travel-agency app is found here: https://github.com/lucasponce/travel-comparison-demo

There is a script to install the app here: https://github.com/jmazzitelli/test/tree/master/deploy-travel-agency

This will install travel agency on minikube:

$ CLIENT_EXE=minikube bash <(curl -L https://raw.githubusercontent.com/jmazzitelli/test/master/deploy-travel-agency/deploy-travel-agency-demo.sh)

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)
This has been reproduced on both 1.6.0 and the 1.6.1 pre-release, using the default (v2) telemetry.

How was Istio installed?
istioctl

Environment where bug was observed (cloud vendor, OS, etc.)
This has been reproduced on Minikube and OpenShift, both on bare metal and on AWS.

cc @jmazzitelli @lucasponce

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 74 (45 by maintainers)

Most upvoted comments

After increasing the outbound protocol sniffing timeout, the unknown edge disappeared. Could you try setting --set meshConfig.protocolDetectionTimeout=1s at installation and see if it fixes your problem too? We might need to consider increasing the timeout of outbound listener sniffing.
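
As a rough sketch (assuming installation via istioctl install with the default profile; adapt this if you drive installation through an IstioOperator resource), the flag could be applied like this:

$ istioctl install --set meshConfig.protocolDetectionTimeout=1s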

Looks like no fix is coming in 1.7, so if you are affected, part of your telemetry will be reported incorrectly, coming from unknown or going to PassthroughCluster. To improve the Kiali graph I can only recommend disabling protocol sniffing completely if your app doesn’t need it [1], or hiding the unwanted traffic by entering node=unknown OR service^=Pass in the Kiali graph hide field.

[1] Disable proto-sniffing by setting values.pilot.enableProtocolSniffingForInbound=false and values.pilot.enableProtocolSniffingForOutbound=false.
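
As a sketch, again assuming installation via istioctl install (adapt if you use an IstioOperator resource or Helm), the two settings from [1] could be passed as:

$ istioctl install \
    --set values.pilot.enableProtocolSniffingForInbound=false \
    --set values.pilot.enableProtocolSniffingForOutbound=false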

I’m not sure if @howardjohn has any other recommendation; I suggest pushing for https://github.com/istio/istio/issues/24998 to be fixed ASAP.

@FL3SH, your graph in particular is pretty wild. I don’t think I’ve seen two PassthroughCluster nodes before, and I’m not sure how that happens.

@FL3SH yes, for TCP connections the graph will be disconnected if mTLS is not enabled. For HTTP requests, the graph will still be connected even without mTLS, since we use headers to exchange workload metadata between source and destination.
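
For anyone who wants to try it, here is a minimal sketch of enabling strict mTLS mesh-wide so that TCP edges stay connected (this assumes a mesh-wide PeerAuthentication in istio-system is acceptable in your cluster):

$ kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT   # require mTLS, so TCP telemetry carries source workload metadata
EOF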

I was able to clean up my graph quite a bit.

Screenshot 2020-08-06 at 16 32 14

  • removed VirtualServices for redis and mongo
  • fixed all port name prefixes - I was missing the tcp- or http- prefix that Istio uses for protocol selection (see the sketch after this list)
  • added version labels to pods, services, and StatefulSets to remove Kiali warnings
  • added missing app labels to pods, services, and StatefulSets so Kiali can group them together
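
A minimal sketch of the port-naming fix for protocol selection (the redis service name and port numbers here are hypothetical, not taken from the demo app):

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: redis           # hypothetical name, for illustration only
  labels:
    app: redis
    version: v1
spec:
  selector:
    app: redis
  ports:
  - name: tcp-redis     # "tcp-" prefix: Istio treats this port as raw TCP, no sniffing
    port: 6379
  - name: http-metrics  # "http-" prefix: Istio treats this port as HTTP
    port: 9121
EOF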

Thanks @naphta, I have not had the time to try that change from 5s to 6s. It seems the timeout approach may not be a sufficient fix but I’m not familiar with the underlying code/issue. I continue to use the Kiali graph-hide expression of “node=unknown OR service^=Pass” to clean up the graph, at the expense of seeing the correct traffic totals.

That is great sleuthing.

@lambdai @PiotrSikora Can this be disabled when we know the other side is HTTP? Or is it now on by default and unchangeable?