istio: HTTPS outgoing requests blocked on fresh installation of Helm chart >= `1.3.0` because of outbound protocol sniffing

We used the default values from the Helm chart, which set ALLOW_ANY mode and enableProtocolSniffingForOutbound: true.

Platform: AWS
Installation: Helm chart version 1.3.0
Kubernetes: 1.14.7 with kops, Calico networking

Affected product area (please put an X in all that apply)

[ ] Configuration Infrastructure
[ ] Docs
[x] Installation
[x] Networking
[ ] Performance and Scalability
[x] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Steps to reproduce the bug

kubectl create ns example
kubectl label namespace example istio-injection=enabled --overwrite=true
kubectl -n example run -i --tty alpine --rm  --image=alpine --restart=Never -- sh
apk add curl
curl https://google.com

Expected

curl https://google.com
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="https://www.google.com/">here</A>.
</BODY></HTML>

Actual

curl https://google.com
curl: (35) error:1408F10B:SSL routines:ssl3_get_record:wrong version number

How to fix

Set enableProtocolSniffingForOutbound: false

helm install istio.io/istio --namespace istio-system --version 1.3.0 --set pilot.enableProtocolSniffingForOutbound=false
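
For installs driven by a values file, the same override can be expressed there instead of via --set (a sketch; the key path is taken from the --set flag above):

```yaml
# values.yaml fragment: disable outbound protocol sniffing in Pilot
pilot:
  enableProtocolSniffingForOutbound: false
```

Then pass it with `helm install istio.io/istio --namespace istio-system --version 1.3.0 -f values.yaml`.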

Version (include the output of istioctl version --remote and kubectl version)

istio 1.3.0
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.7", GitCommit:"8fca2ec50a6133511b771a11559e24191b1aa2b4", GitTreeState:"clean", BuildDate:"2019-09-18T14:47:22Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.6", GitCommit:"96fac5cd13a5dc064f7d9f4f23030a6aeface6cc", GitTreeState:"clean", BuildDate:"2019-08-19T11:05:16Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}

How was Istio installed?

helm repo add istio.io https://storage.googleapis.com/istio-release/releases/1.3.0/charts
helm install istio.io/istio-init --namespace istio-system --version 1.3.0
helm install istio.io/istio --namespace istio-system --version 1.3.0

Environment where bug was observed (cloud vendor, OS, etc)

AWS with kops 1.14.0

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 6
  • Comments: 17 (5 by maintainers)

Most upvoted comments

We got the same issue on 1.3.3. Downgrading to 1.2.6 works fine…

First, let me say thank you for the many responses and the recommendation for the narrowly scoped networking.istio.io/v1alpha3:Sidecar @howardjohn; these have “resolved” the issue I’m about to describe, but I still think it’s worth noting.

I wasn’t sure if here, #22150 (since MySQL and postgres are siblings here with their own wire protocols), #16458 or #20703 was the right place. As far as I can tell, the core driver of all of these “protocol sniffing snafu” issues is the creation of Envoy listener(s) on the wildcard address (0.0.0.0) for a given port; listeners that also happen to incorrectly match some protocol (e.g. assuming cleartext HTTP and missing tls_inspector, or receiving raw TCP wire protocols like postgres or mysql).


I am currently in the process of rolling out Istio to an established set of clusters that my team manages. The “protocol sniffing snafu” is a bit concerning to us and we are trying to mitigate so that we can guarantee our mesh won’t become unusable if a new Kubernetes resource (that we didn’t account for) joins our cluster.

When upgrading Istio recently, services that were already running smoothly in the mesh failed to start up. The culprit: an Envoy listener on 0.0.0.0:5432 (and these services refusing to start if they couldn’t reach an RDS instance outside the cluster). The upgrade was from 1.4.3 to 1.5.0 (I know the protocol detection issue started in the 1.3.x series), but I suspect something small may have changed between 1.4.3 and 1.5.0.

It seems the 0.0.0.0:5432 listener was due to a service port in the default namespace and not due to intrinsic naming of that service port (it was named postgres). As I mentioned, this issue did not occur in Istio 1.4.3 even though that service — postgres.default — was present when we were running 1.4.3.

UPDATE: As I took some time to reproduce this after writing, I realize now that the default namespace is not the real differentiator from the other services (which is a good thing). Instead it’s the fact that the service has no cluster IP:

$ kubectl --output wide get service --namespace default postgres
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE    SELECTOR
postgres   ClusterIP   None         <none>        5432/TCP   475d   role=postgres
$
$ kubectl --output wide get service --namespace acme-system postgres
NAME       TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE     SELECTOR
postgres   ClusterIP   10.101.9.40   <none>        5432/TCP   2y28d   k8s-app=postgres

So it seems a Kubernetes bug (i.e. the absence of a CLUSTER-IP for this long-since-forgotten service of type ClusterIP) is causing a bug in the Envoy listener creation.
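
A minimal service matching the offending postgres.default above (type ClusterIP but headless, so no CLUSTER-IP is allocated) might look like this sketch; the selector and port are taken from the kubectl output above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: default
spec:
  clusterIP: None        # headless: no CLUSTER-IP is allocated
  selector:
    role: postgres
  ports:
  - name: postgres       # generic name, so Istio falls back to protocol sniffing
    port: 5432
```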


Some notes about how I concluded it was the port in the default namespace:

  1. At the time of debugging our cluster had a total of 35 service ports using port 5432 across all namespaces (31 of which were in the “active” namespace ./*).
  2. Once I applied the Sidecar to limit hosts to istio-system/* and ./*, the issue went away (and quite quickly, it’s very impressive how quickly updates propagate from Pilot / istiod).
  3. All other service ports numbered 5432 did show up as a ${CLUSTER_IP}:5432 listener, i.e. not 0.0.0.0, verified via istioctl proxy-config listeners --port 5432.
  4. The “active” namespace where the Sidecar was applied had (at debugging time) 31 service ports running 5432. 30 of those ports had no name, 1 port was named pgsql.
  5. After kubectl delete-ing the Sidecar, the issue immediately resurfaced.
  6. I used kubectl edit on the offending service (postgres.default) to change service port 5432’s name from postgres to tcp-postgres and it also immediately resolved the issue. Somewhat surprisingly, after the 0.0.0.0:5432 listener disappeared, no equivalent ${CLUSTER_IP}:5432 listener popped up for postgres.default.
  7. There was also a postgres.acme-system (replace acme with our company name) service that exposed 5432 named postgres (i.e. identical to the port name and number from the postgres.default service) but this only produced a ${CLUSTER_IP}:5432 listener.
  8. Somewhat less significantly, there were 2 other services outside of istio-system/* and ./* that exposed port 5432 with names pgsql and postgresql; both of these produced a ${CLUSTER_IP}:5432 listener.
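
The Sidecar applied in steps 2 and 5 might look roughly like this (a sketch; the metadata name and namespace are illustrative):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: Sidecar
metadata:
  name: default
  namespace: acme-active   # illustrative: the "active" namespace being scoped
spec:
  egress:
  - hosts:
    - "istio-system/*"     # still reach the mesh control plane / infra
    - "./*"                # services in the Sidecar's own namespace
```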

As an additional mitigation to using a networking.istio.io/v1alpha3:Sidecar we are going to require all service ports in our namespaces in the Istio service mesh to use manual protocol selection. Once all ports adhere to this rule, it will be enforced for new services with a service admission webhook.
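
Manual protocol selection here means giving every service port a protocol-prefixed name so Istio never has to sniff; a sketch of a compliant port spec (names illustrative):

```yaml
# Fragment of a Service spec using manual protocol selection:
# the "tcp-" prefix tells Istio to treat the port as opaque TCP.
ports:
- name: tcp-postgres
  port: 5432
  protocol: TCP
```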


I did a bit more digging to get the 0.0.0.0:5432 listener. It is essentially identical to all of the other 5432 listeners, except for an extra filter chain at the beginning of listeners[34]["filterChains"] that points at BlackHoleCluster and matches the pod IP 10.101.175.16 of the ${POD_NAME} I was getting listeners from:

{
    "filterChainMatch": {
        "prefixRanges": [
            {
                "addressPrefix": "10.101.175.16",
                "prefixLen": 32
            }
        ]
    },
    "filters": [
        {
            "name": "envoy.filters.network.wasm",
            "typedConfig": {
                "@type": "type.googleapis.com/udpa.type.v1.TypedStruct",
                "typeUrl": "type.googleapis.com/envoy.config.filter.network.wasm.v2.Wasm",
                "value": {
                    "config": {
                        "configuration": "{\n  \"debug\": \"false\",\n  \"stat_prefix\": \"istio\",\n}\n",
                        "root_id": "stats_outbound",
                        "vm_config": {
                            "code": {
                                "local": {
                                    "inline_string": "envoy.wasm.stats"
                                }
                            },
                            "runtime": "envoy.wasm.runtime.null",
                            "vm_id": "stats_outbound"
                        }
                    }
                }
            }
        },
        {
            "name": "envoy.tcp_proxy",
            "typedConfig": {
                "@type": "type.googleapis.com/envoy.config.filter.network.tcp_proxy.v2.TcpProxy",
                "statPrefix": "BlackHoleCluster",
                "cluster": "BlackHoleCluster"
            }
        }
    ]
}

Additionally in listeners[34]["filterChains"][2]["filters"], the filter at index 0 (which is the envoy.http_connection_manager filter) has typedConfig.rds.routeConfigName equal to 5432, though I expected it to be postgres.default.svc.cluster.local:5432 (based on the other listeners).

NOTE: Here I’m referring to listeners as the JSON output of

.../istio-1.5.0/bin/istioctl proxy-config listeners \
  --namespace ${NAMESPACE} \
  ${POD_NAME} \
  --port 5432 \
  --output json
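
To pick out black-holing filter chains from such a dump without eyeballing the JSON, a jq filter along these lines can help (a sketch; assumes jq is available and the dump above was saved as listeners.json):

```shell
# Count filter chains whose tcp_proxy filter routes to BlackHoleCluster,
# given the JSON output of `istioctl proxy-config listeners ... --output json`.
jq '[.[].filterChains[]?
     | select(any(.filters[]?; .typedConfig.cluster? == "BlackHoleCluster"))]
    | length' listeners.json
```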

(Also, small world, @louiscryan was on my interview panel at Google back in 2011. He stumped me with a 2D binary search where x and y were independent.)