istio: [Istio-1.3.0-rc0] Short protocol detection timeout can fail https requests

Bug description

Follow instructions of Access External Service, everything works till we hit the section access external https service

At this point, after applying service entry for www.google.com, I’m expecting http returns 200 even with registry_only mode, however I still see

kubectl exec -it $SOURCE_POD -c sleep -- curl -I https://www.google.com | grep  "HTTP/"
command terminated with exit code 35

Affected product area (please put an X in all that apply)

[ ] Configuration Infrastructure [X ] Docs [ ] Installation [ X] Networking [ ] Performance and Scalability [ ] Policies and Telemetry [ ] Security [ ] Test and Release [ ] User Experience [ ] Developer Infrastructure

Expected behavior

Steps to reproduce the bug

Version (include the output of istioctl version --remote and kubectl version)

Istio 1.3.0 rc01

How was Istio installed?

GKE v1.12.8-gke.10

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 53 (47 by maintainers)

Most upvoted comments

Conclude the internal discussion:

  1. Setting timeout for inbound listener to 1s. The reason is considering the RTT in network, e.g., RTT for cross continent traffic
  2. Setting timeout for outbound listener to 100ms. Traffic is from localhost.

cc @duderino @costinm @rlenglet @howardjohn

Copy paste proposed solution from @rshriram

when we construct the filter chains, we add some metadata or some other param to track how many protocols are being detected…

  1. if its http & tcp only, then set 10ms timeout
  2. if its https only (i.e. sni based routing), no timeout
  3. if its http, https, and tcp, then it is a case where the user has both http and https services on same port. We can set a large timeout here like 200ms.

on inbound, we should not increase timeout at all IMO. inbound scenario is straight forward. its always between http and tcp [with or without tls sniffing]. or if we feel too nervous about 10ms, then we can bump it to 25ms. But really, the 90% of k8s clusters are inside one dc where RTTs are <1ms

@louiscryan @howardjohn and @yxue and I had a chat. The plan for 1.3.0 is:

  1. @howardjohn will disable tls inspector for ingress gateways (or if it can’t be removed, ensure that the listener filter timeout will not affect it). It will only be enabled for sidecars and egress gateways.
  2. @yxue will increase the default timeout to 1 sec
  3. @yxue will add a DEBUG-level log message to Envoy whenever the timeout is hit
  4. best effort/if there is time (and there probably isn’t, but it can go into 1.3.1): @yxue will export timeout metrics from Envoy through Istio.

@rlenglet changing to 10s does make a difference, now curl -i -V reliably succeeds. Great sniffing on the protocol sniffing’s problem!

+1 on the tricky to set the timeout…

Is there an easy way to tell from the istio-proxy logs? If this is the issue, I bet it will happen frequently to users, and we should have a quick way of diagnosing this. And ideally that should be included in the troubleshooting docs.

Works for me