istio: pilot: gRPC stream closures result in unwatched / lost ClusterLoadAssignments in Envoy; early readiness may result in incorrect configuration

Bug description

Following a closed gRPC stream to Pilot, Envoy sidecars end up in a state where they have no ClusterLoadAssignments. As a result, requests fall back to non-Istio (ClusterIP-based) routing.
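
To confirm a sidecar is in this state, the endpoints Envoy currently holds can be inspected via istioctl or the Envoy admin port (the pod name below is a placeholder; the istioctl variant assumes a build that ships the proxy-config endpoints subcommand, and the curl variant assumes curl is available in the istio-proxy image):

$ istioctl proxy-config endpoints my-app-7d4b9c6f5-abcde.default
$ kubectl exec my-app-7d4b9c6f5-abcde -c istio-proxy -- curl -s localhost:15000/clusters

An affected sidecar shows no endpoints for its outbound clusters, while a healthy one lists an entry per upstream pod.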

Expected behavior

When a connection to Pilot is closed, the sidecar proxy should retain its current configuration.

Steps to reproduce the bug

We don’t have a clear reproducer, but across our clusters we saw 100+ instances of this failure, all following the same pattern and all within a second of each other.

From the perspective of one Envoy sidecar, its connection to a Pilot instance is closed, after which no ClusterLoadAssignments appear to be watched:

[Envoy (Epoch 0)] [2020-05-26 00:11:37.919][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
[Envoy (Epoch 0)] [2020-05-26 00:11:38.008][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure
[Envoy (Epoch 0)] [2020-05-26 00:11:39.020][22][warning][config] [external/envoy/source/common/config/grpc_mux_impl.cc:153] Ignoring unwatched type URL type.googleapis.com/envoy.api.v2.ClusterLoadAssignment

The sidecar’d application is still sending requests, but at this point they all fall through to the PassthroughCluster.
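
One way to sanity-check the fallback is to watch the PassthroughCluster connection counters on the affected sidecar (placeholder pod name; assumes curl is available in the istio-proxy image):

$ kubectl exec my-app-7d4b9c6f5-abcde -c istio-proxy -- curl -s localhost:15000/stats | grep 'cluster.PassthroughCluster.upstream_cx'

If requests are falling through, these counters keep climbing while the stats for the per-service outbound clusters stay flat.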

In terms of ramifications, this issue has a noticeable impact on load balancing: connections are established to a single upstream via Kubernetes routing (ClusterIP / kube-proxy load balancing). As there is no request-level load balancing on these connections, all requests sent from the sidecar’d application go to the same server, which can overwhelm it, while all other endpoints in the upstream cluster receive little, if any, traffic.
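
The resulting skew should also be visible from the receiving side, by comparing inbound request counters across the sidecars of the pods backing the service (placeholder pod name; repeat for each server pod, again assuming curl is available in the istio-proxy image):

$ kubectl exec server-pod-0 -c istio-proxy -- curl -s localhost:15000/stats | grep downstream_rq_total

The overloaded pod’s counters grow much faster than its peers’.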

Interestingly, reading the Envoy source code where this log line originates, it appears as though we’re hitting a “shouldn’t happen” case, which smells like a bug:

      // No watches and we have resources - this should not happen. send a NACK (by not
      // updating the version).
      ENVOY_LOG(warn, "Ignoring unwatched type URL {}", type_url);
      queueDiscoveryRequest(type_url);
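
As an extra data point when this happens, it may help to separate the transient disconnect from the dropped EDS watch (placeholder pod name; assumes curl is available in the istio-proxy image): control_plane.connected_state reports whether the ADS connection to Pilot is currently up, while the warning above indicates the EDS subscription is no longer being watched.

$ kubectl exec my-app-7d4b9c6f5-abcde -c istio-proxy -- curl -s localhost:15000/stats | grep control_plane.connected_state
$ kubectl logs my-app-7d4b9c6f5-abcde -c istio-proxy | grep 'Ignoring unwatched type URL'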

It’s unclear to me if this is an Istio or Envoy issue, but I thought I’d start here. I can bug our friends over in the Envoy project if it turns out the issue is elsewhere.

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)

$ istioctl version --remote
client version: unknown
control plane version: 1.4.6
$ helm version --server
Server: &version.Version{SemVer:"v2.16.5", GitCommit:"89bd14c1541fa93a09492010030fd3699ca65a97", GitTreeState:"clean"}
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-13T18:06:54Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-gke.9", GitCommit:"e1af17fd873e15a48769e2c7b9851405f89e3d0d", GitTreeState:"clean", BuildDate:"2020-04-06T20:56:54Z", GoVersion:"go1.12.17b4", Compiler:"gc", Platform:"linux/amd64"}

How was Istio installed?

Helm.

Environment where bug was observed (cloud vendor, OS, etc)

GKE - 1.15.11-gke.9

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 19 (18 by maintainers)

Most upvoted comments

@nicktrav awesome job on tracking this down! I can reproduce this as well and will investigate some more