istio: pilot: gRPC stream closures result in unwatched / lost ClusterLoadAssignments in Envoy; early readiness may result in incorrect configuration
Bug description
Following a closed gRPC stream to Pilot, Envoy sidecars end up in a state where they have no ClusterLoadAssignments. This results in requests falling back to non-Istio (i.e. ClusterIP-based) routing.
Expected behavior
When a connection to Pilot is closed, the sidecar proxy should retain its current configuration.
Steps to reproduce the bug
We don’t have a clear reproducer, but in our clusters we saw 100+ instances of this with the same pattern, all within a second of each other.
From the perspective of one Envoy sidecar, its connection to a Pilot instance is closed, and it appears to have no ClusterLoadAssignments being watched:
[Envoy (Epoch 0)] [2020-05-26 00:11:37.919][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
[Envoy (Epoch 0)] [2020-05-26 00:11:38.008][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure
[Envoy (Epoch 0)] [2020-05-26 00:11:39.020][22][warning][config] [external/envoy/source/common/config/grpc_mux_impl.cc:153] Ignoring unwatched type URL type.googleapis.com/envoy.api.v2.ClusterLoadAssignment
The sidecar’d application is still sending requests, but at this point they all fall through to the PassthroughCluster.
In terms of ramifications / impact, this issue has a noticeable effect on load balancing: connections are established to a single upstream via Kubernetes routing (ClusterIP / kube-proxy load balancing). Because there is no request-level load balancing on these connections, all requests sent from the sidecar’d application go to the same server, which can overwhelm it, while the other endpoints in the upstream cluster receive little, if any, traffic.
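To make the connection-level vs. request-level distinction concrete, here is a minimal Go sketch (the service hostname and port are hypothetical, not from our setup): with HTTP keep-alives left at their default, the client reuses one TCP connection, so kube-proxy's per-connection ClusterIP balancing picks a backend pod once and every subsequent request lands on that same pod.

package main

// Minimal sketch: with keep-alives enabled (the default http.Transport
// behavior), every request below reuses the same TCP connection, so
// per-connection ClusterIP load balancing chooses a backend pod once and
// all subsequent requests hit that same pod. The service URL is hypothetical.

import (
	"fmt"
	"net/http"
)

func main() {
	client := &http.Client{} // default Transport: keep-alives on, connection reuse

	for i := 0; i < 100; i++ {
		resp, err := client.Get("http://my-upstream.default.svc.cluster.local:8080/")
		if err != nil {
			fmt.Println("request failed:", err)
			return
		}
		resp.Body.Close()
	}
	// Without Envoy's request-level (EDS-driven) load balancing in the path,
	// all 100 requests above traverse one connection to one pod.
}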
Interestingly, looking at where these log lines originate in the Envoy source code, it appears as though we’re hitting a “shouldn’t happen” case, which smells like a bug.
// No watches and we have resources - this should not happen. send a NACK (by not
// updating the version).
ENVOY_LOG(warn, "Ignoring unwatched type URL {}", type_url);
queueDiscoveryRequest(type_url);
It’s unclear to me if this is an Istio or Envoy issue, but I thought I’d start here. I can bug our friends over in the Envoy project if it turns out the issue is elsewhere.
Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)
$ istioctl version --remote
client version: unknown
control plane version: 1.4.6
$ helm version --server
Server: &version.Version{SemVer:"v2.16.5", GitCommit:"89bd14c1541fa93a09492010030fd3699ca65a97", GitTreeState:"clean"}
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-13T18:06:54Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-gke.9", GitCommit:"e1af17fd873e15a48769e2c7b9851405f89e3d0d", GitTreeState:"clean", BuildDate:"2020-04-06T20:56:54Z", GoVersion:"go1.12.17b4", Compiler:"gc", Platform:"linux/amd64"}
How was Istio installed?
Helm.
Environment where bug was observed (cloud vendor, OS, etc)
GKE - 1.15.11-gke.9
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 19 (18 by maintainers)
Commits related to this issue
- Ensure properly synced before marking pilot ready For https://github.com/istio/istio/issues/24117 — committed to howardjohn/istio by howardjohn 4 years ago
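The commit above points at readiness gating in Pilot: the control plane should only pass its readiness probe once it can serve a complete configuration, so proxies never reconnect to an instance that would hand out empty ClusterLoadAssignments. Below is a hedged, hypothetical Go sketch of that pattern (the /ready path, the markSynced helper, and the timing are assumptions for illustration, not Istio’s actual readiness code):

package main

// Hypothetical illustration of readiness gating: the /ready endpoint only
// returns 200 once the (simulated) initial config sync has completed, so an
// instance that is not yet fully synced never receives proxy connections.
// This is NOT Istio's actual implementation.

import (
	"net/http"
	"sync/atomic"
	"time"
)

var synced int32 // 0 = still syncing, 1 = full config available

func markSynced() { atomic.StoreInt32(&synced, 1) }

func readyHandler(w http.ResponseWriter, _ *http.Request) {
	if atomic.LoadInt32(&synced) == 0 {
		http.Error(w, "config not yet synced", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	// Simulate the initial sync finishing some time after startup.
	go func() {
		time.Sleep(5 * time.Second)
		markSynced()
	}()

	http.HandleFunc("/ready", readyHandler)
	http.ListenAndServe(":8080", nil)
}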
@nicktrav awesome job on tracking this down! I can reproduce this as well and will investigate some more