istio: Possible critical bug involving Istio configs from isolated namespaces causing blackholes cluster-wide

Is this the right place to submit this?

  • This is not a security vulnerability or a crashing bug
  • This is not a question about how to use Istio

Bug Description

Hello, so we have been having a sporadic issue that has been hell to track down, but we know something is wrong. Four times now, across 3 different Istio versions (1.13 → 1.15 → 1.17) and 6 months, BlackHoleClusters have randomly started appearing on different clusters and different applications, then resolved on their own. We also use Istio SmartDNS with VIP auto-allocation. This snippet is on all of our deployments enterprise-wide:

proxy.istio.io/config: |
  proxyMetadata:
    ISTIO_META_DNS_CAPTURE: "true"
    ISTIO_META_DNS_AUTO_ALLOCATE: "true"

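For context, that annotation sits on each Deployment's pod template. A minimal sketch of where it lives (the deployment name, labels, and image below are placeholders, not our real workloads):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # placeholder name
spec:
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        # Per-pod proxy config: enable SmartDNS capture and automatic
        # VIP allocation for ServiceEntries that have no addresses set.
        proxy.istio.io/config: |
          proxyMetadata:
            ISTIO_META_DNS_CAPTURE: "true"
            ISTIO_META_DNS_AUTO_ALLOCATE: "true"
    spec:
      containers:
        - name: app
          image: example/app:latest   # placeholder image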
Here is an example over the past 4 hours:

[screenshot: BlackHoleCluster occurrences over the past 4 hours]

You can see they happen (green bars), go away, happen, go away. Also note the multiple VIPs associated with the same ServiceEntry, which is below. (Is that expected, that the VIP constantly changes for the same endpoint? Seems odd.)

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: elasticache-beta-rocket-fdx-api
  namespace: beta-212697-rocket-fdx-api
spec:
  exportTo:
    - .
  hosts:
    - master.beta-212697-rocket-fdx-api.pvfqf7.use2.cache.amazonaws.com
  location: MESH_EXTERNAL
  ports:
    - name: elasticache
      number: 6379
      protocol: TCP
  resolution: DNS

This particular application has had no changes, pushes, nothing all day. We have dozens of applications on this cluster, all with exportTo: . on their ServiceEntry. Here’s what we’ve discerned so far:

  1. The BlackHoleClusters coincide with large Istiod pushes. If I overlay istiod logs on top of the BlackHoleClusters, the red arrows mark when the blackholes start and the blue arrows mark when they resolve.

[screenshot: istiod push logs overlaid on BlackHoleCluster occurrences, with red arrows at blackhole start and blue arrows at resolution]

  2. None of the ServiceEntries on our cluster are global, so this isn’t a “conflict” situation. We enforce exportTo: . universally. However, VirtualServices and DestinationRules are exported mesh-wide, though none of them are related to this particular URL.

  3. This isn’t endpoint- or TCP-specific; this also happens to multiple apps hitting DynamoDB. In the example below, you can see traffic going from working, to blackhole, to working, while switching VIPs (240.240.253.180, 240.240.253.199):

test-rocket-client-info-store-api-59976bcf76-nsp7h istio-proxy [2023-11-27T16:09:42.270Z] "- - -" 0 - "-" 3500 5830 65201 - "-" "-" "-" "-" "52.94.4.150:443" outbound|443||dynamodb.us-east-2.amazonaws.com 100.64.16.133:58816 240.240.253.180:443 100.64.16.133:39484 - - -
test-rocket-client-info-store-api-59976bcf76-nsp7h istio-proxy [2023-11-27T16:13:56.972Z] "- - -" 0 UH "-" 0 0 0 - "-" "-" "-" "-" "-" BlackHoleCluster - 240.240.253.180:443 100.64.16.133:56838 dynamodb.us-east-2.amazonaws.com - -
test-rocket-client-info-store-api-59976bcf76-nsp7h istio-proxy [2023-11-27T17:04:51.606Z] "- - -" 0 - "-" 3469 5799 61007 - "-" "-" "-" "-" "35.71.102.101:443" outbound|443||dynamodb.us-east-2.amazonaws.com 100.64.16.133:53284 240.240.253.199:443 100.64.16.133:51640 - - -
test-rocket-client-info-store-api-59976bcf76-nsp7h istio-proxy [2023-11-27T17:05:30.457Z] "- - -" 0 - "-" 3469 6355 52608 - "-" "-" "-" "-" "52.94.4.158:443" outbound|443||dynamodb.us-east-2.amazonaws.com 100.64.16.133:41732 240.240.253.180:443 100.64.16.133:42652 - - -

  4. If I use the Sidecar resource and lock the namespace down to only care about its own Istio config, this problem goes away. So while I technically have a fix, and probably should use Sidecars anyway, I shouldn’t have to.
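For reference, the Sidecar mitigation from point 4 looks roughly like this (a sketch; the namespace matches the ServiceEntry above, and the exact egress hosts list is what we chose, not the only valid option):

apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: beta-212697-rocket-fdx-api
spec:
  egress:
    - hosts:
        # Only import config from this namespace plus istio-system,
        # instead of watching every namespace in the mesh.
        - "./*"
        - "istio-system/*"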

Based on the above, it seems like Istio configurations (VS, DR, SE, etc.) deployed in other namespaces, which then trigger Istiod xDS pushes, somehow affect egress traffic in unrelated namespaces, even though the ServiceEntries aren’t exported to anything other than their own namespace. I noticed the VIPs are changing; during one BlackHoleCluster, traffic was going to a VIP that wasn’t listed in my istioctl proxy-config listeners... output, and once it started working again, it was. Maybe Istiod is changing VIPs but then lagging on pushing them out? That’s my only theory so far.
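One way to check the theory above on a live pod is to compare the VIP the application actually resolves against the listeners Envoy has configured (commands are sketches; pod and deployment names are placeholders, and this assumes a DNS lookup tool is available in the istio-proxy image):

# What VIP does SmartDNS hand the app for this host right now?
kubectl exec deploy/example-app -c istio-proxy -n beta-212697-rocket-fdx-api -- \
  nslookup master.beta-212697-rocket-fdx-api.pvfqf7.use2.cache.amazonaws.com

# Does Envoy actually have a listener bound to that VIP and port?
istioctl proxy-config listeners <pod-name> -n beta-212697-rocket-fdx-api \
  --address 240.240.253.180 --port 6379

If the first command returns a VIP that the second command cannot find, the proxy is resolving to an address it has no listener for, which would explain the BlackHoleCluster verdicts.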

Two co-workers and I are actively digging into this but are at a loss beyond what we’ve described above. Happy to provide any other information, logs, etc. We are worried this will start happening more often in production.

Version

istioctl version
client version: 1.19.3
control plane version: 1.17.8
data plane version: 1.17.8 (168 proxies)

kubectl version
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.12", GitCommit:"ba490f01df1945d0567348b271c79a2aece7f623", GitTreeState:"clean", BuildDate:"2023-07-19T12:23:43Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"darwin/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.17-eks-4f4795d", GitCommit:"af19e454a15b5eb16d9f29d4d2361b3050ac78a6", GitTreeState:"clean", BuildDate:"2023-10-20T23:22:36Z", GoVersion:"go1.20.10", Compiler:"gc", Platform:"linux/amd64"}

Additional Information

Istio push metrics for the pod (2 days old)

pilot_xds_pushes{type="cds"} 24541
pilot_xds_pushes{type="eds"} 60245
pilot_xds_pushes{type="lds"} 24589
pilot_xds_pushes{type="nds"} 10903
pilot_xds_pushes{type="rds"} 24546

About this issue

  • State: open
  • Created 7 months ago
  • Comments: 38 (22 by maintainers)

Most upvoted comments

@ramaraochavali and I will take a look. We did change the allocation algorithm some releases ago, but it was intended to behave better.

SE is currently a namespaced resource that directly adds services to the global service registry. These kinds of conflicts are honestly expected (not intended) due to the design. I’m ok with mitigations to help users, but I remember there was a lot of discussion around the fundamental design of this resource and what it was meant to do. If we want to change SE scope to be more like a namespaced resource (instead of the in-between version we have now), we may want to document it somewhere and potentially make other changes?

/cc @kdorosh I think this is another use-case for your ServiceEntry doc from awhile ago