istio: listener conflicts related to headless services (warnings, errors, etc.)

Hi, I’m seeing the following warning in istio-pilot/discovery:

2018-06-03T00:27:50.659513Z     warn    buildSidecarOutboundListeners: listener conflict (TCP current and new TCP) on 0.0.0.0:9094, destination:outbound|9094||prometheus-alertmanager.istio-system.svc.cluster.local, current Listener: (0.0.0.0_9094 name:"0.0.0.0_9094" address:<socket_address:<address:"0.0.0.0" port_value:9094 > > filter_chains:<filters:<name:"envoy.tcp_proxy" config:<fields:<key:"deprecated_v1" value:<bool_value:true > > fields:<key:"value" value:<struct_value:<fields:<key:"route_config" value:<struct_value:<fields:<key:"routes" value:<list_value:<values:<struct_value:<fields:<key:"cluster" value:<string_value:"outbound|9094||alertmanager.data-platform.svc.cluster.local" > > > > > > > > > > fields:<key:"stat_prefix" value:<string_value:"outbound|tcp|9094" > > > > > > > > deprecated_v1:<bind_to_port:<> > )

For context, we have two teams running alertmanager, in separate namespaces.

  1. alertmanager.data-platform.svc.cluster.local
  2. prometheus-alertmanager.istio-system.svc.cluster.local

The conflict appears when both services use the same port definition, e.g.:

spec:
  ports:
  - name: cluster
    port: 9094
    protocol: TCP
    targetPort: cluster

If I rename the port in one of the two services, everything starts working again and the warnings in istio-pilot go away.

For example, with the original port name:

spec:
  ports:
  - name: cluster
    port: 9094
    protocol: TCP
    targetPort: cluster

Adding a prefix resolves it:

spec:
  ports:
  - name: http-cluster
    port: 9094
    protocol: TCP
    targetPort: cluster
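The prefix matters because Istio infers the protocol of a Service port from the port name's `<protocol>[-<suffix>]` prefix; an unnamed port or an unrecognized prefix (like `cluster`) is treated as plain TCP. A minimal sketch of that inference rule (this is an illustration of the documented naming convention, not Istio's actual code):

```python
# Prefixes Istio recognizes in Service port names (protocol selection
# by port-name convention); anything else falls back to plain TCP.
KNOWN_PREFIXES = {
    "http", "http2", "https", "grpc", "tcp", "udp",
    "tls", "mongo", "redis", "mysql",
}

def infer_protocol(port_name: str) -> str:
    """Return the protocol Istio would infer from a Service port name."""
    prefix = port_name.split("-", 1)[0].lower()
    return prefix if prefix in KNOWN_PREFIXES else "tcp"

# "cluster" has no recognized prefix, so the port is treated as raw TCP;
# "http-cluster" is inferred as HTTP, which avoids the TCP/TCP listener
# conflict on the shared wildcard address.
```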

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 36 (35 by maintainers)

Most upvoted comments

Hello, I’m really sorry but I’m not OK at all with this thread. We just had an outage caused by the following sequence of events:

One team is running solr, in their own namespace, with this service:

❯ k -n search-solr get service solr -o yaml
apiVersion: v1
kind: Service
metadata:
  name: solr
  namespace: search-solr
spec:
  ports:
  - name: http-server
    port: 80
    protocol: TCP
    targetPort: http-server
  selector:
    app: solr
  type: ClusterIP

Another team, in another namespace, completely unrelated, deployed a headless service on the same port as the http solr service:

apiVersion: v1
kind: Service
metadata:
  name: airflow-worker
  namespace: data-platform-airflow
spec:
  clusterIP: None
  ports:
  - port: 80
    name: airflow-worker
  selector:
    app: airflow-worker

This results in:

2018-07-06T11:15:52.339608Z     warn    buildSidecarOutboundListeners: listener conflict (HTTP current and new TCP) on 0.0.0.0:80, destination:outbound|80||solr.search-solr.svc.cluster.local, current Listener: (0.0.0.0_80 name:"0.0.0.0_80" address:<socket_address:<address:"0.0.0.0" port_value:80 > > filter_chains:<filters:<name:"envoy.tcp_proxy" config:<fields:<key:"deprecated_v1" value:<bool_value:true > > fields:<key:"value" value:<struct_value:<fields:<key:"route_config" value:<struct_value:<fields:<key:"routes" value:<list_value:<values:<struct_value:<fields:<key:"cluster" value:<string_value:"outbound|80||airflow-worker.data-platform-airflow.svc.cluster.local" > > > > > > > > > > fields:<key:"stat_prefix" value:<string_value:"outbound|tcp|80" > > > > > > > > deprecated_v1:<bind_to_port:<> > )

And it totally broke all communication to http://solr.search-solr, and solr started receiving traffic from airflow!

WARN  2018-07-06 11:12:06,491 [qtp705265961-104][HttpParser.java:1418] : bad HTTP parsed: 400 Illegal character 0x16 for HttpChannelOverHttp@2196c7c0{r=0,c=false,a=IDLE,uri=null}
WARN  2018-07-06 11:12:06,580 [qtp705265961-100][HttpParser.java:1802] : Illegal character 0x16 in state=START for buffer HeapByteBuffer@6bbee1e0[p=1,l=143,c=8192,r=142]={\x16<<<\x03\x01\x00\x8a\x01\

Removing airflow-worker restored communication to search-solr.

The fact that people working in isolated namespaces can cause service-impacting issues in totally unrelated namespaces through their port configuration is not acceptable. We basically cannot use Istio on a multi-tenant cluster where people are empowered to write their own yaml files, if we do not get a resolution to this problem.

@nmittler @sakshigoel12

Looking closely at the problem, this conflict arises from multiple protocols on the same port with a wildcard listener. Even if we set up listeners on every service IP, the same problem will occur when there are two service entries with HTTP and TCP, pointing to two different services. Put another way, the whole of Istio is built on the assumption of client-side load balancing, service discovery, etc. Headless services and stateful sets break that pattern by directly accessing a service instance, instead of accessing through a virtual IP.

There are a few ways to solve this IMO:

  1. Enforce same protocol for all services on same port
  2. Enforce unique ports for each service

Either of these options requires CI/CD or some form of admission control that can detect port conflicts and prevent the services from being deployed in Kubernetes (and give a helpful error message to the end user). @ayj is it possible to write such an admission control plugin for services? Or is it restricted to just CRDs?
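For what it's worth, Kubernetes validating admission webhooks (`admission.k8s.io`) can target built-in resources such as Services, not only CRDs. A minimal sketch of the check such a webhook handler could perform, assuming a hypothetical `existing_ports` cache mapping port number to the protocol already inferred for it in the mesh:

```python
RECOGNIZED = {"http", "http2", "https", "grpc", "tls"}

def review_service(review: dict, existing_ports: dict) -> dict:
    """Reject a Service whose port is already in use with a different
    inferred protocol. `review` is an AdmissionReview request dict."""
    svc = review["request"]["object"]
    for p in svc["spec"].get("ports", []):
        prefix = (p.get("name") or "").split("-", 1)[0].lower()
        proto = prefix if prefix in RECOGNIZED else "tcp"
        known = existing_ports.get(p["port"])
        if known is not None and known != proto:
            return _response(review, False,
                             f"port {p['port']} is already used with protocol "
                             f"{known}; declaring it as {proto} would create "
                             f"an Istio listener conflict")
    return _response(review, True, "")

def _response(review: dict, allowed: bool, message: str) -> dict:
    # Shape of an AdmissionReview response (admission.k8s.io/v1).
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": review["request"]["uid"],
            "allowed": allowed,
            "status": {"message": message} if message else {},
        },
    }
```

This only sketches the decision logic; a real webhook would also need the HTTPS serving, a `ValidatingWebhookConfiguration` registered for `CREATE`/`UPDATE` on `services`, and a consistent view of existing ports.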

@rshriram has it crossed your mind, even for one second, that consumers of Istio, who don’t always understand the internals of the code base, report issues based on their perception of the world and then rely on “the experts” to help them identify the root cause?

This issue details my experiences, and the logs I see. And as I said early on in the issue, I did see a service outage then too but couldn’t quite identify what caused it.

In both circumstances, I see the same warnings in pilot, so I am obviously going to continue to talk on the same thread. From a consumer’s perspective, they are related in the sense that in both situations, people expect to be able to deploy into a namespace with isolation, and Istio breaks that mould, and that is very dangerous.

If you would like to break it up, please go ahead. But stop being difficult and pedantic with someone who is just trying to help you find issues in a product before you fly your flagship “1.0 release”.

Can you remove the `clusterIP: None` from your spec? It seems that when reading from Kubernetes, it is giving Pilot a cluster IP of 0.0.0.0 for both services. This results in a collision: two listeners on the same IP:port (0.0.0.0:9999).