istio: zookeeper stops working after injecting istio-proxy

Bug description

  1. Installed ZooKeeper with the Helm chart, following the instructions here: https://github.com/helm/charts/tree/master/incubator/zookeeper

Everything worked fine; I validated that each of the 3 pods in the StatefulSet was healthy and the quorum was established.

  2. Annotated the namespace for Istio auto-injection and killed each of the 3 ZooKeeper pods. The pods come back with the istio-proxy sidecar, but none of them stays running for long; they keep restarting (see the output below and the reproduction sketch after it):
$ k get services
NAME                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
kubernetes           ClusterIP   172.21.0.1     <none>        443/TCP                      167d
zookeeper            ClusterIP   172.21.229.8   <none>        2181/TCP                     4h42m
zookeeper-headless   ClusterIP   None           <none>        2181/TCP,3888/TCP,2888/TCP   4h42m
$ k get pods
NAME          READY   STATUS             RESTARTS   AGE
zookeeper-0   1/2     Running            62         3h32m
zookeeper-1   1/2     CrashLoopBackOff   61         3h31m
zookeeper-2   2/2     Running            62         3h30m
$ k get statefulset
NAME        READY   AGE
zookeeper   1/3     4h49m
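
For reference, the reproduction amounts to roughly the following (a sketch only; the chart name follows the link above, and the namespace/label assume the standard Istio auto-injection setup rather than anything specific to this cluster):

$ helm install --name zookeeper incubator/zookeeper
$ kubectl label namespace default istio-injection=enabled
$ kubectl delete pod zookeeper-0 zookeeper-1 zookeeper-2   # pods come back with the istio-proxy sidecar
$ kubectl get pods -w                                      # watch them crash-loop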

Chatted with @hzxuzhonghu briefly in the #networking channel on Slack; opening this issue to track it.

Expected behavior: ZooKeeper continues to work, at least in permissive mode.

Steps to reproduce the bug: see above.

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)
$ istioctl version
client version: 1.4.0
control plane version: 1.4.0
data plane version: 1.3.2 (3 proxies), 1.4.0 (4 proxies)

How was Istio installed? istioctl manifest apply

Environment where the bug was observed (cloud vendor, OS, etc.): IBM Cloud Kubernetes 1.14 cluster

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 5
  • Comments: 30 (25 by maintainers)

Commits related to this issue

Most upvoted comments

It works pretty well after I removed the annotations! @banix @snible, thank you very much for the suggestion of using quorumListenOnAllIPs.

Here is what I did.

  1. I had a generated zookeeper YAML based on the helm template command. Add the following to the zookeeper ConfigMap:

echo "quorumListenOnAllIPs=true" >> $ZK_CONFIG_FILE

  2. Remove all Istio-related annotations for excluded ports from the zookeeper YAML.

  3. Redeploy the zookeeper YAML file. Make sure all pods are freshly deployed and check the init container to ensure the ports 2888/3888 aren't excluded there (see the check sketched after the transcript below).

  4. All ZooKeeper pods should be up and running. Exec into any of the zookeeper pods:

$ k exec -it zookeeper-0 -c istio-proxy -- bash
istio-proxy@zookeeper-0:/$ nc -v zookeeper.default.svc.cluster.local 2181
zookeeper.default.svc.cluster.local [172.21.225.83] 2181 (?) open
status
Zookeeper version: 3.5.5-390fe37ea45dee01bf87dc1c042b5e3dcce88653, built on 05/03/2019 12:07 GMT
Clients:
 /127.0.0.1:37924[0](queued=0,recved=1,sent=0)

Latency min/avg/max: 0/0/0
Received: 5
Sent: 4
Connections: 1
Outstanding: 0
Zxid: 0x700000000
Mode: follower
Node count: 5

istio-proxy@zookeeper-0:/$ netstat -pan | grep 3888
tcp        0      0 127.0.0.1:45104         127.0.0.1:3888          ESTABLISHED 19/envoy            
tcp        0      0 127.0.0.1:45102         127.0.0.1:3888          ESTABLISHED 19/envoy            
tcp6       0      0 :::3888                 :::*                    LISTEN      -                   
tcp6       0      0 127.0.0.1:3888          127.0.0.1:45104         ESTABLISHED -                   
tcp6       0      0 127.0.0.1:3888          127.0.0.1:45102         ESTABLISHED -                   
istio-proxy@zookeeper-0:/$ netstat -pan | grep 2888
tcp        0      0 172.30.244.96:34532     172.30.196.190:2888     ESTABLISHED 19/envoy            
tcp6       0      0 172.30.244.96:34530     172.30.196.190:2888     ESTABLISHED -                   
istio-proxy@zookeeper-0:/$ netstat -pan | grep 2181
tcp6       0      0 :::2181                 :::*                    LISTEN      -                   
tcp6       0      0 127.0.0.1:39262         127.0.0.1:2181          TIME_WAIT   -                   
tcp6       0      0 127.0.0.1:39044         127.0.0.1:2181          TIME_WAIT   - 
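
For step 3, a rough way to confirm that the sidecar's init container no longer excludes the election ports, and that the new flag landed in ZooKeeper's effective config (a sketch: the jsonpath below is illustrative, and the conf four-letter word has to be whitelisted by the ZooKeeper build in use):

$ kubectl get pod zookeeper-0 -o jsonpath='{.spec.initContainers[0].args}'
  # 2888 and 3888 should no longer appear among the excluded inbound ports
$ kubectl exec -it zookeeper-0 -c istio-proxy -- bash -c 'echo conf | nc localhost 2181 | grep quorumListenOnAllIPs'
  # expect: quorumListenOnAllIPs=true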

Found a workaround for this… ZooKeeper has 3 ports: 2181/TCP, 3888/TCP, 2888/TCP.

2181 is for client connections, and 3888/2888 are both used internally for leader election and followers. I went ahead and excluded 3888/2888 from inbound interception, e.g.:

spec:
  serviceName: zookeeper-headless
  replicas: 3
  selector:
    matchLabels:
      app: zookeeper
      release: zookeeper
      component: server
  updateStrategy:
    type: RollingUpdate
    
  template:
    metadata:
      annotations:
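        # tell istio-init to skip inbound iptables interception for the ZooKeeper election ports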
        traffic.sidecar.istio.io/excludeInboundPorts: "2888,3888"

and redeployed the StatefulSet. After that, all my ZooKeeper pods come up fine and the quorum is established.
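
If regenerating the full manifest is inconvenient, roughly the same change can be applied in place with kubectl patch (a sketch; it triggers a rolling restart of the StatefulSet pods):

$ kubectl patch statefulset zookeeper --type merge -p \
    '{"spec":{"template":{"metadata":{"annotations":{"traffic.sidecar.istio.io/excludeInboundPorts":"2888,3888"}}}}}'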

Yes, you should not set quorumListenOnAllIPs with Istio 1.10; quorumListenOnAllIPs is an experimental flag from ZooKeeper and is not recommended for production anyway.

This looks like the issue we have with apps that do not listen on localhost. This can be changed by updating one or more configuration parameters for a given app. Looking at the ZooKeeper docs (https://zookeeper.apache.org/doc/r3.3.5/zookeeperAdmin.html#sc_configuration) I see:

clientPortAddress
New in 3.3.0: the address (ipv4, ipv6 or hostname) to listen for client connections; that is, the address that clients attempt to connect to. This is optional, by default we bind in such a way that any connection to the clientPort for any address/interface/nic on the server will be accepted.
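
For illustration only, such a parameter would be appended the same way as the quorum flag above (a sketch reusing the chart's $ZK_CONFIG_FILE variable; the value shown simply binds the client port on all interfaces, which is already the default behaviour described in the quote):

echo "clientPortAddress=0.0.0.0" >> $ZK_CONFIG_FILE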

looking into it.

The root cause is that ZooKeeper listens on the pod IP only:

# kubectl exec -ti zk-1 -c istio-proxy -- sh
$ netstat -pan |grep 3888
tcp6       0      0 10.244.0.95:3888        :::*                    LISTEN      - 

Ref: https://istio.io/faq/applications/#cassandra

This is really bad UX. @rshriram @howardjohn @lambdai, any idea how we can solve this?