linkerd2: MariaDB Galera gcomm protocol does not work with opaque ports (linkerd-edge-21.1.3)

What is the issue?

MariaDB client connections over port 3306 and the Galera SST port 4444 are meshed perfectly using the new opaque-ports annotation. Galera connectivity breaks when ports 4567 and 4568 are meshed: 4567 is reserved for Galera Cluster replication traffic, and 4568 is the port for Incremental State Transfer (IST).

How can it be reproduced?

Set up a vanilla MariaDB Galera cluster using the Bitnami Helm chart, then mesh the cluster with the following annotation:

     config.linkerd.io/opaque-ports: "3306,4444,4567,4568"
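For reference, one way to get this annotation onto the pod template (the StatefulSet name matches the pod names in the logs below; adjust for your release). Linkerd reads config.linkerd.io annotations from the pod, so they must land on the pod template, not just on the StatefulSet object:

    kubectl patch statefulset maria-mb-mariadb-galera --type merge -p '
    spec:
      template:
        metadata:
          annotations:
            config.linkerd.io/opaque-ports: "3306,4444,4567,4568"'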

Logs, error output, etc

MariaDB Galera log

2021-02-01 23:42:58 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
	 at gcomm/src/pc.cpp:connect():160
2021-02-01 23:42:58 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():220: Failed to open backend connection: -110 (Connection timed out)
2021-02-01 23:42:58 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1632: Failed to open channel 'galera' at 'gcomm://maria-mb-mariadb-galera-headless.default.svc.cluster.local': -110 (Connection timed out)
2021-02-01 23:42:58 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2021-02-01 23:42:58 0 [ERROR] WSREP: wsrep::connect(gcomm://maria-mb-mariadb-galera-headless.default.svc.cluster.local) failed: 7
2021-02-01 23:42:58 0 [ERROR] Aborting

linkerd-proxy

[    33.160529s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:54426 target.addr=10.96.0.105:4567}: linkerd_proxy_http::detect: Could not detect protocol read=0
[    33.160673s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:54426 target.addr=10.96.0.105:4567}: linkerd_detect: Detected protocol=None elapsed=3.78507ms
[    33.160890s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:41766 target.addr=10.96.0.105:4567}:tcp: linkerd_tls::client: Peer does not support TLS reason=not_provided_by_service_discovery
[    33.160976s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:41766 target.addr=10.96.0.105:4567}:tcp: linkerd_proxy_transport::connect: Connecting peer.addr=10.96.0.105:4567
[    33.161332s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:41766 target.addr=10.96.0.105:4567}:tcp: linkerd_proxy_transport::connect: Connected local.addr=10.96.1.77:54432 keepalive=Some(10s)
[    33.161426s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:41766 target.addr=10.96.0.105:4567}:tcp: linkerd_proxy_transport::metrics: client connection open
[    33.162021s]  INFO ThreadId(01) outbound:accept{peer.addr=10.96.1.77:54426 target.addr=10.96.0.105:4567}: linkerd_app_core::serve: Connection closed error=server: Transport endpoint is not connected (os error 107)
[    37.209900s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:44964 target.addr=10.96.1.77:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=117
[    37.210226s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:44964 target.addr=10.96.1.77:4191}: linkerd_app_core::serve: Connection closed
[    38.026182s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:54750 target.addr=10.96.0.105:4567}: linkerd_proxy_http::detect: Could not detect protocol read=0
[    38.026219s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:54750 target.addr=10.96.0.105:4567}: linkerd_detect: Detected protocol=None elapsed=3.001404001s
[    38.026294s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:41766 target.addr=10.96.0.105:4567}:tcp: linkerd_tls::client: Peer does not support TLS reason=not_provided_by_service_discovery
[    38.026309s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:41766 target.addr=10.96.0.105:4567}:tcp: linkerd_proxy_transport::connect: Connecting peer.addr=10.96.0.105:4567
[    38.026697s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:41766 target.addr=10.96.0.105:4567}:tcp: linkerd_proxy_transport::connect: Connected local.addr=10.96.1.77:56448 keepalive=Some(10s)
[    38.026741s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:41766 target.addr=10.96.0.105:4567}:tcp: linkerd_proxy_transport::metrics: client connection open
[    38.027264s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:54750 target.addr=10.96.0.105:4567}: linkerd_app_core::serve: Connection closed
[    41.011166s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:47324 target.addr=10.96.1.77:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=118
[    41.011369s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:47324 target.addr=10.96.1.77:4191}: linkerd_app_core::serve: Connection closed
[    42.028795s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:56804 target.addr=10.96.0.105:4567}: linkerd_proxy_http::detect: Could not detect protocol read=0
[    42.028832s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:56804 target.addr=10.96.0.105:4567}: linkerd_detect: Detected protocol=None elapsed=3.002261786s
[    42.029022s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:41766 target.addr=10.96.0.105:4567}:tcp: linkerd_tls::client: Peer does not support TLS reason=not_provided_by_service_discovery
[    42.029039s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:41766 target.addr=10.96.0.105:4567}:tcp: linkerd_proxy_transport::connect: Connecting peer.addr=10.96.0.105:4567
[    42.030778s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:41766 target.addr=10.96.0.105:4567}:tcp: linkerd_proxy_transport::connect: Connected local.addr=10.96.1.77:58150 keepalive=Some(10s)
[   352.706744s] DEBUG ThreadId(01) poll_profile: linkerd_service_profiles::client: profile received: DestinationProfile { fully_qualified_name: "", opaque_protocol: false, routes: [], retry_budget: Some(RetryBudget { retry_ratio: 0.2, min_retries_per_second: 10, ttl: Some(Duration { seconds: 10, nanos: 0 }) }), dst_overrides: [], endpoint: Some(WeightedAddr { addr: Some(TcpAddress { ip: Some(IpAddress { ip: Some(Ipv4(174064242)) }), port: 4567 }), weight: 10000, metric_labels: {"statefulset": "maria-mb-mariadb-galera", "serviceaccount": "default", "pod": "maria-mb-mariadb-galera-0", "namespace": "default"}, tls_identity: None, protocol_hint: None, authority_override: None }) }
[   352.719642s] DEBUG ThreadId(01) dst: linkerd_dns: resolve_a name=linkerd-dst-headless.linkerd.svc.cluster.local
[   352.741682s] DEBUG ThreadId(01) dst: linkerd_proxy_dns_resolve: addrs=[10.96.2.99:8086]
[   355.704747s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.79:53922 target.addr=10.96.2.114:4567}: linkerd_proxy_http::detect: Could not detect protocol read=0
[   355.704794s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.79:53922 target.addr=10.96.2.114:4567}: linkerd_detect: Detected protocol=None elapsed=2.997837612s
[   355.704811s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.79:53922 target.addr=10.96.2.114:4567}: linkerd_cache: Caching new service target=(None, Logical { orig_dst: 10.96.2.114:4567, protocol: (), profile: Some(..) })
[   355.704958s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.79:53922 target.addr=10.96.2.114:4567}:tcp: linkerd_tls::client: Peer does not support TLS reason=not_provided_by_service_discovery
[   355.704971s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.79:53922 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::connect: Connecting peer.addr=10.96.2.114:4567
[   355.706766s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.79:53922 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::connect: Connected local.addr=10.96.1.79:54284 keepalive=Some(10s)
[   355.706801s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.79:53922 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::metrics: client connection open
[   355.707402s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.79:53922 target.addr=10.96.2.114:4567}: linkerd_app_core::serve: Connection closed
[   355.707431s] DEBUG ThreadId(01) evict{target=(None, Logical { orig_dst: 10.96.2.114:4567, protocol: (), profile: Some(..) })}: linkerd_cache: Awaiting idleness
[   355.707456s] DEBUG ThreadId(01) evict{target=Accept { orig_dst: 10.96.2.114:4567, protocol: () }}: linkerd_cache: Awaiting idleness
[   356.215757s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:38092 target.addr=10.96.1.79:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=118
[   356.215986s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:38092 target.addr=10.96.1.79:4191}: linkerd_app_core::serve: Connection closed

If I skip port 4567 and keep the rest opaque:

        config.linkerd.io/opaque-ports: "3306,4444,4568"
        config.linkerd.io/proxy-log-level: linkerd=debug,warn
        config.linkerd.io/skip-inbound-ports: "4567"
        config.linkerd.io/skip-outbound-ports: "4567"
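With the skip annotations applied, port 4567 should show up in the proxy-init iptables skip rules; one way to spot-check this (assuming the init container keeps its default name, linkerd-init):

    kubectl logs maria-mb-mariadb-galera-0 -c linkerd-init | grep 4567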

I get this in linkerd-proxy:

[   358.027677s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::connect: Connected local.addr=10.96.1.77:34574 keepalive=Some(10s)
[   358.027734s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::metrics: client connection open
[   358.028158s]  INFO ThreadId(01) outbound:accept{peer.addr=10.96.1.77:47134 target.addr=10.96.2.114:4567}: linkerd_app_core::serve: Connection closed error=server: Transport endpoint is not connected (os error 107)
[   361.011275s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:48768 target.addr=10.96.1.77:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=118
[   361.011473s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:48768 target.addr=10.96.1.77:4191}: linkerd_app_core::serve: Connection closed
[   362.027728s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:47702 target.addr=10.96.2.114:4567}: linkerd_proxy_http::detect: Could not detect protocol read=0
[   362.027767s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:47702 target.addr=10.96.2.114:4567}: linkerd_detect: Detected protocol=None elapsed=3.000316309s
[   362.027837s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_tls::client: Peer does not support TLS reason=not_provided_by_service_discovery
[   362.027858s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::connect: Connecting peer.addr=10.96.2.114:4567
[   362.028277s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::connect: Connected local.addr=10.96.1.77:37212 keepalive=Some(10s)
[   362.028301s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::metrics: client connection open
[   362.028867s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:47702 target.addr=10.96.2.114:4567}: linkerd_app_core::serve: Connection closed
[   366.028201s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:48944 target.addr=10.96.2.114:4567}: linkerd_proxy_http::detect: Could not detect protocol read=0
[   366.028238s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:48944 target.addr=10.96.2.114:4567}: linkerd_detect: Detected protocol=None elapsed=3.000246505s
[   366.028309s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_tls::client: Peer does not support TLS reason=not_provided_by_service_discovery
[   366.028339s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::connect: Connecting peer.addr=10.96.2.114:4567
[   366.028774s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::connect: Connected local.addr=10.96.1.77:38888 keepalive=Some(10s)
[   366.028804s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::metrics: client connection open
[   366.029179s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:48944 target.addr=10.96.2.114:4567}: linkerd_app_core::serve: Connection closed
[   367.209485s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:58334 target.addr=10.96.1.77:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=117
[   367.209661s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:58334 target.addr=10.96.1.77:4191}: linkerd_app_core::serve: Connection closed
[   368.032129s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:53830 target.addr=10.96.2.114:4567}: linkerd_proxy_http::detect: Could not detect protocol read=0
[   368.032171s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:53830 target.addr=10.96.2.114:4567}: linkerd_detect: Detected protocol=None elapsed=503.559811ms
[   368.032242s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_tls::client: Peer does not support TLS reason=not_provided_by_service_discovery
[   368.032260s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::connect: Connecting peer.addr=10.96.2.114:4567
[   368.032759s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::connect: Connected local.addr=10.96.1.77:39062 keepalive=Some(10s)
[   368.032789s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:39648 target.addr=10.96.2.114:4567}:tcp: linkerd_proxy_transport::metrics: client connection open
[   368.033258s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.96.1.77:53830 target.addr=10.96.2.114:4567}: linkerd_app_core::serve: Connection closed
[   371.011231s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:48878 target.addr=10.96.1.77:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=118
[   371.011416s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:48878 target.addr=10.96.1.77:4191}: linkerd_app_core::serve: Connection closed
[   373.034462s] DEBUG ThreadId(01) evict{target=(None, Logical { orig_dst: 10.96.2.114:4567, protocol: (), profile: Some(..) })}: linkerd_cache: Cache entry dropped
[   373.034555s] DEBUG ThreadId(01) evict{target=Accept { orig_dst: 10.96.2.114:4567, protocol: () }}: linkerd_cache: Cache entry dropped
[   377.209477s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:51376 target.addr=10.96.1.77:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=117
[   377.209679s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:51376 target.addr=10.96.1.77:4191}: linkerd_app_core::serve: Connection closed
[   381.011413s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:52776 target.addr=10.96.1.77:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=118
[   381.012072s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:52776 target.addr=10.96.1.77:4191}: linkerd_app_core::serve: Connection closed
[   387.209370s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:40766 target.addr=10.96.1.77:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=117
[   387.209625s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:40766 target.addr=10.96.1.77:4191}: linkerd_app_core::serve: Connection closed
[   391.011323s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:40774 target.addr=10.96.1.77:4191}: linkerd_tls::server: Peeked bytes from TCP stream sz=118
[   391.011661s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.96.1.1:40774 target.addr=10.96.1.77:4191}: linkerd_app_core::serve: Connection closed

MariaDB Galera log

2021-02-01 23:49:01 0 [Note] WSREP: (0f041dd0-a4ae, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2021-02-01 23:49:01 0 [Note] WSREP: (0f041dd0-a4ae, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2021-02-01 23:49:01 0 [Note] WSREP: EVS version 1
2021-02-01 23:49:01 0 [Note] WSREP: gcomm: connecting to group 'galera', peer 'maria-mb-mariadb-galera-headless.default.svc.cluster.local:'
2021-02-01 23:49:04 0 [Note] WSREP: (0f041dd0-a4ae, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.96.0.105:4567 timed out, no messages seen in PT3S, socket stats: rtt: 31 rttvar: 15 rto: 200000 lost: 0 last_data_recv: 3001 cwnd: 10 last_queued_since: 3000339972 last_delivered_since: 3000339972 send_queue_length: 0 send_queue_bytes: 0
2021-02-01 23:49:04 0 [Note] WSREP: EVS version upgrade 0 -> 1
2021-02-01 23:49:04 0 [Note] WSREP: PC protocol upgrade 0 -> 1
2021-02-01 23:49:04 0 [Warning] WSREP: no nodes coming from prim view, prim not possible
2021-02-01 23:49:04 0 [Note] WSREP: view(view_id(NON_PRIM,0f041dd0-a4ae,1) memb {
	0f041dd0-a4ae,0
} joined {
} left {
} partitioned {
})
2021-02-01 23:49:04 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.50149S), skipping check
2021-02-01 23:49:08 0 [Note] WSREP: (0f041dd0-a4ae, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.96.0.105:4567 timed out, no messages seen in PT3S, socket stats: rtt: 27 rttvar: 13 rto: 201000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000068129 last_delivered_since: 3000068129 send_queue_length: 0 send_queue_bytes: 0
2021-02-01 23:49:12 0 [Note] WSREP: (0f041dd0-a4ae, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.96.0.105:4567 timed out, no messages seen in PT3S, socket stats: rtt: 36 rttvar: 18 rto: 200000 lost: 0 last_data_recv: 3001 cwnd: 10 last_queued_since: 3000115008 last_delivered_since: 3000115008 send_queue_length: 0 send_queue_bytes: 0
2021-02-01 23:49:33 0 [Note] WSREP: (0f041dd0-a4ae, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.96.0.105:4567 timed out, no messages seen in PT3S, socket stats: rtt: 41 rttvar: 20 rto: 200000 lost: 0 last_data_recv: 3001 cwnd: 10 last_queued_since: 3000936789 last_delivered_since: 3000936789 send_queue_length: 0 send_queue_bytes: 0
2021-02-01 23:49:34 0 [Note] WSREP: PC protocol downgrade 1 -> 0
2021-02-01 23:49:34 0 [Note] WSREP: view((empty))
2021-02-01 23:49:34 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
	 at gcomm/src/pc.cpp:connect():160
2021-02-01 23:49:34 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():220: Failed to open backend connection: -110 (Connection timed out)
2021-02-01 23:49:34 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1632: Failed to open channel 'galera' at 'gcomm://maria-mb-mariadb-galera-headless.default.svc.cluster.local': -110 (Connection timed out)
2021-02-01 23:49:34 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2021-02-01 23:49:34 0 [ERROR] WSREP: wsrep::connect(gcomm://maria-mb-mariadb-galera-headless.default.svc.cluster.local) failed: 7
2021-02-01 23:49:34 0 [ERROR] Aborting

Environment

  • Kubernetes Version: 1.16
  • Cluster Environment: GKE
  • Linkerd version: linkerd-edge-21.1.3

Possible solution

Anything using the gcomm protocol must be bypassed in the proxy-init iptables rules via the skip-inbound-ports and skip-outbound-ports annotations.
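A minimal sketch of that workaround (assuming 4567 is the only gcomm port in play, as in the repro above; the remaining ports stay opaque):

        config.linkerd.io/opaque-ports: "3306,4444,4568"
        config.linkerd.io/skip-inbound-ports: "4567"
        config.linkerd.io/skip-outbound-ports: "4567"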

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 20 (20 by maintainers)

Most upvoted comments

@barkardk That’s great news! I’m so glad that we’ve finally got this working for you 😃 I’d expect stable-2.10.0 shortly (hopefully tomorrow).

Closing this for now but, as always, let us know if you see anything unexpected.

@olix0r I tested using the latest edge as you suggested.
This is what I did (the sketch after this list shows the kubectl commands for steps b and c):
a) Injected linkerd into a running Galera cluster and waited for the rolling restart to complete. This resulted in hanging nodes and an inconsistent Galera cluster state; I do not recommend this.
b) Scaled replicas down to 1, injected linkerd into the StatefulSet, restarted the pod and waited for it to come up (this means a little downtime, of course), then scaled up to 3 replicas. This exercises Incremental State Transfer over port 4568. It worked perfectly; the new pods joined without any issues.
c) Scaled down to 1 pod, force-deleted the PVCs for the other 2 pods, and then scaled up to 3 again. Because the nodes are now brand new, this forces rsync, the SST method we have selected, to start up; that is the port 4444 traffic. No issues.
d) Randomly restarted pods, and kept one pod scaled down for a while, to verify that Galera healed successfully while meshed. No issues.
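For reference, steps b and c roughly correspond to these commands (the StatefulSet and PVC names are illustrative; check yours with kubectl get statefulset,pvc):

    # step b: shrink to one node, inject, then grow back
    kubectl scale statefulset mysql --replicas=1
    # ...inject linkerd, wait for mysql-0 to become Ready...
    kubectl scale statefulset mysql --replicas=3    # new pods join via IST (port 4568)

    # step c: wipe the other nodes' state to force a full rsync SST (port 4444)
    kubectl scale statefulset mysql --replicas=1
    kubectl delete pvc data-mysql-1 data-mysql-2
    kubectl scale statefulset mysql --replicas=3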

linkerd viz shows this

linkerd-edge-21.3.2 viz edges po
SRC                           DST       SRC_NS        DST_NS    SECURED
mysql-0                       mysql-1   default       default   √
mysql-2                       mysql-0   default       default   √
mysql-2                       mysql-1   default       default   √
prometheus-54574df8b8-nvm86   mysql-0   linkerd-viz   default   √
prometheus-54574df8b8-nvm86   mysql-1   linkerd-viz   default   √
prometheus-54574df8b8-nvm86   mysql-2   linkerd-viz   default   √

So I am going to state that the latest edge solves our Galera problem and this issue may be closed.
Thank you so much for all the hard work @olix0r

My steps are as follows:

# add the Bitnami repo and install a Galera cluster
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install mysql bitnami/mariadb-galera
# install linkerd, then inject the proxy into the statefulset
linkerd-edge-21.1.3 install | kubectl apply -f -
kubectl get statefulset <mygaleracluster> -o yaml | linkerd-edge-21.1.3 inject - | kubectl apply -f -

and then I manually edit the annotations and add this

annotations:
        config.linkerd.io/opaque-ports: "3306,4444,4568,4567"
        config.linkerd.io/proxy-log-level: linkerd=debug,warn
        config.linkerd.io/skip-inbound-ports: ""
        config.linkerd.io/skip-outbound-ports: ""
        linkerd.io/inject: enabled

and then I always verify that the iptables rules look correct via kubectl logs mysql-2 -c linkerd-init