metallb: Connections hang and timeout intermittently
Bug Report
What happened: curl intermittently hangs indefinitely when connecting to a service External-IP, e.g. for:
> kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
XXX LoadBalancer 10.104.238.24 172.16.25.100 80:30997/TCP,4949:32000/TCP 29d
We see this behaviour:
> curl 172.16.25.103/ping
OK
> curl 172.16.25.103/ping
curl: (7) Failed to connect to 172.16.25.103 port 80: Operation timed out
And inside the cluster:
> curl 10.104.238.24/ping
OK
> curl 10.104.238.24/ping
OK
> curl 10.104.238.24/ping
OK
It works every time.
What you expected to happen: I would expect the External-IP to behave as close to identically to the Cluster-IP as possible.
How to reproduce it (as minimally and precisely as possible): Curl an address being announced by MetalLB; eventually the request will hang and time out.
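A minimal repro loop, assuming the External-IP from the service listing above (the request count, timeout, and /ping path are only illustrative):
# Repeatedly curl the announced address; a connect timeout makes hangs show up as failures.
EXTERNAL_IP=172.16.25.100   # External-IP from `kubectl get services`
for i in $(seq 1 100); do
  curl --connect-timeout 5 -s -o /dev/null -w "%{http_code}\n" "http://${EXTERNAL_IP}/ping" \
    || echo "request ${i} failed"
done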
Anything else we need to know?: Not sure if it’s relevant, but here’s a sample log from one of the speakers:
{"caller":"arp.go:102","interface":"bond0.16","ip":"172.16.13.90","msg":"got ARP request for service IP, sending response","responseMAC":"00:25:XXX","senderIP":"172.16.4.43","senderMAC":"00:23:XXX","ts":"2019-01-14T22:12:57.113625497Z"}
{"caller":"arp.go:102","interface":"enp1s0f1","ip":"172.16.13.90","msg":"got ARP request for service IP, sending response","responseMAC":"00:25:XXX","senderIP":"172.16.4.43","senderMAC":"00:23:XXX","ts":"2019-01-14T22:12:57.113625477Z"}
Environment:
- MetalLB version: 0.7.3
- Kubernetes version: 1.13.2
- BGP router type/version: Layer2
- OS (e.g. from /etc/os-release):
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
- Kernel (e.g. uname -a): 4.19.11-1.el7.elrepo.x86_64
About this issue
- State: closed
- Created 5 years ago
- Comments: 15 (2 by maintainers)
Has anyone found a solution to this problem? We are having a similar problem, with the only difference being that we see this issue when we have more than one replica of the Traefik ingress controller running in our K8s cluster. This is a major issue for us. I have searched the #traefik and #metallb Slack channels and have come up with nothing.
We are running K8s v1.18 and Traefik 2.3.2.
I have also seen this issue in clusters where kube-proxy’s iptables get out of sync. What you can do to test this is to use something like the following script to check whether the iptables rules (as configured by kube-proxy) on all of the cluster’s nodes are working as they should.
Just change this part:
EXTERNAL_SERVICE_FQDN="<Add the FQDN of the URL you would like to access.>"
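The script itself isn’t included in this excerpt, so here is a rough sketch of the idea under stated assumptions (it is not the original script): walk every node’s InternalIP and curl the service’s NodePort there, pinning the FQDN to the node address so the Host header is preserved for the ingress controller. The NodePort value and the /ping path are placeholders.
#!/usr/bin/env bash
# Sketch only: check that kube-proxy's iptables rules on every node forward traffic
# for the service. An HTTP code of 000 means the connection failed or timed out there.
EXTERNAL_SERVICE_FQDN="<Add the FQDN of the URL you would like to access.>"
NODE_PORT=30997   # NodePort from `kubectl get services` (placeholder)

for node in $(kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'); do
  code=$(curl --connect-timeout 5 -s -o /dev/null -w '%{http_code}' \
              --resolve "${EXTERNAL_SERVICE_FQDN}:${NODE_PORT}:${node}" \
              "http://${EXTERNAL_SERVICE_FQDN}:${NODE_PORT}/ping")
  echo "node ${node}: HTTP ${code}"
done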