metallb: Connections hang and timeout intermittently

Bug Report Connections hang and timeout intermittently

What happened: Curl intermittently hangs indefinitely for service External-IP, e.g. for:

> kubectl get services
NAME            TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)                       AGE
XXX        LoadBalancer   10.104.238.24    172.16.25.100   80:30997/TCP,4949:32000/TCP   29d

We see this behaviour:

> curl 172.16.25.103/ping
OK
> curl 172.16.25.103/ping
curl: (7) Failed to connect to 172.16.25.103 port 80: Operation timed out

And inside the cluster:

> curl 10.104.238.24/ping
OK
> curl 10.104.238.24/ping
OK
> curl 10.104.238.24/ping
OK

It works every time.

What you expected to happen: I would expect the External-IP to behave, as far as possible, identically to the Cluster-IP.

How to reproduce it (as minimally and precisely as possible): Curl an address announced by MetalLB; eventually a request will hang and time out (a small loop that does this is sketched below).
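
Something like the loop below makes the hangs easier to spot, since each request is cut off after a bounded timeout instead of blocking the shell (just a sketch; the IP and path are taken from the example above and --max-time 5 is an arbitrary cutoff):

#!/bin/bash
# Repeatedly curl the MetalLB External-IP and report which requests hang or fail.
EXTERNAL_IP="172.16.25.100"   # External-IP from `kubectl get services`
for i in $(seq 1 50) ; do
  if curl -s --max-time 5 "http://${EXTERNAL_IP}/ping" > /dev/null ; then
    echo "request ${i}: OK"
  else
    echo "request ${i}: FAILED (hung or refused)"
  fi
done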

Anything else we need to know?: Not sure if it’s relevant, but here’s a sample log from one of the speakers:

{"caller":"arp.go:102","interface":"bond0.16","ip":"172.16.13.90","msg":"got ARP request for service IP, sending response","responseMAC":"00:25:XXX","senderIP":"172.16.4.43","senderMAC":"00:23:XXX","ts":"2019-01-14T22:12:57.113625497Z"}
{"caller":"arp.go:102","interface":"enp1s0f1","ip":"172.16.13.90","msg":"got ARP request for service IP, sending response","responseMAC":"00:25:XXX","senderIP":"172.16.4.43","senderMAC":"00:23:XXX","ts":"2019-01-14T22:12:57.113625477Z"}

Environment:

  • MetalLB version: 0.7.3
  • Kubernetes version: 1.13.2
  • BGP router type/version: Layer2
  • OS (e.g. from /etc/os-release):
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
  • Kernel (e.g. uname -a): 4.19.11-1.el7.elrepo.x86_64

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 15 (2 by maintainers)

Most upvoted comments

Has anyone found a solution to this problem? We are having a similar problem, the only difference being that we see this issue when we have more than one replica of the Traefik ingress controller running in our K8s cluster. This is a major issue for us. I have searched the #traefik and #metallb Slack channels and have come up with nothing.

We are running K8s v1.18 and Traefik 2.3.2.

I have also seen this issue in clusters where kube-proxy’s iptables rules get out of sync. To test this, you can use something like the following script to check whether the iptables rules (as configured by kube-proxy) on every node in the cluster are working as they should.

Just change this part:
EXTERNAL_SERVICE_FQDN="<Add the FQDN of the URL you would like to access. >"

#!/bin/bash

set -e

# Test whether all the nodes have a current iptables listing
# Riaan Labuschagne
# 27012019
# v1.0

# Some vars
EXTERNAL_SERVICE_FQDN="<Add the FQDN of the URL you would like to access. >"
K8S_NODES="$( kubectl get nodes -o name | cut -d\/ -f2  )"
SERVICE_IP=$(dig +short ${EXTERNAL_SERVICE_FQDN} | tail -n 1 )   # resolved IP of the external service

# Check each node
for K8S_NODE in ${K8S_NODES} ; do
  echo ""
  CHECK_STATUS="PASSED"
  echo " Checking routing in ${K8S_NODE} ..."
  for INCREMENT in {1..4} ; do
    if ! ssh -n -o ConnectTimeout=2 root@${K8S_NODE} "nc -w 2 -zvnt ${SERVICE_IP} 443" ; then
      CHECK_STATUS="FAILED"
    fi
  echo "${K8S_NODE} has ${CHECK_STATUS} the routing test check on the ${INCREMENT} increment."
  done
  if [[ "${CHECK_STATUS}" == "FAILED"  ]]; then
    echo "  ERROR: iptables routing is out of sync on ${K8S_NODE}"
    echo "  Restarting kube-proxy on node ..."
    kubectl get pods --all-namespaces -o wide | grep kube-proxy | grep "$( echo ${K8S_NODE} | cut -d\. -f1 )" | awk '{print $1, $2}' | while read NAMESPACE POD ; do kubectl delete pod -n "${NAMESPACE}" "${POD}" ; done
    echo "  kube-proxy restarted on ${K8S_NODE}"
  else
    echo "kube-proxy routing is working in ${K8S_NODE}"
  fi
done
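
In case it saves someone time: the script assumes a working kubectl context and passwordless root SSH to every node; I run it roughly like this (check-node-routing.sh is just a placeholder filename):

# Save the script, set EXTERNAL_SERVICE_FQDN inside it, then run it from a
# host that has kubectl access and root SSH to all of the nodes.
chmod +x check-node-routing.sh
./check-node-routing.sh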