contour: Envoy does not receive updates from Contour anymore

What steps did you take and what happened: The problem is intermittent. Sometimes, after deploying a workload or adding a new IngressRoute, Envoy does not update its configuration on all pods. It does not happen all the time, and when it does, Envoy and Contour appear to be stuck in a state that only killing the Envoy or Contour pod can fix. Only a few Envoy pods end up with stale configuration, but that is enough to get 503 responses from Envoy, because the IP of the target pod is no longer up to date.
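A quick way to confirm the stale-endpoint theory when it happens is to compare the endpoints Kubernetes reports for the affected service with the endpoints Envoy is still routing to. This is only a sketch: the service name, namespace, and pod name are placeholders, and it assumes the Envoy admin interface listens on port 9001 inside the pod (as in the example Contour deployment manifests).

# Endpoints Kubernetes currently knows about for the backend (placeholder names).
$ kubectl get endpoints my-app -n my-namespace

# Endpoints a stuck Envoy pod is still routing to, read from its admin interface.
# <envoy-pod> and <envoy-namespace> are placeholders; admin port 9001 is an assumption.
$ kubectl port-forward -n <envoy-namespace> <envoy-pod> 9001:9001 &
$ curl -s http://localhost:9001/clusters | grep my-namespace/my-app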

What did you expect to happen: I realize that network issues can happen, and with them a loss of the connection between Envoy and Contour. However, it should reconnect properly without any human intervention.

Anything else you would like to add: Our setup:

  • Envoy is deployed as a DaemonSet (on around 15 nodes)
  • Contour is deployed as a Deployment with 1 replica

I took a look at the logs of Contour and Envoy and there is nothing substantial. We do see these entries:
forcing update" context=HoldoffNotifier last_update=5.75642593s pending=1 performing delayed 
update" context=HoldoffNotifier last_update=237.860631ms pending=39 ...

By looking at the Envoy metrics, we could see a lot of configuration update failures (RDS and CDS); a way to pull these counters from the Envoy admin interface is sketched after the cluster stats below.

Envoy configuration for the contour cluster:

"static_resources": {
    "clusters": [
     {
      "name": "contour",
      "type": "STRICT_DNS",
      "connect_timeout": "5s",
      "circuit_breakers": {
       "thresholds": [
        {
         "priority": "HIGH",
         "max_connections": 100000,
         "max_pending_requests": 100000,
         "max_requests": 60000000,
         "max_retries": 50
        },
        {
         "max_connections": 100000,
         "max_pending_requests": 100000,
         "max_requests": 60000000,
         "max_retries": 50
        }
       ]
      },
      "http2_protocol_options": {},
      "alt_stat_name": "heptio-contour_contour_8001",
      "load_assignment": {
       "cluster_name": "contour",
       "endpoints": [
        {
         "lb_endpoints": [
          {
           "endpoint": {
            "address": {
             "socket_address": {
              "address": "contour",
              "port_value": 8001
             }
            }
           }
          }
         ]
        }
       ]
      }
     },
Cluster
----------------------
contour::default_priority::max_connections::100000
contour::default_priority::max_pending_requests::100000
contour::default_priority::max_requests::60000000
contour::default_priority::max_retries::50
contour::high_priority::max_connections::100000
contour::high_priority::max_pending_requests::100000
contour::high_priority::max_requests::60000000
contour::high_priority::max_retries::50
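For reference, the xDS update counters mentioned above can be read from the Envoy admin interface. A minimal sketch, assuming the admin listener is on port 9001 inside the Envoy pod and using placeholder pod/namespace names:

# Pull the CDS/RDS/LDS update counters from one Envoy pod via a port-forward.
# Non-zero update_failure values are the symptom described above.
$ kubectl port-forward -n <envoy-namespace> <envoy-pod> 9001:9001 &
$ curl -s http://localhost:9001/stats | grep -E '(cds|rds|lds).*update_(attempt|success|failure)'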

Environment:

  • Contour version: 0.13.0
  • Envoy: 1.10
  • Kubernetes version: 1.13
  • Rancher: 2.2.3
  • Cloud provider: OpenStack
  • OS: 4.19.43-coreos

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 27 (21 by maintainers)

Most upvoted comments

I spent some time looking at the netstat information on each side (Contour and Envoy), and at this point it’s pretty clear that 15 out of the 16 TCP connections are half-open. This explains why Envoy is not receiving any new xDS updates over gRPC.

Below is a recap of the investigation.


First, we locate the node where the Contour pod is running:

$ kubectl get pod -l app=contour -n runway-ingress -o json | jq '.items[0].spec.nodeName'
"runway-uks-ncsa-east-ne1-worker8"

Then we SSH to the node (runway-uks-ncsa-east-ne1-worker8) and find Contour’s Docker container ID:

runway-uks-ncsa-east-ne1-worker8$ docker ps -f "label=io.kubernetes.container.name=contour"
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
8a87e0d7b76d        978cfcd67636        "contour serve --inc…"   12 days ago         Up 12 days                              k8s_contour_contour-5cf94978b5-sqht6_runway-ingress_19c6ebe4-ea06-11e9-bc68-fa163e79a2f2_0

We use netstat, via nsenter, to list the open connections within that pod’s network namespace:

runway-uks-ncsa-east-ne1-worker8$ nsenter -t $(docker inspect -f '{{.State.Pid}}' 8a87e0d7b76d) -n netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 runway-uks-ncsa-e:42298 192.168.128.1:https     ESTABLISHED
tcp6       0      0 192.168.10.66:8000      192.168.14.11:57150     TIME_WAIT
tcp6       0      0 192.168.10.66:8001      192.168.5.0:51586       ESTABLISHED
tcp6       0      0 192.168.10.66:8000      testy.testzonecml:49496 TIME_WAIT
tcp6       0      0 192.168.10.66:8000      192.168.14.11:56938     TIME_WAIT
tcp6       0      0 192.168.10.66:8000      testy.testzonecml:49756 TIME_WAIT
Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags       Type       State         I-Node   Path

We are looking for connections to Contour’s xDS listener on port 8001, and we see only one:

tcp6       0      0 192.168.10.66:8001      192.168.5.0:51586       ESTABLISHED

Already we know there’s a problem, because we would expect N open connections for N running Envoy pods (N=16 in our environment).
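A quick way to sanity-check this (a sketch; it assumes the Envoy DaemonSet pods carry the label app=envoy, as in the example manifests) is to compare the number of running Envoy pods with the number of ESTABLISHED connections to :8001 inside the Contour pod:

# Number of Envoy pods that should each hold one gRPC connection to Contour.
$ kubectl get pods -n runway-ingress -l app=envoy --field-selector=status.phase=Running --no-headers | wc -l

# Number of xDS connections Contour actually sees (run on the Contour node, same container as above).
$ nsenter -t $(docker inspect -f '{{.State.Pid}}' 8a87e0d7b76d) -n netstat -tn | awk '$4 ~ /:8001$/ && $6 == "ESTABLISHED"' | wc -l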

The foreign address 192.168.5.0:51586 identifies the peer node within the overlay network. To map it back to a node, we check the spec.podCIDR attribute of every node and look for 192.168.5.0/24:

$ kubectl get nodes -o json | jq '.items[] | select(.spec.podCIDR == "192.168.5.0/24") | .metadata.name'
"runway-uks-ncsa-east-ne1-etcd4"

Now we know that the only connected Envoy pod is the one running on runway-uks-ncsa-east-ne1-etcd4.
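To see which Envoy pod that is, we can list the Envoy pods scheduled on that node (a sketch; the app=envoy label selector is an assumption, adjust to however your DaemonSet labels its pods):

$ kubectl get pods -n runway-ingress -l app=envoy -o wide --field-selector spec.nodeName=runway-uks-ncsa-east-ne1-etcd4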

We confirm this from the other side by SSHing to that node. First, we find the Docker container for Envoy:

runway-uks-ncsa-east-ne1-etcd4$ docker ps -f "label=io.kubernetes.container.name=envoy"
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
012018109ad8        d3340c53fdcd        "envoy -c /config/co…"   2 weeks ago         Up 2 weeks                              k8s_envoy_envoy-gwc6q_runway-ingress_a32ed7ee-e6dc-11e9-bc68-fa163e79a2f2_0

Knowing the container ID, we resolve its PID and then list the open connections with netstat, via nsenter:

runway-uks-ncsa-east-ne1-etcd4$ nsenter -t $(docker inspect -f '{{.State.Pid}}' 012018109ad8) -n netstat | grep 8001
tcp        0      0 runway-uks-ncsa-e:51586 192.168.138.21:8001     ESTABLISHED

The destination address is 192.168.138.21:8001, which matches the ClusterIP of the contour service:

$ kubectl get svc contour -n runway-ingress
NAME      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
contour   ClusterIP   192.168.138.21   <none>        8001/TCP   17d

This connection is legit. It is ESTABLISHED on both sides and the addresses are those we expect.

Next, we pick any other node and check whether it thinks it has an open connection from Envoy to Contour. For example, runway-uks-ncsa-east-ne1-worker1:

runway-uks-ncsa-east-ne1-worker1$ docker ps -f "label=io.kubernetes.container.name=envoy"
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
b4911deeb5c8        d3340c53fdcd        "envoy -c /config/co…"   2 weeks ago         Up 2 weeks                              k8s_envoy_envoy-xcf9j_runway-ingress_b6896a5f-e6dc-11e9-bc68-fa163e79a2f2_0

We run netstat via nsenter, looking for port :8001:

runway-uks-ncsa-east-ne1-worker1$ nsenter -t $(docker inspect -f '{{.State.Pid}}' b4911deeb5c8) -n netstat | grep :8001
tcp        0      0 runway-uks-ncsa-e:37230 192.168.138.21:8001     ESTABLISHED

Indeed, this side thinks it is connected to 192.168.138.21:8001, but there is no sign of this connection on the other side. Envoy thinks it has a live connection, but that’s not true.

At this point, we have identified a half-open TCP connection: Contour has dropped its end, while Envoy still believes the connection is alive. This is not discoverable from Envoy’s side without some kind of keep-alive strategy.

This can be shortened to a one-liner and tried on each of the 16 nodes:

$ nsenter -t $(docker inspect -f '{{.State.Pid}}' $(docker ps -qf 'label=io.kubernetes.container.name=envoy')) -n netstat | grep tcp | grep :8001
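If SSH access to the nodes is available, the same check can be scripted across the fleet. A rough sketch, assuming passwordless SSH with enough privileges to run docker and nsenter on each node:

# Run the one-liner on every node and print which ones report a connection to :8001.
for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
  echo "== $node =="
  ssh "$node" "nsenter -t \$(docker inspect -f '{{.State.Pid}}' \$(docker ps -qf 'label=io.kubernetes.container.name=envoy')) -n netstat -tn | grep :8001"
done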

All 16 of them claim to be connected to Contour, but only one actually is, since Contour sees only a single ESTABLISHED connection on its side.

Since Envoy is the client to Contour, there is no way for Contour to force the client to re-establish a TCP connection.
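One client-side mitigation, sketched here rather than taken from the project, is to enable TCP keepalives on Envoy’s contour cluster so that the kernel notices the dead peer and Envoy tears the socket down and reconnects on its own. Envoy’s cluster configuration supports upstream_connection_options.tcp_keepalive (check that your Envoy version includes it); the probe values below are illustrative, not recommendations:

"clusters": [
 {
  "name": "contour",
  "type": "STRICT_DNS",
  "connect_timeout": "5s",
  "http2_protocol_options": {},
  "upstream_connection_options": {
   "tcp_keepalive": {
    "keepalive_probes": 3,
    "keepalive_time": 30,
    "keepalive_interval": 10
   }
  },
  "load_assignment": {
   "cluster_name": "contour",
   "endpoints": [
    {
     "lb_endpoints": [
      {
       "endpoint": {
        "address": {
         "socket_address": {
          "address": "contour",
          "port_value": 8001
         }
        }
       }
      }
     ]
    }
   ]
  }
 }
]

With these illustrative settings the kernel starts probing an idle connection after 30 seconds and drops it after 3 failed probes 10 seconds apart, at which point Envoy sees the socket error and re-establishes its gRPC stream to Contour.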

I’ve seen similar issues in #1514, which I’m actively working on now, but I’m interested to see that you’re running v0.13 and still hitting this, which makes me think it’s not the same issue.

I’ll post back with the resolution to my issue; possibly that will resolve your problem as well.