contour: Envoy no longer receives updates from Contour
**What steps did you take and what happened:**

The problem is intermittent. Sometimes, after deploying a workload or adding a new IngressRoute, Envoy does not update its configuration on all pods. This does not happen all the time, and when it does, Envoy and Contour seem stuck in a state that only killing the Envoy or Contour pod can fix. Only a few pods are affected, but that is enough to get 503 responses from Envoy, because the IP of the target pod is out of date.

**What did you expect to happen:**

I realize that network issues happen, and thus the connection between Envoy and Contour can be lost. However, it should reconnect properly without any human intervention.

**Anything else you would like to add:**

Our setup:
- Envoy is deployed as a daemonset (on around 15 nodes)
- Contour is deployed as a Deployment with 1 replica

I took a look at the logs of Contour and Envoy and there is nothing substantial. We do see these messages:
```
forcing update" context=HoldoffNotifier last_update=5.75642593s pending=1
performing delayed update" context=HoldoffNotifier last_update=237.860631ms pending=39
...
```
Looking at the Envoy metrics, we can see a lot of configuration update failures (RDS and CDS).

Envoy configuration for Contour:
```json
"static_resources": {
  "clusters": [
    {
      "name": "contour",
      "type": "STRICT_DNS",
      "connect_timeout": "5s",
      "circuit_breakers": {
        "thresholds": [
          {
            "priority": "HIGH",
            "max_connections": 100000,
            "max_pending_requests": 100000,
            "max_requests": 60000000,
            "max_retries": 50
          },
          {
            "max_connections": 100000,
            "max_pending_requests": 100000,
            "max_requests": 60000000,
            "max_retries": 50
          }
        ]
      },
      "http2_protocol_options": {},
      "alt_stat_name": "heptio-contour_contour_8001",
      "load_assignment": {
        "cluster_name": "contour",
        "endpoints": [
          {
            "lb_endpoints": [
              {
                "endpoint": {
                  "address": {
                    "socket_address": {
                      "address": "contour",
                      "port_value": 8001
                    }
                  }
                }
              }
            ]
          }
        ]
      }
    }
  ]
}
```
Cluster stats:

```
contour::default_priority::max_connections::100000
contour::default_priority::max_pending_requests::100000
contour::default_priority::max_requests::60000000
contour::default_priority::max_retries::50
contour::high_priority::max_connections::100000
contour::high_priority::max_pending_requests::100000
contour::high_priority::max_requests::60000000
contour::high_priority::max_retries::50
```
Environment:
- Contour version: 0.13.0
- Envoy: 1.10
- Kubernetes version: 1.13
- Rancher: 2.2.3
- Cloud provider: OpenStack
- OS: 4.19.43-coreos
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 27 (21 by maintainers)
I spent some time looking at the `netstat` information on each side (Contour and Envoy), and at this point it's pretty clear that 15 out of the 16 TCP connections are half-closed. This explains why Envoy is not receiving any new xDS updates over gRPC. Below is a recap of the investigation.
First, we locate the node where the Contour pod is running:
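The locating step can be sketched like this (the namespace and label are assumptions about this deployment, not taken from the issue):

```shell
# Locate the node running the Contour pod. With "-o wide", kubectl adds
# an IP and a NODE column; NODE is the 7th field of the output.
# Namespace and label selector are assumptions -- adjust for your setup.
kubectl -n heptio-contour get pods -l app=contour -o wide | awk 'NR > 1 { print $7 }'
```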
Then we SSH to the node (`runway-uks-ncsa-east-ne1-worker8`) and find Contour's Docker container id. We use `netstat` to list open connections within that container's network namespace, via `nsenter`. We are looking for local address `0.0.0.0:8001`, and we see only one open connection. Already we know there's a problem, because we would expect N open connections for N running Envoy pods (N=16 in our environment).
The foreign address `192.168.5.0:51586` identifies the node within the overlay network. To find it, we check the `spec.podCIDR` attribute of every node and look for `192.168.5.0/24`. Now we know that the only connected Envoy pod is the one running on `runway-uks-ncsa-east-ne1-etcd4`.

We confirm this from the other side by SSHing to that node. First, we find the Docker container for Envoy.
Knowing the container id, we resolve its PID and then list open connections with `netstat`, via `nsenter`. The destination address is `192.168.138.21:8001`, which matches the cluster service IP. This connection is legit: it is `ESTABLISHED` on both sides, and the addresses are those we expect.

Next, we try another node to see whether it also thinks it has an open connection from Envoy to Contour. For example, `runway-uks-ncsa-east-ne1-worker1`.
We run `netstat` via `nsenter`, looking for port `:8001`. Indeed, this side thinks it is connected to `192.168.138.21:8001`, but there is no sign of that connection on the other side. Envoy thinks it has a live connection, but that's not true.

At this point, we have identified a half-open (or is it half-closed?) TCP connection, which is not discoverable from this side of the connection without a keep-alive strategy.
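One keep-alive strategy can be expressed in the Envoy bootstrap itself: the `contour` cluster shown earlier could enable TCP keepalive via `upstream_connection_options`, so the kernel detects the dead peer and Envoy re-establishes the xDS connection. A sketch of the cluster fragment (the values are illustrative, not a tested recommendation):

```json
"upstream_connection_options": {
  "tcp_keepalive": {
    "keepalive_probes": 3,
    "keepalive_time": 30,
    "keepalive_interval": 5
  }
}
```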
This check can be shortened to a one-liner and tried on each of the 16 nodes. We find that all 16 of them claim to be connected to Contour, but only one actually is, because Contour sees only one `ESTABLISHED` connection. Since Envoy is the client, there is no way for Contour to force it to re-establish the TCP connection.
I've seen similar issues in #1514, which I'm actively working on now, but I'm interested to see that you're running v0.13 and still hitting this, which makes me think it's not the same issue.
I’ll post back with the resolution to my issue, possibly that will resolve your problems as well.