envoy: Missed upstream disconnect leading to 503 UC

Hi,

I’ve been chasing an issue that is very difficult to reproduce. I’m running envoy v1.8.0 with an unencrypted HTTP/1.1 local upstream, an HTTP/2 downstream, and the HTTP connection manager. The cluster configuration is

name: "/websso"
connect_timeout {
  seconds: 120
}
hosts {
  socket_address {
    address: "127.0.0.1"
    port_value: 7080
  }
}

and the route match rule is

{"match":{"path":"/websso"}, "route: {"cluster":"/websso","timeout":"28800s","use_websocket":true,"idle_timeout":"28800s"}}}

My issue is that from time to time I get these "503 UC"s, e.g.

2019-03-06T11:02:11.968Z GET /websso HTTP/2 200 - 0 433415 6 5 0 10.197.41.4:57937 10.197.41.64:443 127.0.0.1:60382 127.0.0.1:7080
2019-03-06T11:07:34.910Z GET /websso HTTP/2 503 UC 0 57 0 - - 10.197.41.4:58440 10.197.41.64:443 127.0.0.1:60382 127.0.0.1:7080

The packet capture has this to say for request/response 1.

11:02:11.968219    127.0.0.1             60382    127.0.0.1             7080      TCP      68     60382 → 7080 [SYN] Seq=0 Win=43690 Len=0 MSS=65495 SACK_PERM=1 WS=256
11:02:11.968225    127.0.0.1             7080     127.0.0.1             60382     TCP      68     7080 → 60382 [SYN, ACK] Seq=0 Ack=1 Win=43690 Len=0 MSS=65495 SACK_PERM=1 WS=256
11:02:11.968230    127.0.0.1             60382    127.0.0.1             7080      TCP      56     60382 → 7080 [ACK] Seq=1 Ack=1 Win=43776 Len=0
11:02:11.969252    127.0.0.1             60382    127.0.0.1             7080      TCP      1894   60382 → 7080 [PSH, ACK] Seq=1 Ack=1 Win=43776 Len=1838
11:02:11.969256    127.0.0.1             7080     127.0.0.1             60382     TCP      56     7080 → 60382 [ACK] Seq=1 Ack=1839 Win=174848 Len=0
11:02:11.972626    127.0.0.1             7080     127.0.0.1             60382     TCP      8248   7080 → 60382 [PSH, ACK] Seq=1 Ack=1839 Win=174848 Len=8192
11:02:11.972638    127.0.0.1             60382    127.0.0.1             7080      TCP      56     60382 → 7080 [ACK] Seq=1839 Ack=8193 Win=174848 Len=0
...
11:02:11.973868    127.0.0.1             7080     127.0.0.1             60382     TCP      40548  7080 → 60382 [PSH, ACK] Seq=393094 Ack=1839 Win=174848 Len=40492
11:02:11.973886    127.0.0.1             60382    127.0.0.1             7080      TCP      56     60382 → 7080 [ACK] Seq=1839 Ack=433586 Win=320512 Len=0
11:03:12.094243    127.0.0.1             7080     127.0.0.1             60382     TCP      56     7080 → 60382 [FIN, ACK] Seq=433586 Ack=1839 Win=174848 Len=0
11:03:12.132002    127.0.0.1             60382    127.0.0.1             7080      TCP      56     60382 → 7080 [ACK] Seq=1839 Ack=433587 Win=464896 Len=0

and the following for request/response 2.

11:07:34.910503    127.0.0.1             60382    127.0.0.1             7080      TCP      1933   60382 → 7080 [PSH, ACK] Seq=1 Ack=1 Win=1816 Len=1877
11:07:34.910519    127.0.0.1             7080     127.0.0.1             60382     TCP      56     7080 → 60382 [RST] Seq=1 Win=0 Len=0

It seems that envoy never replied to the upstream’s FIN with a FIN of its own; instead it kept the connection in the pool and tried to reuse it more than four minutes later, which ended badly.

The envoy log contains the following for that connection.

2019-03-06T11:02:11.968Z debug envoy[37657] [Originator@6876 sub=client] source/common/http/codec_client.cc:25 [C715] connecting
2019-03-06T11:02:11.968Z debug envoy[37657] [Originator@6876 sub=connection] source/common/network/connection_impl.cc:632 [C715] connecting to 127.0.0.1:7080
2019-03-06T11:02:11.968Z debug envoy[37657] [Originator@6876 sub=connection] source/common/network/connection_impl.cc:641 [C715] connection in progress
2019-03-06T11:02:11.969Z debug envoy[37657] [Originator@6876 sub=connection] source/common/network/connection_impl.cc:514 [C715] connected
2019-03-06T11:02:11.969Z debug envoy[37657] [Originator@6876 sub=client] source/common/http/codec_client.cc:63 [C715] connected
2019-03-06T11:02:11.969Z debug envoy[37657] [Originator@6876 sub=pool] source/common/http/http1/conn_pool.cc:252 [C715] attaching to next request
2019-03-06T11:02:11.974Z debug envoy[37657] [Originator@6876 sub=client] source/common/http/codec_client.cc:94 [C715] response complete
2019-03-06T11:02:11.974Z debug envoy[37657] [Originator@6876 sub=pool] source/common/http/http1/conn_pool.cc:209 [C715] response complete
2019-03-06T11:02:11.974Z debug envoy[37657] [Originator@6876 sub=pool] source/common/http/http1/conn_pool.cc:247 [C715] moving to ready
2019-03-06T11:07:34.910Z debug envoy[37657] [Originator@6876 sub=pool] source/common/http/http1/conn_pool.cc:90 [C715] using existing connection
2019-03-06T11:07:34.910Z debug envoy[37657] [Originator@6876 sub=connection] source/common/network/connection_impl.cc:499 [C715] remote close
2019-03-06T11:07:34.910Z debug envoy[37657] [Originator@6876 sub=connection] source/common/network/connection_impl.cc:181 [C715] closing socket: 0
2019-03-06T11:07:34.910Z debug envoy[37657] [Originator@6876 sub=client] source/common/http/codec_client.cc:81 [C715] disconnect. resetting 1 pending requests
2019-03-06T11:07:34.910Z debug envoy[37657] [Originator@6876 sub=client] source/common/http/codec_client.cc:104 [C715] request reset
2019-03-06T11:07:34.910Z debug envoy[37657] [Originator@6876 sub=pool] source/common/http/http1/conn_pool.cc:123 [C715] client disconnected

A good connection that happened around that time looks like

2019-03-06T11:02:11.967Z debug envoy[37657] [Originator@6876 sub=client] source/common/http/codec_client.cc:25 [C714] connecting
2019-03-06T11:02:11.967Z debug envoy[37657] [Originator@6876 sub=connection] source/common/network/connection_impl.cc:632 [C714] connecting to 127.0.0.1:7080
2019-03-06T11:02:11.967Z debug envoy[37657] [Originator@6876 sub=connection] source/common/network/connection_impl.cc:641 [C714] connection in progress
2019-03-06T11:02:11.969Z debug envoy[37657] [Originator@6876 sub=connection] source/common/network/connection_impl.cc:514 [C714] connected
2019-03-06T11:02:11.969Z debug envoy[37657] [Originator@6876 sub=client] source/common/http/codec_client.cc:63 [C714] connected
2019-03-06T11:02:11.969Z debug envoy[37657] [Originator@6876 sub=pool] source/common/http/http1/conn_pool.cc:252 [C714] attaching to next request
2019-03-06T11:02:11.973Z debug envoy[37657] [Originator@6876 sub=client] source/common/http/codec_client.cc:94 [C714] response complete
2019-03-06T11:02:11.973Z debug envoy[37657] [Originator@6876 sub=pool] source/common/http/http1/conn_pool.cc:209 [C714] response complete
2019-03-06T11:02:11.973Z debug envoy[37657] [Originator@6876 sub=pool] source/common/http/http1/conn_pool.cc:247 [C714] moving to ready
2019-03-06T11:03:12.758Z debug envoy[37657] [Originator@6876 sub=connection] source/common/network/connection_impl.cc:499 [C714] remote close
2019-03-06T11:03:12.758Z debug envoy[37657] [Originator@6876 sub=connection] source/common/network/connection_impl.cc:181 [C714] closing socket: 0
2019-03-06T11:03:12.758Z debug envoy[37657] [Originator@6876 sub=client] source/common/http/codec_client.cc:81 [C714] disconnect. resetting 0 pending requests
2019-03-06T11:03:12.758Z debug envoy[37657] [Originator@6876 sub=pool] source/common/http/http1/conn_pool.cc:123 [C714] client disconnected

The upstream server has a one-minute idle timeout, and it seems that for C714 and many other connections envoy properly received and processed the disconnect, but not for C715.
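
As a possible workaround I’m considering having envoy drop pooled upstream connections before the server’s one-minute idle timeout fires. If I’m reading the docs right, newer envoy versions expose a cluster-level common_http_protocol_options.idle_timeout for exactly this; here is a sketch with an assumed 45-second value (anything below the server’s 60 seconds):

name: "/websso"
connect_timeout {
  seconds: 120
}
common_http_protocol_options {
  idle_timeout {
    seconds: 45
  }
}
hosts {
  socket_address {
    address: "127.0.0.1"
    port_value: 7080
  }
}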

Have you heard of the above issue before, and is there any remedy for it? Is it possible to retry once for that particular case only, and if so, how?
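
To be concrete about the retry question, this is the kind of per-route retry policy I had in mind; just a sketch, assuming retry_on "reset" is the policy that covers an upstream that disconnects before sending any part of the response:

{"match":{"path":"/websso"},"route":{"cluster":"/websso","timeout":"28800s","use_websocket":true,"idle_timeout":"28800s","retry_policy":{"retry_on":"reset","num_retries":1}}}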

Most upvoted comments

Does anyone know some of the details surrounding what causes this issue to trigger?

I have an Istio environment with 200+ services using identical foundational setups, and this bug is only hitting a few of them, one more than the others (which, if it’s relevant, happens to serve slower responses, something like 30+ seconds). These are Node services with a 5-second keep-alive, which I assume could cause it to hit a little more frequently… but again, this is only really hitting a few services out of the many.

Interestingly, this also doesn’t seem to get caught by 503 retries… using either retriable-status-codes with 503 or 5xx.

I was struggling with half-open connections: my Node app would send a TCP FIN, but envoy would still proxy the next GET request over that connection, which would then trigger the Node app to send back a TCP RST.

This was not fixed for me in 1.10, but I just pulled the latest envoy-alpine-dev and the issue has gone away, hopefully because envoy now removes the connection from the connection pool after receiving the FIN from the Node.js app.

From other threads it seems like I need to make my Node app (Koa plus Node v12) send the HTTP header “Connection: close” so that it behaves better; I’ll have to look into that more.

If you’re hitting this and can run on the latest dev code, you might want to give this a try.
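
For anyone who can’t move to the dev build yet, this is the kind of Node-side workaround I’m planning to experiment with; just a sketch, the values are illustrative and untested:

// Koa app behind envoy: either stop advertising keep-alive entirely,
// or keep idle connections open longer than envoy will pool them, so
// that Node is never the side that closes first.
const Koa = require('koa');
const app = new Koa();

// Option A: tell the peer (envoy) not to reuse the connection.
app.use(async (ctx, next) => {
  await next();
  ctx.set('Connection', 'close');
});

app.use(async (ctx) => {
  ctx.body = 'ok';
});

const server = app.listen(7080);

// Option B: raise Node's keep-alive timeout well above whatever idle
// window envoy uses for pooled connections (Node 12 defaults to 5000 ms).
server.keepAliveTimeout = 120 * 1000;
server.headersTimeout = 125 * 1000; // should stay above keepAliveTimeout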