cloudflared: Many http/2 tunnels on different servers are getting repeatedly disconnected `Connection terminated error="connection with edge closed"`

Describe the bug

We’re a software consultancy that uses many different hosting providers and platforms for a wide variety of client projects (AWS, DigitalOcean, Vultr, etc.). We use cloudflared tunnels for ingress for many of our apps, with probably 30+ tunnels running at any given time on servers around the world.

On March 9th, 2023, all of our Cloudflare tunnels disconnected around the same time. According to the Cloudflare admin dashboard audit log (see screenshot below), all of the tunnels were de-registered due to “origin server went away” or “opened elsewhere”, and then re-registered. Most came back up on their own after the outage, but a few never reconnected until we SSH’d in and restarted the cloudflared container. They all showed lots of this error in the console:

Connection terminated error="connection with edge closed" (which may be a red herring or may be related; there are so many of these lines that it’s hard to tell whether they’re a cause or a symptom)

No one had changed anything in our Cloudflare config on March 9th (or even logged into the admin dash prior to the incident), and no one had logged into the servers affected via SSH either, so we’re 99% sure we didn’t do anything to trigger this. It also occurred across multiple hosting providers, so we don’t think it was an outage on DigitalOcean or any one of our VPS providers.

…

On March 15th (edit: and again on March 16th), one of these tunnels randomly started experiencing this error again, causing a critical service to go down for several hours. Nothing has changed on this server during this time, and restarting it fixed it with seemingly no underlying cause and no other applications on the server affected. The server has plenty of memory, cpu, and bandwidth, and none of the other things on the server experienced packet loss when the tunnel went down.

We’re worried this will keep happening and we’ll have to move off Cloudflare entirely, but would really like to avoid that as tunnels are quite nice for one-line docker-based deployment without needing config files/admin dash setup/persistent state/etc.

To Reproduce

Steps to reproduce the behavior:

1. Run cloudflare/cloudflared:2023.3.1 in docker on a DigitalOcean VPS
2. Wait ?
3. Tunnel randomly disconnects; restarting the tunnel container fixes it instantly

If it’s an issue with Cloudflare Tunnel:

4. Tunnel ID: 2b2bd7dc-633f-4fae-b385-24423f0fbb9d
5. cloudflared config:

argo:
    image: cloudflare/cloudflared:
    command: tunnel --no-autoupdate --overwrite-dns --retries 15 --protocol http2 --url http://hedgedoc:3000 --hostname docs.zervice.io --name docs.zervice.io
    volumes:
      # cert.pem in this folder from https://dash.cloudflare.com/argotunnel
      - ./etc/cloudflared:/etc/cloudflared

Environment and versions

OS: Ubuntu 22.04.2
Architecture: amd64
Version: 2023.3.1

Logs and errors

argo_1      | 2023-03-16T05:08:01Z INF Connection 94991c9c-8a41-497e-897a-360734fd564c registered with protocol: http2 connIndex=0 ip=198.41.192.67 location=EWR
argo_1      | 2023-03-16T05:12:07Z INF Lost connection with the edge connIndex=0
argo_1      | 2023-03-16T05:12:07Z WRN Serve tunnel error error="connection with edge closed" connIndex=0 ip=198.41.192.67
argo_1      | 2023-03-16T05:12:07Z INF Unregistered tunnel connection connIndex=0
argo_1      | 2023-03-16T05:12:07Z INF Retrying connection in up to 1s connIndex=0 ip=198.41.192.67
argo_1      | 2023-03-16T05:12:08Z WRN Connection terminated error="connection with edge closed" connIndex=0
argo_1      | 2023-03-16T05:12:09Z INF Connection 3163ea84-9031-4033-8e65-a00e72f17759 registered with protocol: http2 connIndex=0 ip=198.41.192.67 location=EWR
argo_1      | 2023-03-16T05:14:03Z INF Unregistered tunnel connection connIndex=0
argo_1      | 2023-03-16T05:14:03Z INF Lost connection with the edge connIndex=0
argo_1      | 2023-03-16T05:14:03Z WRN Serve tunnel error error="connection with edge closed" connIndex=0 ip=198.41.192.67
argo_1      | 2023-03-16T05:14:03Z INF Retrying connection in up to 1s connIndex=0 ip=198.41.192.67
argo_1      | 2023-03-16T05:14:04Z WRN Connection terminated error="connection with edge closed" connIndex=0
argo_1      | 2023-03-16T05:14:12Z INF Connection 914121fc-04f4-4634-922e-3c464328c9dd registered with protocol: http2 connIndex=0 ip=198.41.192.67 location=EWR
argo_1      | 2023-03-16T05:14:34Z INF Lost connection with the edge connIndex=0
argo_1      | 2023-03-16T05:14:34Z WRN Serve tunnel error error="connection with edge closed" connIndex=0 ip=198.41.192.67
argo_1      | 2023-03-16T05:14:34Z INF Retrying connection in up to 1s connIndex=0 ip=198.41.192.67
argo_1      | 2023-03-16T05:14:34Z INF Unregistered tunnel connection connIndex=0
argo_1      | 2023-03-16T05:14:35Z WRN Connection terminated error="connection with edge closed" connIndex=0
argo_1      | 2023-03-16T05:14:44Z INF Connection 17872c64-60d8-43dc-bdfd-8e2b54a1034a registered with protocol: http2 connIndex=0 ip=198.41.192.67 location=EWR

About this issue

State: closed
Created a year ago
Reactions: 2
Comments: 44 (11 by maintainers)

Most upvoted comments

I have had similar experiences. Cloudflare Tunnel has been getting worse and worse, and I don’t know what I could do about it other than migrate away from it. Today it has been a nightmare.

I have 5 servers with tunnels. Only 2 of them are currently dropping connections. One of them is dropping constantly and the other “only” every 15 minutes or so.

Server “A” tunnels the service www.example.com:80 (name changed), which is constantly losing its connection. I tried moving the tunnel for this service to server “B” (on the same network, sending HTTP requests to server A). After that, the tunnel on B started to drop connections and the tunnel on A was fine. When I switched back, A started to drop connections again and B was fine. So the issue follows the service/domain, not the server.

I tried pinging some Cloudflare IP addresses and got 0% packet loss. I also can’t find any other problem on my servers; it is just cloudflared that is dropping connections, with "Lost connection with the edge" and 'Connection terminated error="connection with edge closed"'.

If this is just my server/service configuration, is there any info on what I should do to make cloudflared work better?

This is what happens all the time:

2023-03-16T14:51:58Z ERR Connection terminated error="connection with edge closed" connIndex=2
2023-03-16T14:54:35Z INF Connection dc1d923f-e423-4dd9-8980-458918aad271 registered with protocol: http2 connIndex=1 ip=198.41.200.63 location=HEL
2023-03-16T14:54:35Z INF Connection 0bc9e2bf-b6c1-4c3f-afb0-591089abfac9 registered with protocol: http2 connIndex=0 ip=198.41.192.27 location=FRA
2023-03-16T14:54:35Z INF Connection f78b3bf6-3ed2-4e42-8f8f-a1d26bc97e4e registered with protocol: http2 connIndex=2 ip=198.41.192.167 location=FRA
2023-03-16T14:54:35Z INF Connection 8b7e659d-9825-4055-acf5-b9c77d9f4600 registered with protocol: http2 connIndex=3 ip=198.41.200.113 location=HEL
2023-03-16T14:54:45Z INF Lost connection with the edge connIndex=3
2023-03-16T14:54:45Z WRN Serve tunnel error error="connection with edge closed" connIndex=3 ip=198.41.200.113
2023-03-16T14:54:45Z INF Retrying connection in up to 1s connIndex=3 ip=198.41.200.113
2023-03-16T14:54:45Z INF Unregistered tunnel connection connIndex=3
2023-03-16T14:54:46Z WRN Connection terminated error="connection with edge closed" connIndex=3
2023-03-16T14:55:37Z ERR  error="Incoming request ended abruptly: context canceled" cfRay=7a8dd57ee81a0a44-ARN ingressRule=0 originService=http://www.example.com:80
2023-03-16T14:55:37Z ERR failed to serve incoming request error="Failed to proxy HTTP: Incoming request ended abruptly: context canceled"
2023-03-16T14:55:40Z ERR  error="Incoming request ended abruptly: context canceled" cfRay=7a8dd5eb8f94c7da-TLL ingressRule=0 originService=http://www.example.com:80

Here you can see how often this happens:

2023-03-16T14:00:17Z INF Lost connection with the edge connIndex=1
2023-03-16T14:00:41Z INF Lost connection with the edge connIndex=0
2023-03-16T14:01:08Z INF Lost connection with the edge connIndex=2
2023-03-16T14:02:13Z INF Lost connection with the edge connIndex=0
2023-03-16T14:02:14Z INF Lost connection with the edge connIndex=1
2023-03-16T14:07:06Z INF Lost connection with the edge connIndex=3
2023-03-16T14:08:09Z INF Lost connection with the edge connIndex=2
2023-03-16T14:08:18Z INF Lost connection with the edge connIndex=1
2023-03-16T14:08:48Z INF Lost connection with the edge connIndex=0
2023-03-16T14:08:56Z INF Lost connection with the edge connIndex=3
2023-03-16T14:09:25Z INF Lost connection with the edge connIndex=2
2023-03-16T14:10:32Z INF Lost connection with the edge connIndex=1
2023-03-16T14:10:36Z INF Lost connection with the edge connIndex=0
2023-03-16T14:11:27Z INF Lost connection with the edge connIndex=1
2023-03-16T14:11:42Z INF Lost connection with the edge connIndex=0
2023-03-16T14:12:15Z INF Lost connection with the edge connIndex=3
2023-03-16T14:12:44Z INF Lost connection with the edge connIndex=2
2023-03-16T14:15:47Z INF Lost connection with the edge connIndex=3
2023-03-16T14:17:08Z INF Lost connection with the edge connIndex=3
2023-03-16T14:17:46Z INF Lost connection with the edge connIndex=2
2023-03-16T14:17:52Z INF Lost connection with the edge connIndex=0
2023-03-16T14:18:13Z INF Lost connection with the edge connIndex=3
2023-03-16T14:19:22Z INF Lost connection with the edge connIndex=3
2023-03-16T14:20:29Z INF Lost connection with the edge connIndex=2
2023-03-16T14:20:35Z INF Lost connection with the edge connIndex=0
2023-03-16T14:21:40Z INF Lost connection with the edge connIndex=1
2023-03-16T14:22:38Z INF Lost connection with the edge connIndex=0
2023-03-16T14:23:10Z INF Lost connection with the edge connIndex=2
2023-03-16T14:26:10Z INF Lost connection with the edge connIndex=3
2023-03-16T14:27:12Z INF Lost connection with the edge connIndex=2
2023-03-16T14:29:34Z INF Lost connection with the edge connIndex=1
2023-03-16T14:30:12Z INF Lost connection with the edge connIndex=0
...
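To put a number on "how often", a log excerpt like the one above can be summarized per connection. The following is only a rough sketch: the function name `summarize_disconnects` and the `SAMPLE` constant are made up for illustration, and it assumes the log line format shown in this thread (UTC timestamp, `INF Lost connection with the edge connIndex=N`).

```python
import re
from collections import Counter
from datetime import datetime

# Matches the "Lost connection" lines as they appear in this thread.
# A prefix such as "argo_1      | " from docker-compose is tolerated.
LOST = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z INF "
    r"Lost connection with the edge connIndex=(?P<idx>\d+)"
)


def summarize_disconnects(log_text):
    """Return (disconnect count per connIndex, mean seconds between disconnects).

    The mean gap is None when fewer than two disconnects are found.
    """
    events = [
        (datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S"), int(m["idx"]))
        for m in LOST.finditer(log_text)
    ]
    events.sort()
    per_conn = Counter(idx for _, idx in events)
    gaps = [(b[0] - a[0]).total_seconds() for a, b in zip(events, events[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else None
    return per_conn, mean_gap


# First four lines of the excerpt above, used as a hypothetical sample.
SAMPLE = """\
2023-03-16T14:00:17Z INF Lost connection with the edge connIndex=1
2023-03-16T14:00:41Z INF Lost connection with the edge connIndex=0
2023-03-16T14:01:08Z INF Lost connection with the edge connIndex=2
2023-03-16T14:02:13Z INF Lost connection with the edge connIndex=0
"""

if __name__ == "__main__":
    counts, mean_gap = summarize_disconnects(SAMPLE)
    print(dict(counts), mean_gap)
```

Feeding it the full cloudflared output (for example, saved from `docker logs`) would show whether one connIndex is dropping more than the others or whether all of them are affected equally.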

terabitti on Mar 16, 2023

Hi @pirate.

Sorry for the delay; we had other priorities and were not able to release the change sooner. The change we made to try to mitigate this was released today, and we see a decrease in disconnects when using http2.

Can you validate from your side whether it has improved?

Thanks.

joliveirinha on Apr 19, 2023

Hi @pirate. Thanks for the logs. I spent some time reviewing the logs last Friday, but I couldn’t find a particular reason for the disconnects.

We can indeed validate that the only Goaway frame processed is from our edge to cloudflared and it is sent without an “error”. This is the close of the connection we trigger because we detected that the connection was not able to process any requests anymore.

Looking deep into the golang http2 code, we have a possible root cause that could be triggering this. However, we will only be able to start an edge release next week and validate whether this fixes the problem.

Once we validate this, we will post the next steps here.

Again, thanks for the logs. It was helpful to focus in one direction.

joliveirinha on Mar 29, 2023

Confirming this issue is persisting for us today as well. Our tunnels are dropping constantly today, and the only thing that works is restarting cloudflared in a loop every hour.

pirate on Mar 16, 2023

Additionally, @pirate, if you are able to reproduce this in low-traffic tunnels, could you enable debug logging in cloudflared as well as http2 debugging to help us identify what is causing your connection to fail?

We have already validated on our side that these connections go into a bad state, and that is why they are closed.

So, if you could set up a tunnel where you can reproduce this and give us the logs, that would be ideal.

The command to run would be something like this:

GODEBUG=http2debug=2 cloudflared tunnel --loglevel debug --ha-connections 1 --protocol http2 --url localhost:8090

Again, ideally you would do this in a low-traffic tunnel. Thanks.

joliveirinha on Mar 23, 2023

We are also experiencing this with docker-based tunnels running http2. We are on 2023.3.1. It started occurring on 17 March, when we were on 2023.1.0.

jonseymour on Mar 20, 2023

Hi, I have 2 tunnels running version 2023.2.2 with the default protocol (auto). Two days ago, I had to change the protocol to quic for one tunnel. After monitoring for 2 days, I see unstable connections to the tunnels. This is the monitoring for the tunnel I changed to quic:

Screenshot 2023-03-23 at 10 29 27

This is the monitoring for the tunnel with the protocol set to auto:

Screenshot 2023-03-23 at 10 29 59

Any update on this problem? Thanks all

duyhenryer on Mar 23, 2023

I have also found relief by upgrading to QUIC.

One tip for doing this in AWS infrastructure is that the AWS NAT gateways appear to rewrite the source address of inbound QUIC packets so that they appear to come from the NAT gateway rather than the Cloudflare edge servers. As a result, I had to add additional ACL ingress rules to cope with this unexpected behavior. I have no idea why the NAT gateways do this, but it seems they do (our tunnel servers are in a different subnet from the NAT gateway).

jonseymour on Mar 21, 2023

QUIC is much more stable for me than HTTP2; the QUIC tunnels only go down every few days. But HTTP2 is much faster in terms of latency and bandwidth (between 5-10x), despite failing almost daily. This is what an HTTP2 tunnel failure looks like:

2023-03-20T13:29:17Z WRN Connection terminated error="DialContext error: dial tcp 198.41.200.13:7844: i/o timeout" connIndex=2
2023-03-20T13:29:34Z WRN Connection terminated error="DialContext error: dial tcp 198.41.192.227:7844: i/o timeout" connIndex=0
2023-03-20T13:29:40Z WRN Connection terminated error="DialContext error: dial tcp 198.41.192.167:7844: i/o timeout" connIndex=3
2023-03-20T13:29:59Z INF Lost connection with the edge connIndex=1
2023-03-20T13:29:59Z INF Unregistered tunnel connection connIndex=1
2023-03-20T13:29:59Z WRN Serve tunnel error error="connection with edge closed" connIndex=1 ip=198.41.200.43
2023-03-20T13:29:59Z INF Retrying connection in up to 1s connIndex=1 ip=198.41.200.43
2023-03-20T13:30:00Z ERR Connection terminated error="connection with edge closed" connIndex=1
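The excerpt above mixes two different messages: `DialContext error: dial tcp ...:7844: i/o timeout`, logged while trying to open a new connection to the edge, and `connection with edge closed`, logged for a connection that had been registered. Telling them apart in a long log may help narrow down where the problem sits. Below is a small sketch that tallies them; the category labels (`dial_timeout`, `edge_closed`, `other`) and the names `classify`, `tally_terminations`, and `SAMPLE` are invented for illustration and are not cloudflared terminology.

```python
import re
from collections import Counter

# Captures the quoted error text from "Connection terminated" lines,
# regardless of whether they are logged at WRN or ERR level.
TERMINATED = re.compile(r'Connection terminated error="(?P<err>[^"]*)"')


def classify(err):
    """Map an error string to a hypothetical category label."""
    if "i/o timeout" in err:
        return "dial_timeout"   # could not reach the edge address in time
    if err == "connection with edge closed":
        return "edge_closed"    # an existing connection was closed
    return "other"


def tally_terminations(log_text):
    """Count 'Connection terminated' lines per category."""
    return Counter(classify(m["err"]) for m in TERMINATED.finditer(log_text))


# The four "Connection terminated" lines from the excerpt above.
SAMPLE = """\
2023-03-20T13:29:17Z WRN Connection terminated error="DialContext error: dial tcp 198.41.200.13:7844: i/o timeout" connIndex=2
2023-03-20T13:29:34Z WRN Connection terminated error="DialContext error: dial tcp 198.41.192.227:7844: i/o timeout" connIndex=0
2023-03-20T13:29:40Z WRN Connection terminated error="DialContext error: dial tcp 198.41.192.167:7844: i/o timeout" connIndex=3
2023-03-20T13:30:00Z ERR Connection terminated error="connection with edge closed" connIndex=1
"""

if __name__ == "__main__":
    print(dict(tally_terminations(SAMPLE)))
```

A log dominated by dial timeouts may point more toward reachability of port 7844 from the host, while a log dominated by edge closes matches the pattern in the original report; either reading is a guess without the debug logs the maintainers asked for.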

realies on Mar 20, 2023

Before switching to QUIC I was getting "Lost connection with the edge" about every 2 minutes. After switching to QUIC I have seen "timeout: no recent network activity" only a few times in 6-7 hours. That’s a huge change, and I am not worried about those infrequent errors.

I still have one http2 tunnel (not migrated to QUIC yet) throwing "Lost connection with the edge" about 10 times per hour (increased from yesterday).

It wasn’t very straightforward to get QUIC working on our end (had to change networks/gateways), but fortunately it was possible. I have to migrate other servers to this too.

terabitti on Mar 17, 2023