cloudflared: Many http/2 tunnels on different servers are getting repeatedly disconnected `Connection terminated error="connection with edge closed"`
Describe the bug
We’re a software consultancy that uses many different hosting providers and platforms for a wide variety of client projects (AWS, DigitalOcean, Vultr, etc.). We use cloudflared tunnels for ingress for many of our apps, with probably 30+ tunnels running at any given time on servers around the world.
On March 9th, 2023 all of our Cloudflare tunnels disconnected around the same time. According to the Cloudflare admin dashboard audit log (see screenshot below), all of the tunnels were de-registered due to “origin server went away” or “opened elsewhere”, and then re-registered. Most came up successfully after the outage on their own, but a few never successfully reconnected until we SSH’d in to restart the cloudflared container. They all showed lots of this error in console:
Connection terminated error="connection with edge closed"
(This may be a red herring or may be related; there are so many of these lines that it's hard to tell whether they're a cause or a symptom.)
No one had changed anything in our Cloudflare config on March 9th (or even logged into the admin dash prior to the incident), and no one had logged into the servers affected via SSH either, so we’re 99% sure we didn’t do anything to trigger this. It also occurred across multiple hosting providers, so we don’t think it was an outage on DigitalOcean or any one of our VPS providers.
…
On March 15th (edit: and again on March 16th), one of these tunnels randomly started experiencing this error again, causing a critical service to go down for several hours. Nothing has changed on this server during this time, and restarting it fixed it with seemingly no underlying cause and no other applications on the server affected. The server has plenty of memory, cpu, and bandwidth, and none of the other things on the server experienced packet loss when the tunnel went down.
We’re worried this will keep happening and we’ll have to move off Cloudflare entirely, but would really like to avoid that as tunnels are quite nice for one-line docker-based deployment without needing config files/admin dash setup/persistent state/etc.
To Reproduce
Steps to reproduce the behavior:
- Run `cloudflare/cloudflared:2023.3.1` in docker on a DigitalOcean VPS
- Wait ?
- Tunnel randomly disconnects; restarting the tunnel container fixes it instantly
If it’s an issue with Cloudflare Tunnel:
4. Tunnel ID: 2b2bd7dc-633f-4fae-b385-24423f0fbb9d
5. cloudflared config:
argo:
  image: cloudflare/cloudflared:
  command: tunnel --no-autoupdate --overwrite-dns --retries 15 --protocol http2 --url http://hedgedoc:3000 --hostname docs.zervice.io --name docs.zervice.io
  volumes:
    # cert.pem in this folder from https://dash.cloudflare.com/argotunnel
    - ./etc/cloudflared:/etc/cloudflared
Environment and versions
- OS: Ubuntu 22.04.2
- Architecture: amd64
- Version: 2023.3.1
Logs and errors
argo_1 | 2023-03-16T05:08:01Z INF Connection 94991c9c-8a41-497e-897a-360734fd564c registered with protocol: http2 connIndex=0 ip=198.41.192.67 location=EWR
argo_1 | 2023-03-16T05:12:07Z INF Lost connection with the edge connIndex=0
argo_1 | 2023-03-16T05:12:07Z WRN Serve tunnel error error="connection with edge closed" connIndex=0 ip=198.41.192.67
argo_1 | 2023-03-16T05:12:07Z INF Unregistered tunnel connection connIndex=0
argo_1 | 2023-03-16T05:12:07Z INF Retrying connection in up to 1s connIndex=0 ip=198.41.192.67
argo_1 | 2023-03-16T05:12:08Z WRN Connection terminated error="connection with edge closed" connIndex=0
argo_1 | 2023-03-16T05:12:09Z INF Connection 3163ea84-9031-4033-8e65-a00e72f17759 registered with protocol: http2 connIndex=0 ip=198.41.192.67 location=EWR
argo_1 | 2023-03-16T05:14:03Z INF Unregistered tunnel connection connIndex=0
argo_1 | 2023-03-16T05:14:03Z INF Lost connection with the edge connIndex=0
argo_1 | 2023-03-16T05:14:03Z WRN Serve tunnel error error="connection with edge closed" connIndex=0 ip=198.41.192.67
argo_1 | 2023-03-16T05:14:03Z INF Retrying connection in up to 1s connIndex=0 ip=198.41.192.67
argo_1 | 2023-03-16T05:14:04Z WRN Connection terminated error="connection with edge closed" connIndex=0
argo_1 | 2023-03-16T05:14:12Z INF Connection 914121fc-04f4-4634-922e-3c464328c9dd registered with protocol: http2 connIndex=0 ip=198.41.192.67 location=EWR
argo_1 | 2023-03-16T05:14:34Z INF Lost connection with the edge connIndex=0
argo_1 | 2023-03-16T05:14:34Z WRN Serve tunnel error error="connection with edge closed" connIndex=0 ip=198.41.192.67
argo_1 | 2023-03-16T05:14:34Z INF Retrying connection in up to 1s connIndex=0 ip=198.41.192.67
argo_1 | 2023-03-16T05:14:34Z INF Unregistered tunnel connection connIndex=0
argo_1 | 2023-03-16T05:14:35Z WRN Connection terminated error="connection with edge closed" connIndex=0
argo_1 | 2023-03-16T05:14:44Z INF Connection 17872c64-60d8-43dc-bdfd-8e2b54a1034a registered with protocol: http2 connIndex=0 ip=198.41.192.67 location=EWR
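To quantify how long each connection survives before the edge drops it, a minimal Python sketch follows. It assumes log lines in the docker-compose format shown above (a UTC timestamp, a three-letter level, a message, and a `connIndex=` field). The function name `connection_lifetimes` is made up for illustration and is not part of cloudflared.

```python
import re
from datetime import datetime

# Matches the timestamp, level, and message of lines such as:
# "argo_1 | 2023-03-16T05:08:01Z INF Connection ... registered with protocol: http2 connIndex=0 ..."
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)\s+(?P<level>[A-Z]{3})\s+(?P<msg>.*)"
)
CONN_INDEX_RE = re.compile(r"connIndex=(\d+)")


def connection_lifetimes(lines):
    """Return (connIndex, seconds_alive) for each registered -> lost pair."""
    registered_at = {}  # connIndex -> time of the latest registration
    lifetimes = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        msg = m.group("msg")
        idx = CONN_INDEX_RE.search(msg)
        if not idx:
            continue
        conn = int(idx.group(1))
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%SZ")
        if "registered with protocol" in msg:
            registered_at[conn] = ts
        elif msg.startswith("Lost connection with the edge") and conn in registered_at:
            alive = ts - registered_at.pop(conn)
            lifetimes.append((conn, int(alive.total_seconds())))
    return lifetimes
```

Fed the log excerpt above, this reports lifetimes of 246, 114, and 22 seconds for connIndex 0; the final registration has no matching loss yet, so it is not counted.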
About this issue
- State: closed
- Created a year ago
- Reactions: 2
- Comments: 44 (11 by maintainers)
I have had similar experiences. Cloudflare Tunnel has been getting worse and worse, and I don’t know what I could do about it other than migrate away from it. Today it has been a nightmare.
I have 5 servers with tunnels. Only 2 of them are dropping connections currently. One of them is dropping constantly and the other “only” every 15 minutes or so.
Server “A” is tunneling the service www.example.com:80 (name changed), which is constantly losing connection. I tried moving the tunneling of this service to server “B” (on the same network, sending HTTP requests to server A). After that, the tunnel on B started to drop connections and the tunnel on A was fine. Then I switched back, and A started to drop connections while B was fine. So the issue followed the service/domain.
I tried pinging some Cloudflare IP addresses and got 0% packet loss. I also can’t find any other problem on my servers; it is just cloudflared that is dropping connections, with “Lost connection with the edge” and ‘Connection terminated error=“connection with edge closed”’.
If this is just my server/service configuration, is there any info on what I should do to make cloudflared work better?
This is what happens all the time:
Here you can see how often this happens:
Hi @pirate.
Sorry for the delay; we had other priorities and were not able to release the change earlier. The change we made to try to mitigate this was released today, and we see a decrease in disconnects when using http2.
Can you validate from your side whether it improved?
Thanks.
Hi @pirate. Thanks for the logs. I spent some time reviewing the logs last Friday, but I couldn’t find a particular reason for the disconnects.
We can indeed validate that the only GOAWAY frame processed is from our edge to cloudflared, and it is sent without an “error”. This is the connection close that we trigger because we detected that the connection was no longer able to process any requests.
Looking deep into the golang http2 code, we have a possible root cause that could be triggering this. However, we will only be able to start an edge release next week and validate whether this fixes the problem.
Once we validate this, we will write here about next steps.
Again, thanks for the logs. They were helpful in focusing us in one direction.
Confirming this issue is persisting for us today as well. Our tunnels are dropping constantly, and the only thing that works is restarting cloudflared in a loop every hour.
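As a rough alternative to restarting on a fixed hourly loop, the sketch below only flags a restart when several disconnects cluster in a short window. The 10-minute window and the threshold of 3 are arbitrary assumptions, the function names are hypothetical, and the container name passed to `restart_container` is whatever your tunnel container happens to be called. It shells out to the standard `docker restart` CLI command and is a stopgap, not a fix for the underlying disconnects.

```python
import subprocess
from datetime import datetime, timedelta


def too_many_disconnects(disconnect_times, now,
                         window=timedelta(minutes=10), threshold=3):
    """True if at least `threshold` disconnects happened within `window` before `now`."""
    recent = [t for t in disconnect_times if timedelta(0) <= now - t <= window]
    return len(recent) >= threshold


def restart_container(name):
    """Restart a container through the docker CLI (raises if the command fails)."""
    subprocess.run(["docker", "restart", name], check=True)
```

A caller would collect the timestamps of "Lost connection with the edge" lines from the container logs, then call `restart_container(...)` only when `too_many_disconnects(...)` returns true.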
Additionally, @pirate, if you are able to reproduce this in low-traffic tunnels, could you enable debug logging in cloudflared as well as http2 debugging to help us identify what is causing your connection to fail?
We already validated on our side that these connections go into a bad state, and that is why they are closed.
So, if you could set up a tunnel where you can reproduce this and give us the logs, that would be ideal.
The command to run would be something like this:
GODEBUG=http2debug=2 cloudflared tunnel --loglevel debug --ha-connections 1 --protocol http2 --url localhost:8090
Again, ideally you would do this in a low-traffic tunnel. Thanks.
We are also experiencing this with docker-based tunnels running http2. We are on 2023.3.1. It started occurring on 17 March, when we were on 2023.1.0.
Hi, I have 2 tunnels running with version 2023.2.2.
Default protocol: auto
Two days ago, I had to change the protocol to quic for one tunnel. After monitoring for 2 days, I see unstable connections to the tunnels.
This is the monitoring for the tunnel where I changed the protocol to quic:
This is the monitoring for the tunnel with the protocol set to auto:
Any update on this problem? Thanks all.
I have also found relief by upgrading to QUIC.
One tip for doing this in AWS infrastructure: the AWS NAT gateways appear to rewrite the source address of inbound QUIC packets so that they appear to come from the NAT gateway rather than the Cloudflare edge servers. As a result, I had to add additional ACL ingress rules to cope with this. I have no idea why the NAT gateways do this, but it seems they do (our tunnel servers are in a different subnet from the NAT gateway).
QUIC is much more stable for me than HTTP2; the tunnels go down every few days. But HTTP2 is much faster in terms of latency and bandwidth (between 5-10x), despite failing almost daily. This is what an HTTP2 tunnel failure looks like:
Before switching to QUIC I was getting “Lost connection with the edge” about every 2 minutes. After switching to QUIC I have seen “timeout: no recent network activity” only a few times in 6-7 hours. That’s a huge change, and I am not worried about those infrequent errors.
I still have one http2 tunnel (not migrated to QUIC yet) throwing “Lost connection with the edge” about 10 times per hour (increased from yesterday).
It wasn’t very straightforward to get QUIC working on our end (had to change networks/gateways), but fortunately it was possible. I have to migrate other servers to this too.
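For comparing rates like "every 2 minutes" versus "10 times per hour" before and after a protocol switch, a small Python sketch that buckets disconnect messages by hour may help. It assumes the timestamp format seen in the logs earlier in this issue; `disconnects_per_hour` is a hypothetical helper name, and the `marker` argument can be set to "timeout: no recent network activity" to count the QUIC-side message instead.

```python
import re
from collections import Counter

# Captures the date and hour portion of timestamps like 2023-03-16T05:12:07Z.
HOUR_RE = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}):\d{2}:\d{2}Z")


def disconnects_per_hour(lines, marker="Lost connection with the edge"):
    """Count log lines containing `marker`, grouped by UTC date and hour."""
    counts = Counter()
    for line in lines:
        if marker not in line:
            continue
        m = HOUR_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return dict(counts)
```

The keys of the returned dict are strings such as "2023-03-16T05", so the output can be sorted or plotted directly.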