caddy: lb_policy header fails while an upstream is still available
caddy: 2.5.1
I’ve configured Caddy to do header-based load balancing with two gRPC backend servers (HTTP/2). This works perfectly fine as long as both backends are up. If I turn one of them off, I start to get HTTP 502 errors:
{
	"level": "error",
	"ts": 1653399173.2777977,
	"logger": "http.log.error.log15",
	"msg": "no upstreams available",
	"request": {
		"remote_ip": "xxxxxxxxxxxxx",
		"remote_port": "xxxx",
		"proto": "HTTP/2.0",
		"method": "POST",
		"host": "xxxxxxxxxxxx",
		"uri": "xxxxxxxxxxxxx",
		"headers": {
			"X-Customer-Id": ["xxxxxxxxxx"],
			....
		},
		"tls": {
			"resumed": false,
			"version": 772,
			"cipher_suite": 4865,
			"proto": "h2",
			"server_name": "xxxxxxxxxxxxx"
		}
	},
	"duration": 0.000044557,
	"status": 502,
	"err_id": "easyavvap",
	"err_trace": "reverseproxy.statusError (reverseproxy.go:1196)"
}
Relevant excerpt from the Caddyfile:
(header_lb) {
	header_up Host {upstream_hostport}
	header_down X-Backend-Server {upstream_hostport}
	lb_policy header X-Customer-Id
	lb_try_duration 2s
	fail_duration 1m
	unhealthy_status 5xx
	transport http {
		versions 2
	}
}
https://example.com {
	log {
		output stdout
		format console
		level WARN
	}
	tls {
		load /etc/caddy/certs
	}
	reverse_proxy https://backend1.com https://backend2.com {
		import header_lb
	}
}
According to the docs, this should be enough to detect a broken backend. I would expect Caddy to direct all requests to the remaining backend.
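As a point of comparison, it may be worth trying active health checks, which probe upstreams on a schedule instead of relying on fail_duration's passive marking after a failed request. A minimal sketch of the same setup with active checks; the /healthz path and the intervals are assumptions for illustration, not part of the original config:

https://example.com {
	reverse_proxy https://backend1.com https://backend2.com {
		lb_policy header X-Customer-Id
		# Active checks probe each upstream independently of live
		# traffic and take it out of rotation as soon as it stops
		# responding.
		health_uri /healthz
		health_interval 5s
		health_timeout 2s
		transport http {
			versions 2
		}
	}
}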
About this issue
- State: closed
- Created 2 years ago
- Comments: 29 (14 by maintainers)
The initial problems occurred while the header was present, but I was able to mimic the same results from my browser without the header. When I tested back then, the load-balancing policy didn’t seem to matter.
Thanks for the investigation. I will update to the latest version and see if I can still reproduce the problem on my end.
@mholt you can test with our backends:
Opening e.g. https://geo-osm-01.ot-hosting.de:8385/ in the browser will yield an HTTP 415 since it’s a gRPC endpoint, but that should still be enough for testing. You should be able to reproduce the HTTP 500 by adding a bogus backend.
I think we’ll just need to do some local debugging to try to trace it, but neither Matt nor I have had time to look into it yet.
Same problem here. It might be a coincidence, but things seem to work if Caddy receives an HTTP 5xx from the backend; if the backend is completely down, things break.
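For anyone trying to reproduce this without gRPC, a minimal sketch in the spirit of the "bogus backend" suggestion above, assuming a plain HTTP server on localhost:9001 and nothing listening on localhost:9002:

localhost {
	reverse_proxy localhost:9001 localhost:9002 {
		# Hash on the header; requests without it still go through
		# the configured policy.
		lb_policy header X-Customer-Id
		lb_try_duration 2s
		fail_duration 1m
	}
}

The expectation is that requests hashing to the dead upstream fail over to localhost:9001 within lb_try_duration; per the reports above, the affected versions instead return a 502 with "no upstreams available".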