caddy: lb_policy header fails while an upstream is still available

caddy: 2.5.1

I’ve configured Caddy to do header-based load balancing with two gRPC backend servers (HTTP/2). This works perfectly fine as long as both backends are up. If I turn one of them off, I start to get HTTP 502 errors:

{
   "level":"error",
   "ts":1653399173.2777977,
   "logger":"http.log.error.log15",
   "msg":"no upstreams available",
   "request":{
      "remote_ip":"xxxxxxxxxxxxx",
      "remote_port":"xxxx",
      "proto":"HTTP/2.0",
      "method":"POST",
      "host":"xxxxxxxxxxxx",
      "uri":"xxxxxxxxxxxxx",
      "headers":{
          "X-Customer-Id":["xxxxxxxxxx"]
          ....
      },
      "tls":{
         "resumed":false,
         "version":772,
         "cipher_suite":4865,
         "proto":"h2",
         "server_name":"xxxxxxxxxxxxx"
      }
   },
   "duration":0.000044557,
   "status":502,
   "err_id":"easyavvap",
   "err_trace":"reverseproxy.statusError (reverseproxy.go:1196)"
}

Relevant excerpt from the Caddyfile:

(header_lb) {
	header_up Host {upstream_hostport}
	header_down X-Backend-Server {upstream_hostport}
	lb_policy header X-Customer-Id
	lb_try_duration 2s
	fail_duration 1m
	unhealthy_status 5xx

	transport http {
		versions 2
	}
}

https://example.com {
	log {
		output stdout
		format console
		level WARN
	}

	tls {
		load /etc/caddy/certs
	}

	reverse_proxy https://backend1.com https://backend2.com {
		import header_lb
	}
}

According to the docs, this should be enough to detect a broken backend. I would expect Caddy to direct all requests to the remaining backend.
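
For completeness, active health checks could be layered on top of the passive ones so a dead backend gets probed and removed even between client requests. This is only a sketch, not my real config: the probe path and expected status are assumptions about the backends.

(header_lb_active) {
	header_up Host {upstream_hostport}
	header_down X-Backend-Server {upstream_hostport}
	lb_policy header X-Customer-Id
	lb_try_duration 2s
	fail_duration 1m
	unhealthy_status 5xx

	# Active health checks: probe each upstream on an interval so an
	# unreachable backend is marked down even between client requests.
	health_uri /          # assumed probe path
	health_interval 10s
	health_timeout 2s
	health_status 2xx     # assumed; a gRPC backend may answer plain GETs differently, adjust as needed

	transport http {
		versions 2
	}
}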

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 29 (14 by maintainers)

Most upvoted comments

omitting the X-Customer-Id header in the request

The initial problems occurred while the header was present, but I was able to mimic the same results from my browser without the header. When I tested back then, the configured policy didn’t seem to matter.

Thanks for the investigation. I will update to the latest version and see if I can still reproduce the problem on my end.
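
For reference, the quickest sanity check on whether the policy matters at all would be a test copy of the snippet with the header policy swapped for plain round-robin. The snippet name below is made up; the directives themselves are standard reverse_proxy options.

(header_lb_rr) {
	# Same as (header_lb), but with round_robin instead of the header policy,
	# to see whether the 502s depend on header-based selection at all.
	header_up Host {upstream_hostport}
	header_down X-Backend-Server {upstream_hostport}
	lb_policy round_robin
	lb_try_duration 2s
	fail_duration 1m
	unhealthy_status 5xx

	transport http {
		versions 2
	}
}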

@mholt you can test with our backends:

Caddyfile
(common) {
	log {
		output stdout
		format console
		level WARN
	}
	log {
		output file /var/log/caddy/{args.0}.log {
			roll_size 10MiB
			roll_keep 20
			roll_keep_for 7d
		}
		format console
	}
}

(manual) {
	# Manual certificates and keys
	############################################################
	# Load all certificates and keys from .pem files found in this folder
	tls {
		load /etc/caddy/certs
	}
	############################################################
}

(header_lb) {
	header_up Host {upstream_hostport}
	header_down X-Backend-Server {upstream_hostport}
	lb_policy header X-Optitool-Installation
	lb_try_duration 5s
	fail_duration 1m
	unhealthy_status 5xx

	transport http {
		versions 2
	}
}

https://routing.ot-hosting.de {
	import common routingserver
	import manual
	reverse_proxy https://geo-osm-01.ot-hosting.de:8385 https://geo-osm-02.ot-hosting.de:8385 https://geo-osm-03.ot-hosting.de:8385 {
		import header_lb
	}
}

Opening e.g. https://geo-osm-01.ot-hosting.de:8385/ in the browser will yield an HTTP 415, as it’s a gRPC endpoint, but that should still be enough for testing. You should be able to reproduce the HTTP 500 by adding a bogus backend.
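
For example, a test site with one deliberately unreachable upstream could look like the following; the bogus hostname is made up, everything else mirrors the config above.

https://routing.ot-hosting.de {
	import common routingserver
	import manual
	# One real backend plus an unreachable one to trigger the error.
	reverse_proxy https://geo-osm-01.ot-hosting.de:8385 https://bogus-backend.invalid:8385 {
		import header_lb
	}
}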

I think we’ll just need to do some local debugging to try to trace it, but neither Matt nor I have had time to look into it yet.

Same problem here. It might be a coincidence, but things seem to work if Caddy receives an HTTP 5xx; if the backend is completely down, things seem to break.
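
For what it’s worth, my reading of the passive health directives in the snippet above is that the two cases go through different signals. The comments below are my interpretation, not something confirmed by the maintainers.

# Trimmed copy of (header_lb), annotated with my understanding of each directive:
(header_lb) {
	lb_policy header X-Customer-Id
	lb_try_duration 2s   # keep trying other available upstreams for up to 2s per request
	fail_duration 1m     # how long a counted failure keeps an upstream marked as down
	unhealthy_status 5xx # additionally count 5xx responses as failures
	                     # (outright connection errors should already count as failures on their own)
}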