caddy: Suggestion - Raise the default timeout in Caddy's graceful shutdown

1. What version of Caddy are you using (caddy -version)?

Caddy 0.10.4

2. What are you trying to do?

Use Caddy as a Kubernetes ingress controller & load balancer

3. What is your entire Caddyfile?


localhost:81 {
  status 200 /status
}

:12015 {
  status 200 /healthz
}

myhost.mywebsite.com/ {
  log / stdout "{combined} {upstream}"
  tls {$ACME_EMAIL} {
    # https://github.com/mholt/caddy/issues/189
    alpn http/1.1
  }
  proxy / {
    policy round_robin
    header_upstream -X-Forwarded-For
    keepalive 0

    upstream 10.48.75.4:8080
    upstream 10.48.76.3:8080
    upstream 10.48.72.4:8080
    upstream 10.48.70.3:8080
    upstream 10.48.49.4:8080
    upstream 10.48.73.3:8080
    upstream 10.48.71.3:8080
  }
}

4. How did you run Caddy (give the full command and describe the execution environment)?

Running via https://github.com/wehco/caddy-ingress-controller

/usr/bin/caddy -conf /etc/Caddyfile -log stdout

5. Please paste any relevant HTTP request(s) here.

68.xx..xx.xx - - [28/Jul/2017:15:58:20 +0000] "GET /redacted HTTP/1.1" 502 16 "-" "http-kit/2.0" http://10.48.66.3:8080
68.xx..xx.xx - - [28/Jul/2017:15:58:21 +0000] "GET /redacted HTTP/1.1" 200 97 "-" "http-kit/2.0" http://10.48.49.4:8080
68.xx..xx.xx - - [28/Jul/2017:15:58:23 +0000] "GET /redacted HTTP/1.1" 502 16 "-" "http-kit/2.0" http://10.48.56.3:8080
68.xx..xx.xx - - [28/Jul/2017:15:58:33 +0000] "GET /redacted HTTP/1.1" 200 98 "-" "http-kit/2.0" http://10.48.70.3:8080
68.xx..xx.xx - - [28/Jul/2017:15:58:38 +0000] "GET /redacted HTTP/1.1" 502 16 "-" "http-kit/2.0" http://10.48.59.5:8080
68.xx..xx.xx - - [28/Jul/2017:15:58:51 +0000] "GET /redacted HTTP/1.1" 502 16 "-" "http-kit/2.0" http://10.48.68.5:8080
68.xx..xx.xx - - [28/Jul/2017:15:58:53 +0000] "GET /redacted HTTP/1.1" 502 16 "-" "http-kit/2.0" http://10.48.57.3:8080
68.xx..xx.xx - - [28/Jul/2017:15:58:54 +0000] "GET /redacted HTTP/1.1" 200 97 "-" "http-kit/2.0" http://10.48.49.4:8080
68.xx..xx.xx - - [28/Jul/2017:15:58:54 +0000] "GET /redacted HTTP/1.1" 200 97 "-" "http-kit/2.0" http://10.48.49.4:8080
68.xx..xx.xx - - [28/Jul/2017:15:58:54 +0000] "GET /redacted HTTP/1.1" 200 107 "-" "http-kit/2.0" http://10.48.73.3:8080
68.xx..xx.xx - - [28/Jul/2017:15:58:54 +0000] "GET /redacted HTTP/1.1" 200 105 "-" "http-kit/2.0" http://10.48.71.3:8080

6. What did you expect to see?

After a SIGUSR1 reload, Caddy should proxy only to the upstreams listed in the current Caddyfile.

7. What did you see instead (give full error messages and/or log)?

Caddy keeps proxying to upstreams that were present in previously loaded Caddyfiles but have since been removed. In the log above, every 502 goes to an upstream that is no longer in the Caddyfile.

8. How can someone who is starting from scratch reproduce the bug as minimally as possible?

Configure a Caddyfile to proxy round robin to multiple upstreams, add and/or remove an upstream, and reload the Caddyfile with SIGUSR1. Repeat this a few times, then observe 502s going to the removed backends.

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 19 (10 by maintainers)

Most upvoted comments

I’ve finally resolved this issue. The graceful HTTP server shutdown was timing out: its default timeout (-grace) is 5s, while the upstream requests it was proxying were hanging, with a write timeout of 20s. Raising the -grace timeout to 30s resolved the issue, and the 502s going to ghost upstreams have not come back.

@mholt I think it would be a good idea to change the default timeout in Caddy’s graceful shutdown to 30s here: https://github.com/mholt/caddy/blob/master/caddyhttp/httpserver/plugin.go#L27. In my opinion it should be at least as long as the default request timeout, to prevent this kind of race/error.

@sundbry Oooo nice catch. Thanks for the detective work! I was beginning to wonder if there were timeouts involved but in what I had tried, I couldn’t replicate long timeouts like that.

Lemme see if we can do something about this automatically… I will close the issue once I’ve looked into it!