traefik: Support override of HTTP 404 when service does not exist or is down

What version of Traefik are you using (traefik version)?

Version:      dev
Codename:     cheddar
Go version:   go1.7.4
Built:        I don't remember exactly
OS/Arch:      darwin/amd64

Request

Quite often, sending HTTP 404 when a service is down (e.g. a Consul health check fails) is not the correct behavior for either browsers or API clients, and we’d much rather send HTTP 502 or HTTP 503.

I’m running a fork right now that overrides the NotFoundHandler for the default mux - but it would be nice if this became a configurable option, perhaps even with a Go template.

    func badGatewayHandler(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusBadGateway)
        fmt.Fprintf(w, "502 bad gateway")
    }

    func (server *Server) buildDefaultHTTPRouter() *mux.Router {
        router := mux.NewRouter()
        // router.NotFoundHandler = http.HandlerFunc(notFoundHandler)
        router.NotFoundHandler = http.HandlerFunc(badGatewayHandler)
        router.StrictSlash(true)
        router.SkipClean(true)
        return router
    }
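For illustration, a configurable version of that handler could look roughly like the sketch below. It uses only the standard library; `fallbackConfig`, `fallbackData`, `newFallbackHandler` and `render` are made-up names for this example and are not Traefik options or APIs.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"text/template"
)

// fallbackConfig is a hypothetical option set, not an existing Traefik setting.
type fallbackConfig struct {
	Status int    // status code sent when no route matches
	Body   string // text/template source rendered into the response body
}

// fallbackData holds the values the body template may reference.
type fallbackData struct {
	Status int
	Text   string
	Host   string
}

// newFallbackHandler parses the template once and returns a handler that
// answers every request with the configured status and rendered body.
func newFallbackHandler(cfg fallbackConfig) (http.Handler, error) {
	tmpl, err := template.New("fallback").Parse(cfg.Body)
	if err != nil {
		return nil, err
	}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; charset=utf-8")
		w.WriteHeader(cfg.Status)
		// The status line is already sent, so a render error is ignored here.
		_ = tmpl.Execute(w, fallbackData{
			Status: cfg.Status,
			Text:   http.StatusText(cfg.Status),
			Host:   r.Host,
		})
	}), nil
}

// render runs the handler against an in-memory request so the example can be
// checked without opening a socket.
func render(cfg fallbackConfig, host string) (int, string) {
	h, err := newFallbackHandler(cfg)
	if err != nil {
		return 0, err.Error()
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "http://"+host+"/", nil))
	return rec.Code, rec.Body.String()
}

func main() {
	cfg := fallbackConfig{Status: http.StatusBadGateway, Body: "{{.Status}} {{.Text}} for {{.Host}}"}
	code, body := render(cfg, "app.example.com")
	fmt.Println(code, body)
}
```

With gorilla/mux, the returned handler could be assigned to `router.NotFoundHandler` in place of the hard-coded one.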

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 5
  • Comments: 32 (7 by maintainers)

Most upvoted comments

As this comes up pretty easily in searches about not wanting the default 404 for missing/down services, I thought I’d drop a note here. We do dynamic traefik configuration in docker swarm, and the tree of issues/PRs related to this issue was, as far as we can tell at the time of writing, not solving our problem. We needed traefik to respond with a 50x instead of a 404 when a service was unavailable in a swarm, regardless of whether it was registered with traefik. Returning 404 when a service isn’t present on a swarm feels wrong: in our case it caused the client side not to retry against other healthy swarms.

This is hacky… but we ended up deploying a web service that returns HTTP 503 on any request. We then registered this service with a catch-all rule and a lower priority. In our testing, this works pretty well.

traefik.frontend.rule=HostRegexp:{catchall:.*}
traefik.frontend.priority=1
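A minimal sketch of such a catch-all service, assuming a plain Go HTTP server is acceptable. The function names and the idea of listening on a port like `:8080` are assumptions for the example, not details of the original deployment.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// unavailable answers every request, whatever the method or path, with 503.
func unavailable(w http.ResponseWriter, r *http.Request) {
	http.Error(w, "503 service unavailable", http.StatusServiceUnavailable)
}

// serve starts the catch-all service; in a container this would be the
// entry point, e.g. serve(":8080") (the port is an assumption).
func serve(addr string) error {
	return http.ListenAndServe(addr, http.HandlerFunc(unavailable))
}

// statusFor exercises the handler in memory and reports the status code.
func statusFor(method, path string) int {
	rec := httptest.NewRecorder()
	unavailable(rec, httptest.NewRequest(method, path, nil))
	return rec.Code
}

func main() {
	// Self-check instead of binding a port, so the example terminates.
	fmt.Println(statusFor(http.MethodGet, "/any/path"))
	fmt.Println(statusFor(http.MethodPost, "/api/v1/items"))
}
```

Because the catch-all frontend has the lowest priority, it should only receive requests that no real frontend matched.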

@dtomcej there is a difference between page not found and lack of service. RFC 2616 (https://www.ietf.org/rfc/rfc2616.txt) describes when HTTP 502 should be used:

10.5.3 502 Bad Gateway

   The server, while acting as a gateway or proxy, received an invalid
   response from the upstream server it accessed in attempting to
   fulfill the request.

Lack of service or entire service down is not the same as page not found.

I second the configurable HTTP status code for non-reachable backends, configurable per frontend and/or at least as a global default. The problem with the general 404 is that it confuses bots and also users. 502/503 is the better error code in most situations.

To me it looks like we have two different, desirable behavior changes here:

  1. Return a configurable error code (e.g., something from the 50[234] range) if either a target frontend or its backends are missing.
  2. Return a configurable error code if the backends to a known frontend are missing, and a 404 if the frontend is unknown.

Would that be a correct observation?
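To make the difference between the two behaviours concrete, here is a small decision function. It is only a sketch: the `mode` type, its constants and `errorStatus` are illustrative names and do not correspond to anything in Traefik or in the in-flight PR.

```go
package main

import "fmt"

// mode selects between the two behaviours listed above (illustrative names).
type mode int

const (
	alwaysConfigured  mode = iota + 1 // behaviour 1: configured code in both cases
	notFoundIfUnknown                 // behaviour 2: 404 for an unknown frontend
)

// errorStatus returns the status to send instead of proxying, or 0 when the
// request matches a frontend that has backends and can be proxied normally.
func errorStatus(m mode, frontendKnown, backendsAvailable bool, configured int) int {
	switch {
	case frontendKnown && backendsAvailable:
		return 0
	case !frontendKnown && m == notFoundIfUnknown:
		return 404
	default:
		return configured
	}
}

func main() {
	fmt.Println(errorStatus(alwaysConfigured, false, false, 503))  // unknown frontend, behaviour 1
	fmt.Println(errorStatus(notFoundIfUnknown, false, false, 503)) // unknown frontend, behaviour 2
	fmt.Println(errorStatus(notFoundIfUnknown, true, false, 503))  // known frontend, no backends
	fmt.Println(errorStatus(notFoundIfUnknown, true, true, 503))   // healthy route
}
```

The two modes only differ for an unknown frontend, which is exactly the case where a briefly disappearing frontend would otherwise produce a 404.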

Personally, I am also worried about negative caching effects that may occur in a dynamic, microservices-driven environment where frontends can easily (but inadvertently) disappear for short periods of time. Hence, approach 1. (which the current in-flight PR implements AFAICS) seems to be the safer one from this perspective. I’m open to discussion and possible re-adjustment, though.

Even if there is no service for a given name, I’d rather have it respond with 502 than 404.

In general a 404 is not an error, while a backend that should be there but isn’t clearly is an error. With static configuration that’s easy: Once configured, it knows what should be there. That doesn’t work with traefik. So instead I’d like to have an option to always return 502 (think proxy vs webserver mode) if the request doesn’t match a service.

Now that I think about this more (so far using traefik only for pet projects), this would be a hard requirement for me to introduce traefik to production.

I absolutely agree with @discordianfish and @grobinson-blockchain. This is not a client error but a server error. Why shouldn’t it be possible to return 502 if there is no route to a backend? This would be an indicator for a missing backend and could be used to differentiate 404 and 503 errors. Additionally, this will break caching in Varnish and other caches or reverse proxies. IMHO, this is a major blocker for production usage.

The other issue is that 404 is a client error code, not a server error. However, if a service doesn’t exist, because it is down, this is not a client error.

Seems like the err and exists should be split into two different if blocks.

https://github.com/containous/traefik/blob/master/provider/kubernetes.go#L191

If err != nil traefik should probably keep the current state and if !exists it should probably do what it does now, remove the current service.
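A rough sketch of what that split could look like. This is not the actual kubernetes.go code: `lookup`, `loadBackends`, `fakeLookup` and `errAPI` are hypothetical stand-ins for the provider's service lookup.

```go
package main

import (
	"errors"
	"fmt"
)

// errAPI simulates a transient failure of the orchestrator API.
var errAPI = errors.New("api unavailable")

// lookup stands in for the provider's service lookup: it reports the
// endpoint, whether the service exists, and any API error.
type lookup func(name string) (endpoint string, exists bool, err error)

// loadBackends handles the two conditions separately: an API error aborts
// the whole sync, while a service that does not exist is simply skipped.
func loadBackends(names []string, get lookup) (map[string]string, error) {
	backends := make(map[string]string)
	for _, name := range names {
		endpoint, exists, err := get(name)
		if err != nil {
			// Unknown state: report it so the caller can keep the old config.
			return nil, fmt.Errorf("looking up service %q: %w", name, err)
		}
		if !exists {
			// Legitimately gone: leave it out of the new configuration.
			continue
		}
		backends[name] = endpoint
	}
	return backends, nil
}

// fakeLookup builds a lookup over a fixed map; the name in failing (if any)
// triggers errAPI.
func fakeLookup(known map[string]string, failing string) lookup {
	return func(name string) (string, bool, error) {
		if name == failing {
			return "", false, errAPI
		}
		endpoint, ok := known[name]
		return endpoint, ok, nil
	}
}

func main() {
	known := map[string]string{"web": "10.0.0.1:80"}
	backends, err := loadBackends([]string{"web", "gone"}, fakeLookup(known, ""))
	fmt.Println(backends, err)
	_, err = loadBackends([]string{"web", "api"}, fakeLookup(known, "api"))
	fmt.Println(err)
}
```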

That’s not true. Traefik maintains a cached state of the last good configuration. All I’m saying is it should continue to use the last good configuration when it encounters an API error during the sync process. Right now it just logs the error and throws it away, and then acts like the sync succeeded and swaps out the last known config with a corrupt config. If it were simply to return errors here:

https://github.com/containous/traefik/blob/master/provider/kubernetes.go#L194

Instead of a continue statement, then it wouldn’t override the config here:

https://github.com/containous/traefik/blob/master/provider/kubernetes.go#L88

And take down the entire router because of a temporary API error. The current behavior can’t possibly be desired by anyone; it’s certainly enough to make us stop using traefik if it’s expected to just completely fall over because it fails to make an API call every once in a while.
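The "keep the last good configuration" idea can be sketched like this. The `router`, `config`, `okLoad` and `failLoad` names are invented for the example and do not mirror Traefik's internal types.

```go
package main

import (
	"errors"
	"fmt"
)

// config maps a service name to its endpoint (simplified for the example).
type config map[string]string

// router holds the configuration currently used for routing.
type router struct {
	active config
}

// sync loads a new configuration and swaps it in only if loading succeeded.
// On error the previously active configuration stays in place.
func (r *router) sync(load func() (config, error)) error {
	next, err := load()
	if err != nil {
		return fmt.Errorf("sync failed, keeping last good config: %w", err)
	}
	r.active = next
	return nil
}

// okLoad returns a loader that always succeeds with c.
func okLoad(c config) func() (config, error) {
	return func() (config, error) { return c, nil }
}

// failLoad returns a loader that simulates a temporary API error.
func failLoad() func() (config, error) {
	return func() (config, error) { return nil, errors.New("temporary API error") }
}

func main() {
	r := &router{}
	fmt.Println(r.sync(okLoad(config{"web": "10.0.0.1:80"})))
	fmt.Println(r.sync(failLoad()))
	// The failed sync did not wipe the routes.
	fmt.Println(r.active["web"])
}
```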

No, in the case you described, nginx is NOT stateless. It is aware that app SHOULD exist, even without there being any servers in its pool.

As stated earlier, you can manually create a Frontend/Backend as per https://docs.traefik.io/toml/#file-backend.

If you populate a backend with no servers (or with fictitious servers that don’t exist), you should get the same behaviour.

You can also use multiple providers at the same time (you can have consul handle some things, and the file provider handle others).

When a service does not exist, the correct response is 404. If you dynamically generate and configure services, it can be tempting to want to return a 50x.

Unfortunately, as far as the proxy is concerned, it is designed to be stateless, relying on providers for configuration. If your provider does not respond, or has missing services, traefik has no way to tell if that was legitimately deleted, or is missing, and therefore has to treat it as the former.

If you are having issues with your provider not passing health checks, etc., you may want to look at providing a frontend/backend through a different provider (such as the TOML file provider) that can provide the persistence you are looking for.