traefik: Random 500 Error returned by Traefik

Report a bug

What version of Traefik are you using (traefik version)?

v1.3.0

What did you do?

We use Traefik as a load balancer with the Marathon provider in our production environment.

Below is our setup:

Edge Proxy - Haproxy
Internal Proxy Layer - Traefik

We use Haproxy as our edge proxy because of complex rewrites/redirects, and its backends are configured with the Traefik servers. All our internal service-to-service communication also happens via Traefik.

Here is our traffic flow

API Call --> Haproxy --> Traefik Servers --> Docker containers

Initially, when we introduced Traefik in Prod, we only switched the inter-service communication that used to go through internal Haproxy load balancers, and we did not see any issues. So we decided to route the edge proxy traffic through Traefik instead of sending it directly to the app containers. We made this change for only 10% of our traffic so we could observe it before switching all traffic.

So out of 3 edge Haproxy servers, only one sends traffic via Traefik and the other 2 send directly to the app containers.

What did you expect to see?

No Change in Behavior

What did you see instead?

After we made this change, we started observing random 500 errors returned by the Haproxy server that sends traffic via the Traefik servers. We don't see the issue on the other proxy servers for similar API calls. The failed requests do not appear to reach the backend app containers, as we don't see any events in their logs during this time.

Log Event from Haproxy

Aug 6 22:56:52 localhost.localdomain haproxy[118832]: ::ffff:12:23:234:12:35449 [06/Aug/2017:22:56:51.741] HTTPS~ appname/traefik-80 345/0/0/231/576 500 157 - - ---- 35/34/0/0/0 0/0 {api.abc.cloud||||okhttp/3.4.2} "GET /v1.1/abc/c45f21be84824984b34c8538bd8b519a/abc?offset=1000&count=100 HTTP/1.1"

Log Event from Traefik

101.110.24.120 - - [06/Aug/2017:22:56:52 -0700] "GET /api/abc/c45f21be84824984b34c8538bd8b519a/abc HTTP/1.1" 500 21 "" "okhttp/3.4.2" 44609802 "appname" "http://hostname:31020" 229ms
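For anyone trying to correlate the two entries above, here is a minimal sketch that is not part of the original report. It assumes the five slash-separated numbers in the Haproxy line are the usual HTTP-log timers in milliseconds (request/queue/connect/response/total, named Tq/Tw/Tc/Tr/Tt in Haproxy 1.7 and earlier) and that the Traefik line ends with the request duration. The regexes and the helper names parse_haproxy and parse_traefik are hypothetical.

```python
import re

HAPROXY_LINE = (
    'Aug 6 22:56:52 localhost.localdomain haproxy[118832]: '
    '::ffff:12:23:234:12:35449 [06/Aug/2017:22:56:51.741] HTTPS~ '
    'appname/traefik-80 345/0/0/231/576 500 157 - - ---- 35/34/0/0/0 0/0 '
    '{api.abc.cloud||||okhttp/3.4.2} '
    '"GET /v1.1/abc/c45f21be84824984b34c8538bd8b519a/abc'
    '?offset=1000&count=100 HTTP/1.1"'
)
TRAEFIK_LINE = (
    '101.110.24.120 - - [06/Aug/2017:22:56:52 -0700] '
    '"GET /api/abc/c45f21be84824984b34c8538bd8b519a/abc HTTP/1.1" 500 21 "" '
    '"okhttp/3.4.2" 44609802 "appname" "http://hostname:31020" 229ms'
)

# Assumed layout: "<backend>/<server> Tq/Tw/Tc/Tr/Tt <status> ..."
HAPROXY_RE = re.compile(
    r'\s(?P<backend>[^\s/]+)/(?P<server>[^\s/]+)'
    r'\s(?P<timers>[-+\d]+(?:/[-+\d]+){4})'
    r'\s(?P<status>\d{3})\s'
)

# Assumed layout: '"<method> <path> <proto>" <status> <size> ... <duration>ms'
TRAEFIK_RE = re.compile(
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) .* (?P<duration_ms>\d+)ms$'
)


def parse_haproxy(line):
    """Extract server, status and the response/total timers (ms)."""
    m = HAPROXY_RE.search(line)
    if m is None:
        raise ValueError("unrecognized Haproxy line")
    tq, tw, tc, tr, tt = (int(x) for x in m.group("timers").split("/"))
    return {
        "server": m.group("server"),
        "status": int(m.group("status")),
        "tr_ms": tr,  # time Haproxy waited for the server's response
        "tt_ms": tt,  # total session time seen by Haproxy
    }


def parse_traefik(line):
    """Extract status, response size and duration (ms)."""
    m = TRAEFIK_RE.search(line)
    if m is None:
        raise ValueError("unrecognized Traefik line")
    return {
        "status": int(m.group("status")),
        "size": int(m.group("size")),
        "duration_ms": int(m.group("duration_ms")),
    }


if __name__ == "__main__":
    hap = parse_haproxy(HAPROXY_LINE)
    tra = parse_traefik(TRAEFIK_LINE)
    print(hap)
    print(tra)
    print("response-time gap (ms):", hap["tr_ms"] - tra["duration_ms"])
```

If those assumptions hold, Haproxy waited 231 ms for the response and Traefik reports 229 ms for its own handling, so the two entries plausibly describe the same request, and the 500 would have been produced at or behind Traefik rather than by Haproxy itself.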

What is your environment & configuration (arguments, toml, provider, platform, …)?

configuration
################################################################
# Global configuration
################################################################

# Timeout in seconds.
# Duration to give active requests a chance to finish during hot-reloads
#
# Optional
# Default: 10
#
graceTimeOut = 10

# Traefik logs file
# If not defined, logs to stdout
#
# Optional
#
traefikLogsFile = "/var/log/traefik/traefik.log"


# Access logs file
#
# Optional
#
accessLogsFile = "/var/log/traefik/traefik-access.log"

# Log level
#
# Optional
# Default: "ERROR"
#
logLevel = "DEBUG"

# Backends throttle duration: minimum duration between 2 events from providers
# before applying a new configuration. It avoids unnecessary reloads if multiple events
# are sent in a short amount of time.
#
# Optional
# Default: "2s"
#
ProvidersThrottleDuration = 1


# If non-zero, controls the maximum idle (keep-alive) connections to keep per-host.  If zero, DefaultMaxIdleConnsPerHost is used.
# If you encounter 'too many open files' errors, you can either change this value, or change `ulimit` value.
#
# Optional
# Default: http.DefaultMaxIdleConnsPerHost
#
MaxIdleConnsPerHost = 200

# If set to true invalid SSL certificates are accepted for backends.
# Note: This disables detection of man-in-the-middle attacks so should only be used on secure backend networks.
# Optional
# Default: false
#
InsecureSkipVerify = false

# Entrypoints to be used by frontends that do not specify any entrypoint.
# Each frontend can specify its own entrypoints.
#
# Optional
# Default: ["http"]
#
defaultEntryPoints = ["http"]

################################################################
# Web configuration backend
################################################################

[web]
address = ":8080"

################################################################
# Mesos/Marathon configuration backend
################################################################

# Enable Marathon configuration backend
#
# Optional
#
#[marathon]
[marathon]


# Marathon server endpoint.
# You can also specify multiple endpoints for Marathon:
# endpoint := "http://10.241.1.71:8080,10.241.1.72:8080,10.241.1.73:8080"
#
# Required
#
endpoint = "http://101.211.1.11:8080,101.211.1.13:8080,101.211.1.13:8080"

# Enable watch Marathon changes
#
# Optional
#
watch = true

# Default domain used.
#
# Required
#
domain = "traefik.service.consul"

# Override default configuration template. For advanced users :)
#
# Optional
#
# filename = "marathon.tmpl"
filename = "/opt/traefik/conf/marathon.tmpl"

# Expose Marathon apps by default in traefik
#
# Optional
# Default: false
#
exposedByDefault = false


# Convert Marathon groups to subdomains
# Default behavior: /foo/bar/myapp => foo-bar-myapp.{defaultDomain}
# with groupsAsSubDomains enabled: /foo/bar/myapp => myapp.bar.foo.{defaultDomain}
#
# Optional
# Default: false
#
groupsAsSubDomains = false

# Enable Marathon basic authentication
#
# Optional
#
[marathon.basic]
httpBasicAuthUser = "xxx"
httpBasicPassword = "xxxxxxx"

# To enable more detailed statistics
[web.statistics]
   RecentErrors = 10
# To enable Traefik to export internal metrics to Prometheus
[web.metrics.prometheus]
   Buckets=[0.1,0.3,1.2,5.0]

# TLS client configuration. https://golang.org/pkg/crypto/tls/#Config
#
# Optional
#
# [marathon.TLS]
# InsecureSkipVerify = true


[retry]

We use the Traefik health check; below are the Marathon labels for one of the services where we see the failure:

  "labels": {
    "consul": "servicename-5654",
    "external": "tag",
    "internal": "tag",
    "prod": "tag",
    "5654": "tag",
    "traefik.backend.healthcheck.path": "/healthcheck",
    "traefik.backend.healthcheck.interval": "10s",
    "traefik.frontend.rule": "PathPrefixStrip:/servicename",
    "traefik.backend": "-servicename",
    "traefik.portIndex": "0",
    "traefik.enable": "true"
  },

We are not able to identify a pattern. We were running the Traefik servers on c4.xlarge instances and changed the instance type to c4.2xlarge to check whether there was a network bandwidth or resource issue. We would appreciate your help in troubleshooting this issue.

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 3
  • Comments: 18 (8 by maintainers)

Most upvoted comments

We’re experiencing the same issue: requests where the client disappears before a response could be sent are logged as errors.

The problem is that it’s really easy to reproduce. Load a site in your browser that takes 1s+ to load and hit reload multiple times before the page is fully rendered, which “reissues” the request. The client has now disappeared for the earlier requests, and Traefik logs a 500 for them.
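The “client disappears” condition can be observed without a browser. The following is a minimal sketch using only the Python standard library; it does not involve Traefik or model its internals. A deliberately slow handler checks, after sleeping, whether its peer has already closed the connection. The names SlowHandler, request and run_demo, and the 0.3 s delay, are assumptions made for illustration.

```python
import http.server
import select
import socket
import threading
import time

observed = []  # one entry per request: True if the client was already gone


class SlowHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(0.3)  # simulate a page that is slow to render
        # A socket that is readable but yields b"" on peek was closed by the peer.
        readable, _, _ = select.select([self.connection], [], [], 0)
        gone = bool(readable) and self.connection.recv(1, socket.MSG_PEEK) == b""
        observed.append(gone)
        if gone:
            self.close_connection = True
            return  # nobody is left to receive a response
        self.send_response(200)
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep the demo quiet


def request(address, abort):
    """Send one GET; if abort is set, close without reading (like hitting reload)."""
    with socket.create_connection(address) as s:
        s.sendall(b"GET / HTTP/1.0\r\n\r\n")
        if abort:
            return None
        chunks = []
        while True:
            data = s.recv(1024)
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks)


def run_demo():
    observed.clear()
    server = http.server.HTTPServer(("127.0.0.1", 0), SlowHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        request(server.server_address, abort=True)   # impatient client
        response = request(server.server_address, abort=False)  # patient client
    finally:
        server.shutdown()
        server.server_close()
    return list(observed), response


if __name__ == "__main__":
    print(run_demo())
```

The server here handles requests one at a time, so the second request only completes after the first handler has recorded its result. How such an aborted request should be logged is the question raised in this thread; the sketch only shows that the server side can tell the two cases apart.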

I think it’s debatable whether this should be reported as a status 500 error. IMHO it’s not an error. But even if we classify it as an error, I don’t think it should be reported as an “Internal Server Error”.

Our alerting system (based on grafana + prometheus) goes nuts if users are impatient and hit reload multiple times.