traefik: Random 500 Error returned by Traefik
Report a bug
What version of Traefik are you using (traefik version)?
v1.3.0
What did you do?
We are using Traefik as a load balancer with the Marathon provider in production.
Below is our setup:
Edge proxy: Haproxy
Internal proxy layer: Traefik
We use Haproxy as our edge proxy because of complex rewrites/redirects, and its backends are configured with the Traefik servers. All our internal service-to-service communication also happens via Traefik.
Here is our traffic flow
API Call --> Haproxy --> Traefik Servers --> Docker containers
Initially, when we introduced Traefik in production, we only switched the inter-service communication that used to go through internal Haproxy load balancers, and we did not see any issues. So we decided to route the edge proxy traffic through Traefik instead of sending it directly to the app containers. We applied this change to only 10% of our traffic so that we could observe it before switching all traffic.
So out of 3 edge Haproxy servers, only one sends traffic via Traefik; the other 2 send directly to the app containers.
What did you expect to see?
No Change in Behavior
What did you see instead?
After we made this change, we started observing random 500 errors returned by the Haproxy server that sends traffic via the Traefik servers. We don't see this issue on the other proxy servers for similar API calls. The failed requests do not appear to reach the backend app containers, as we don't see any events in their logs during this time.
Log Event from Haproxy
Aug 6 22:56:52 localhost.localdomain haproxy[118832]: ::ffff:12:23:234:12:35449 [06/Aug/2017:22:56:51.741] HTTPS~ appname/traefik-80 345/0/0/231/576 500 157 - - ---- 35/34/0/0/0 0/0 {api.abc.cloud||||okhttp/3.4.2} "GET /v1.1/abc/c45f21be84824984b34c8538bd8b519a/abc?offset=1000&count=100 HTTP/1.1"
Log Event from Traefik
101.110.24.120 - - [06/Aug/2017:22:56:52 -0700] "GET /api/abc/c45f21be84824984b34c8538bd8b519a/abc HTTP/1.1" 500 21 "" "okhttp/3.4.2" 44609802 "appname" "http://hostname:31020" 229ms
What is your environment & configuration (arguments, toml, provider, platform, …)?
configuration
################################################################
# Global configuration
################################################################
# Timeout in seconds.
# Duration to give active requests a chance to finish during hot-reloads
#
# Optional
# Default: 10
#
graceTimeOut = 10
# Traefik logs file
# If not defined, logs to stdout
#
# Optional
#
traefikLogsFile = "/var/log/traefik/traefik.log"
# Access logs file
#
# Optional
#
accessLogsFile = "/var/log/traefik/traefik-access.log"
# Log level
#
# Optional
# Default: "ERROR"
#
logLevel = "DEBUG"
# Backends throttle duration: minimum duration between 2 events from providers
# before applying a new configuration. It avoids unnecessary reloads if multiples events
# are sent in a short amount of time.
#
# Optional
# Default: "2s"
#
ProvidersThrottleDuration = 1
# If non-zero, controls the maximum idle (keep-alive) to keep per-host. If zero, DefaultMaxIdleConnsPerHost is used.
# If you encounter 'too many open files' errors, you can either change this value, or change `ulimit` value.
#
# Optional
# Default: http.DefaultMaxIdleConnsPerHost
#
MaxIdleConnsPerHost = 200
# If set to true invalid SSL certificates are accepted for backends.
# Note: This disables detection of man-in-the-middle attacks so should only be used on secure backend networks.
# Optional
# Default: false
#
InsecureSkipVerify = false
# Entrypoints to be used by frontends that do not specify any entrypoint.
# Each frontend can specify its own entrypoints.
#
# Optional
# Default: ["http"]
#
defaultEntryPoints = ["http"]
################################################################
# Web configuration backend
################################################################
[web]
address = ":8080"
################################################################
# Mesos/Marathon configuration backend
################################################################
# Enable Marathon configuration backend
#
# Optional
#
#[marathon]
[marathon]
# Marathon server endpoint.
# You can also specify multiple endpoint for Marathon:
# endpoint := "http://10.241.1.71:8080,10.241.1.72:8080,10.241.1.73:8080"
#
# Required
#
endpoint = "http://101.211.1.11:8080,101.211.1.13:8080,101.211.1.13:8080"
# Enable watch Marathon changes
#
# Optional
#
watch = true
# Default domain used.
#
# Required
#
domain = "traefik.service.consul"
# Override default configuration template. For advanced users :)
#
# Optional
#
# filename = "marathon.tmpl"
filename = "/opt/traefik/conf/marathon.tmpl"
# Expose Marathon apps by default in traefik
#
# Optional
# Default: false
#
exposedByDefault = false
# Convert Marathon groups to subdomains
# Default behavior: /foo/bar/myapp => foo-bar-myapp.{defaultDomain}
# with groupsAsSubDomains enabled: /foo/bar/myapp => myapp.bar.foo.{defaultDomain}
#
# Optional
# Default: false
#
groupsAsSubDomains = false
# Enable Marathon basic authentication
#
# Optional
#
[marathon.basic]
httpBasicAuthUser = "xxx"
httpBasicPassword = "xxxxxxx"
# To enable more detailed statistics
[web.statistics]
RecentErrors = 10
# To enable Traefik to export internal metrics to Prometheus
[web.metrics.prometheus]
Buckets=[0.1,0.3,1.2,5.0]
# TLS client configuration. https://golang.org/pkg/crypto/tls/#Config
#
# Optional
#
# [marathon.TLS]
# InsecureSkipVerify = true
[retry]
We use the Traefik health check. Below are the Marathon labels for one of the services where we see the failure:
"labels": {
"consul": "servicename-5654",
"external": "tag",
"internal": "tag",
"prod": "tag",
"5654": "tag",
"traefik.backend.healthcheck.path": "/healthcheck",
"traefik.backend.healthcheck.interval": "10s",
"traefik.frontend.rule": "PathPrefixStrip:/servicename",
"traefik.backend": "-servicename",
"traefik.portIndex": "0",
"traefik.enable": "true"
},
We are not able to identify a pattern. We were running the Traefik servers on c4.xlarge and changed the instance type to c4.2xlarge to check whether this was a network bandwidth or resource issue. We would appreciate your help in troubleshooting this issue.
About this issue
- State: closed
- Created 7 years ago
- Reactions: 3
- Comments: 18 (8 by maintainers)
We’re experiencing the same issue: requests where the client disappears before a response could be sent are logged as errors.
The problem is that it’s really easy to reproduce. You can just load a site in your browser that takes 1s+ to load and hit reload multiple times before the page is fully rendered to “reissue” the request. The client now disappeared for the first requests and traefik logs a 500.
I think it’s debatable whether this should be reported as a status 500 error. IMHO it’s not an error. But even if we classify it as an error, I don’t think it should be reported as an “Internal Server Error”.
Our alerting system (based on grafana + prometheus) goes nuts if users are impatient and hit reload multiple times.