traefik: Traefik 3.0: StartTLS connection hanging if connection initiated when upstream unavailable
Welcome!
- Yes, I’ve searched similar issues on GitHub and didn’t find any.
- Yes, I’ve searched similar issues on the Traefik community forum and didn’t find any.
What did you do?
Hello, I am using the pre-release Traefik version 3 for this feature: https://github.com/traefik/traefik/pull/9377
It is working great, but we have noticed one possible issue.
If a client establishes a connection to Traefik while the upstream server is unavailable, the connection hangs open. The connection stays in that state and never completes the connection to the upstream, even after the upstream becomes available.
What did you see instead?
Example order of events:
- upstream 1 replica
- pod deleted
- new pod starting
- client sends a request to that workload through a Traefik StartTLS connection (the TCP connection is accepted and remains open, but the connection to the upstream is never completed)
- pod is up and ready
- the TCP connection from the client to Traefik remains open and is never connected to the upstream (the pod that is now ready); an example is sketched below
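For illustration, a sketch of what the client sees during that window; the database name and sslmode are assumptions, and the host mirrors the masked SNI from the IngressRouteTCP further down:
# Run while the upstream pod is restarting: Traefik accepts the TCP
# connection and the StartTLS exchange, but the connection to the upstream
# is never completed, so the command hangs with no error, even after the
# pod becomes Ready.
psql "host=*****.data-1.use1.****.com port=5432 dbname=postgres sslmode=require"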
What version of Traefik are you using?
Version: 3.0.0-beta2
Codename: beaufort
Go version: go1.19.4
Built: 2022-12-07T16:32:34Z
OS/Arch: linux/amd64
What is your environment & configuration?
image:
# Version 3 is in pre-release and has the feature we need
# To support the StartTLS part of the PostgreSQL protocol
tag: v3.0.0-beta2
logs:
access:
enabled: true
service:
type: LoadBalancer
annotations:
service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: deregistration_delay.timeout_seconds=200,deregistration_delay.connection_termination.enabled=true,preserve_client_ip.enabled=false,stickiness.enabled=true,stickiness.type=source_ip
service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: /ping
service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "9000"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: HTTP
service.beta.kubernetes.io/aws-load-balancer-internal: "true"
service.beta.kubernetes.io/aws-load-balancer-scheme: internal
service.beta.kubernetes.io/aws-load-balancer-type: nlb-ip
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "2"
service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
additionalArguments:
- "--entryPoints.postgresql.address=:5432/tcp"
ports:
postgresql:
expose: true
port: 5432
exposedPort: 5432
protocol: TCP
deployment:
# These configurations allow for the NLB to drain connections
# An NLB can take up to 180 seconds to stop routing TCP connections
# to a target.
terminationGracePeriodSeconds: 200
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 180"]
resources:
# Will depend on how many connections our clients
# have, the transaction rate, and the reconnect
# frequency.
requests:
cpu: "1"
memory: "1Gi"
limits:
cpu: "4"
memory: "1500Mi"
updateStrategy:
type: RollingUpdate
# This configuration allows for
# minimizing the number of times clients
# need to reconnect to only 1 time if we are
# updating the deployment.
rollingUpdate:
maxUnavailable: 0
maxSurge: 100%
podDisruptionBudget:
enabled: true
minAvailable: 3
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 10
# If multiple metrics are specified in a HorizontalPodAutoscaler,
# this calculation is done for each metric, and then the largest of
# the desired replica counts is chosen.
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
# % of requests, number can
# exceed 100%
averageUtilization: 99
- type: Resource
resource:
name: memory
target:
type: Utilization
# % of requests, number can
# exceed 100%
averageUtilization: 60
behavior:
scaleDown:
# Maximum cooldown for scaling down
# is 1 hour
stabilizationWindowSeconds: 3600
policies:
- type: Percent
# Allow scaling down as many pods as we want
# in a single scale-down, above the minimum count.
# This is the default value.
value: 100
# How long the policy must hold true
# before scaling down.
periodSeconds: 900
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRouteTCP
metadata:
name: org-*****-dev
namespace: org-*****-dev
spec:
entryPoints:
- postgresql
routes:
- match: HostSNI(`*****.data-1.use1.****.com`)
services:
- name: org-*****-dev
port: 5432
tls:
passthrough: true
If applicable, please paste the log output in DEBUG level
No response
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 22 (6 by maintainers)
@sjmiller609 Sure!
At a glance, we are not sure we grasp every subtlety of the approach, especially on the client side (what, in this approach, would make the client consider the connection closed?), or the potential side effects in Traefik itself. Have you already implemented and tested it?
Anyway, we think it is worth going further, and we would gladly welcome a PR for this. We cannot commit to merging it, but we will certainly review it.
Thanks!
@rtribotte's suggestion works!
https://github.com/traefik/traefik/pull/10089#issuecomment-1773515235
Just another case where the connection can hang: when the client tries to use GSSAPI. I just had a situation where DBeaver was working nicely, but the command-line tools (pg_dump, pg_restore, psql) were failing. The problem is that those tools first try to negotiate GSSAPI encryption (even if sslmode=verify-full is set).
Failing command:
The connection will just hang. The bytes sent in this case are
To work around this, we can explicitly disable GSSAPI with
Not sure what the correct Traefik behavior should be here.
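For reference, a sketch of that workaround, assuming it relies on libpq's gssencmode connection parameter; the host and database names are placeholders:
# gssencmode=disable tells libpq to skip the GSSAPI encryption negotiation
# and go straight to the SSLRequest/StartTLS exchange that Traefik handles.
psql "host=*****.data-1.use1.****.com dbname=postgres sslmode=verify-full gssencmode=disable"
# The equivalent libpq environment variable also covers pg_dump and pg_restore:
PGGSSENCMODE=disable pg_dump "host=*****.data-1.use1.****.com dbname=postgres sslmode=verify-full"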
@rtribotte
Also, here is a repository where we can reproduce the issue locally:
https://github.com/sjmiller609/reproduce-issues/tree/main/traefik-psql-ingress
@triethuynhedulog
Hello @sjmiller609,
Thanks for your interest in Traefik and for reporting this!
Could you please share the complete debug log of Traefik during the sequence you described? Also, are you using the allowEmptyServices option? (Please take a look at this documentation.)
To give you more insights, your proposals make sense, but we suspect that the cause of your problem could be linked to the StartTLS feature itself, so we need to investigate.
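For context, a sketch of how that option could be enabled in the Helm values shown above, assuming the Kubernetes CRD provider is in use; check the linked documentation for the exact scope of the flag:
additionalArguments:
  # Keep services defined even when they have no ready endpoints, so the
  # attached routers are not removed while the backing pods restart.
  - "--providers.kubernetescrd.allowEmptyServices=true"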