traefik: Traefik 3.0: StartTLS connection hanging if connection initiated when upstream unavailable
Welcome!
- Yes, I’ve searched similar issues on GitHub and didn’t find any.
- Yes, I’ve searched similar issues on the Traefik community forum and didn’t find any.
What did you do?
Hello, I am using the pre-release Traefik version 3 for this feature: https://github.com/traefik/traefik/pull/9377
It is working great, but we have noticed one possible issue.
If a client establishes a connection to Traefik while the upstream server is unavailable, the connection hangs open. The connection stays in that state and never completes the connection to the upstream, even after the upstream becomes available.
What did you see instead?
Example order of events:
- upstream 1 replica
- pod deleted
- new pod starting
- client sends a request to that workload through a Traefik StartTLS connection (the TCP connection is accepted and remains open, but the connection to the upstream is never completed)
- pod is up and ready
- the TCP connection from the client to Traefik remains open and is never connected to the upstream (the pod that is now ready); an example is sketched below
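For illustration, a sketch of what the client sees during that window; the database name and sslmode are assumptions, and the host mirrors the masked SNI from the IngressRouteTCP further down:
# Run while the upstream pod is restarting: Traefik accepts the TCP
# connection and the StartTLS exchange, but the connection to the upstream
# is never completed, so the command hangs with no error, even after the
# pod becomes Ready.
psql "host=*****.data-1.use1.****.com port=5432 dbname=postgres sslmode=require"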
What version of Traefik are you using?
Version: 3.0.0-beta2
Codename: beaufort
Go version: go1.19.4
Built: 2022-12-07T16:32:34Z
OS/Arch: linux/amd64
What is your environment & configuration?
image:
# Version 3 is in pre-release and has the feature we need
# To support the StartTLS part of the PostgreSQL protocol
tag: v3.0.0-beta2
logs:
access:
enabled: true
service:
type: LoadBalancer
annotations:
service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: deregistration_delay.timeout_seconds=200,deregistration_delay.connection_termination.enabled=true,preserve_client_ip.enabled=false,stickiness.enabled=true,stickiness.type=source_ip
service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: /ping
service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "9000"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: HTTP
service.beta.kubernetes.io/aws-load-balancer-internal: "true"
service.beta.kubernetes.io/aws-load-balancer-scheme: internal
service.beta.kubernetes.io/aws-load-balancer-type: nlb-ip
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "2"
service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
additionalArguments:
- "--entryPoints.postgresql.address=:5432/tcp"
ports:
postgresql:
expose: true
port: 5432
exposedPort: 5432
protocol: TCP
deployment:
# These configurations allow for the NLB to drain connections
# An NLB can take up to 180 seconds to stop routing TCP connections
# to a target.
terminationGracePeriodSeconds: 200
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 180"]
resources:
# Will depend on how many connections our clients
# have, the transaction rate, and the reconnect
# frequency.
requests:
cpu: "1"
memory: "1Gi"
limits:
cpu: "4"
memory: "1500Mi"
updateStrategy:
type: RollingUpdate
# This configuration allows for
# minimizing the number of times clients
# need to reconnect to only 1 time if we are
# updating the deployment.
rollingUpdate:
maxUnavailable: 0
maxSurge: 100%
podDisruptionBudget:
enabled: true
minAvailable: 3
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 10
# If multiple metrics are specified in a HorizontalPodAutoscaler,
# this calculation is done for each metric, and then the largest of
# the desired replica counts is chosen.
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
# % of requests, number can
# exceed 100%
averageUtilization: 99
- type: Resource
resource:
name: memory
target:
type: Utilization
# % of requests, number can
# exceed 100%
averageUtilization: 60
behavior:
scaleDown:
# Maximum cooldown for scaling down
# is 1 hour
stabilizationWindowSeconds: 3600
policies:
- type: Percent
# Allow scaling down as many pods as we want
# in a single scale-down, above the minimum count.
# This is the default value.
value: 100
# How long the policy must hold true
# before scaling down.
periodSeconds: 900
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRouteTCP
metadata:
name: org-*****-dev
namespace: org-*****-dev
spec:
entryPoints:
- postgresql
routes:
- match: HostSNI(`*****.data-1.use1.****.com`)
services:
- name: org-*****-dev
port: 5432
tls:
passthrough: true
If applicable, please paste the log output in DEBUG level
No response
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 22 (6 by maintainers)
@sjmiller609 Sure!
At a glance, we are not sure we grasp every subtlety of the approach, especially on the client side (what, in this approach, would make the client consider the connection closed?), or the potential side effects in Traefik itself. Have you already implemented and tested it?
Anyway, we think it is worth going further, and we would gladly welcome a PR for this. We cannot commit to merging it, but we will certainly review it.
Thanks!
@rtribotte's suggestion works!
https://github.com/traefik/traefik/pull/10089#issuecomment-1773515235
Just another case where the connection can hang: when the client tries to use GSSAPI. I just had a situation where DBeaver was working nicely, but the command-line tools (pg_dump, pg_restore, psql) were failing. The problem is that those tools first try to negotiate GSSAPI encryption (even if sslmode=verify-full is set).
Failing command:
The connection will just hang. The bytes sent in this case are
To work around this, we can explicitly disable GSSAPI with
Not sure what the correct Traefik behavior should be here.
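For reference, a sketch of that workaround, assuming it relies on libpq's gssencmode connection parameter; the host and database names are placeholders:
# gssencmode=disable tells libpq to skip the GSSAPI encryption negotiation
# and go straight to the SSLRequest/StartTLS exchange that Traefik handles.
psql "host=*****.data-1.use1.****.com dbname=postgres sslmode=verify-full gssencmode=disable"
# The equivalent libpq environment variable also covers pg_dump and pg_restore:
PGGSSENCMODE=disable pg_dump "host=*****.data-1.use1.****.com dbname=postgres sslmode=verify-full"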
@rtribotte
Also, here is a repository where we can reproduce the issue locally:
https://github.com/sjmiller609/reproduce-issues/tree/main/traefik-psql-ingress
@triethuynhedulog
Hello @sjmiller609,
Thanks for your interest in Traefik and for reporting this!
Could you please share the complete debug log of Traefik during the sequence you described? Also, are you using the allowEmptyServices option? (Please take a look at this documentation.)
To give you more insights, your proposals make sense, but we suspect that the cause of your problem could be linked to the StartTLS feature itself, so we need to investigate.
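For context, a sketch of how that option could be enabled in the Helm values shown above, assuming the Kubernetes CRD provider is in use; check the linked documentation for the exact scope of the flag:
additionalArguments:
  # Keep services defined even when they have no ready endpoints, so the
  # attached routers are not removed while the backing pods restart.
  - "--providers.kubernetescrd.allowEmptyServices=true"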