linkerd2: Do we handle shutdown incorrectly?
Someone recently brought this thread to my attention. Tim argues that applications must continue to handle new connections/requests after receiving SIGTERM. In short:
- When kubelet decides to terminate a pod, it sends a SIGTERM to its containers.
- These processes must continue to process connections, as peers may not observe the pod’s removal from a service immediately.
- After the pod’s `terminationGracePeriodSeconds`, the kubelet sends SIGKILL to all remaining processes.
Currently, we initiate process shutdown as soon as SIGTERM is received. In the proxy, we do this by refusing new connections/requests while letting existing connections/requests complete. The proxy terminates once all connections are closed.
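For illustration, here’s a minimal Go sketch of that drain-on-SIGTERM pattern (the proxy itself is written in Rust; this is not Linkerd’s actual code): on SIGTERM the server stops accepting new connections, waits up to a deadline for in-flight requests, and then exits.

```go
// Illustrative sketch only, not Linkerd's implementation: the common
// "drain on SIGTERM" pattern, where new work is refused as soon as the
// signal arrives and the process exits once in-flight work completes.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}

	go func() {
		sigs := make(chan os.Signal, 1)
		signal.Notify(sigs, syscall.SIGTERM)
		<-sigs

		// Stop accepting new connections immediately and wait (up to a
		// deadline) for existing requests to finish before exiting.
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()
		if err := srv.Shutdown(ctx); err != nil {
			log.Printf("shutdown: %v", err)
		}
	}()

	if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		log.Fatal(err)
	}
}
```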
This is incorrect!
Instead, we need to… do nothing.
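Concretely, “do nothing” could look like this Go sketch (again purely illustrative): SIGTERM is ignored, the process keeps accepting and serving connections, and it is eventually ended by the kubelet’s SIGKILL, which cannot be caught or ignored.

```go
// Illustrative sketch only: treat SIGTERM as a no-op and keep serving.
// The process ends when the kubelet sends SIGKILL after
// terminationGracePeriodSeconds.
package main

import (
	"log"
	"net/http"
	"os/signal"
	"syscall"
)

func main() {
	// Keep accepting connections after SIGTERM so that peers with stale
	// discovery data don't see connection errors.
	signal.Ignore(syscall.SIGTERM)

	srv := &http.Server{Addr: ":8080"}
	log.Fatal(srv.ListenAndServe())
}
```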
This also applies to all of our controllers (both Go & Rust).
It is especially annoying that linkerd-proxy also rejects new outgoing connections after receiving SIGTERM. This causes errors when the pod still has to process in-flight requests, which often requires opening new connections to external services such as databases or APIs.
Even worse, when running Linkerd at the edge, e.g. on ingress controllers, persistent HTTP connections from upstream proxies (e.g. a WAF) are a major issue. When the pod starts its shutdown procedure, existing TCP connections are not terminated, and a lot of new HTTP requests keep arriving over them until the ingress controller begins closing idle connections. But because linkerd-proxy refuses new connections immediately after SIGTERM, all of those requests that require a new connection to a backend workload will fail. In my opinion, doing nothing on SIGTERM and waiting for SIGKILL would be the best default behavior for the vast majority of use cases; at the very least, linkerd-proxy should not interfere with outgoing connections.
However, just waiting for SIGKILL might not be a good approach when the pod has a very high `terminationGracePeriodSeconds`. For instance, some stateful services like RabbitMQ use a very long grace period to ensure there is enough time to persist state. Usually the service finishes long before SIGKILL would be triggered and shuts down its main container, which stops the pod. If linkerd-proxy waited for SIGKILL, it might keep the pod running for hours. IMHO, the ideal solution would be an optional configuration that instructs linkerd-proxy to shut down once certain containers are no longer running. Argo Workflows has an implementation for watching the main container that works quite well.
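For the record, a rough client-go sketch of that “watch the main container” idea might look like the following; the container name, environment variables, and polling interval are assumptions for illustration, not anything Linkerd exposes today.

```go
// Hypothetical sketch of "shut down once the main container stops",
// similar in spirit to Argo Workflows' approach. Not a Linkerd feature;
// names and intervals are illustrative assumptions.
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// In a real deployment these would come from the downward API.
	ns, pod, mainContainer := os.Getenv("POD_NAMESPACE"), os.Getenv("POD_NAME"), "app"

	for range time.Tick(5 * time.Second) {
		p, err := client.CoreV1().Pods(ns).Get(context.Background(), pod, metav1.GetOptions{})
		if err != nil {
			log.Printf("get pod: %v", err)
			continue
		}
		for _, cs := range p.Status.ContainerStatuses {
			if cs.Name == mainContainer && cs.State.Terminated != nil {
				log.Printf("container %q terminated; shutting down", mainContainer)
				os.Exit(0) // here the proxy would begin its own shutdown
			}
		}
	}
}
```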
Kubernetes 1.28 brings “API awareness of sidecar containers”. Containers marked as sidecars will then get terminated automatically once all non-sidecar containers have shut down.
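For reference, under the 1.28 API a sidecar is declared as an init container with restartPolicy: Always. A sketch using the k8s.io/api/core/v1 types (the field was added in 1.28; image names here are illustrative):

```go
// Sketch of a native-sidecar pod spec using the Kubernetes 1.28 API.
// Image names are illustrative, not the real Linkerd manifests.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	always := corev1.ContainerRestartPolicyAlways
	spec := corev1.PodSpec{
		// An init container with restartPolicy: Always is treated as a
		// sidecar: the kubelet starts it before the app containers and
		// terminates it automatically after they have all exited.
		InitContainers: []corev1.Container{{
			Name:          "linkerd-proxy",
			Image:         "cr.l5d.io/linkerd/proxy:edge", // illustrative tag
			RestartPolicy: &always,
		}},
		Containers: []corev1.Container{{
			Name:  "app",
			Image: "example/app:latest", // illustrative
		}},
	}
	fmt.Printf("%+v\n", spec)
}
```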
I would appreciate it if the shutdown behavior were as configurable as possible. Different use cases warrant different behaviors to optimize shutdown: in some contexts it may make sense to stop accepting connections immediately upon receiving SIGTERM, while in others you may want to continue accepting new connections.
For instance, one case where I’d prefer accepting new connections after SIGTERM is when using the Ingress NGINX Controller. Since it functions as a proxy, my team has observed that NGINX may continue opening upstream connections while draining during shutdown if it has no existing keepalive connection to that upstream service. In order to gracefully process any in-flight requests, it needs the ability to keep opening new connections. We currently use `config.alpha.linkerd.io/proxy-wait-before-exit-seconds` to work around this. Sadly, it is still in alpha and behind a feature gate.
The problem with this approach is that it essentially requires clients to see connection errors. But these errors are entirely avoidable: clients can get discovery updates on their own and gracefully move load. When we refuse connections, we introduce more errors into the system. Your argument is that these errors are inherently better than the errors that might be encountered if connections are accepted and later terminated, but I don’t agree with that assertion.
If an operator configures `terminationGracePeriodSeconds: 30` so that their clients have 30 seconds to gracefully move load, then I think it’s reasonable to honor that configuration so that clients can properly respond to discovery updates and all changes can be driven safely, without any interruption. I don’t think it’s appropriate for Linkerd’s behavior to be less graceful than the application’s: if the application continues to accept connections, so should we.
In my view, the only benefit of your proposed approach is that it would allow pods to terminate more quickly without potentially running idle during the termination grace period. I.e., if service discovery propagates within 3 seconds, we could spend 27 seconds doing nothing.