linkerd2: Do we handle shutdown incorrectly?
Someone recently brought this thread to my attention. Tim argues that applications must continue to handle new connections/requests after receiving SIGTERM. In short:
- When kubelet decides to terminate a pod, it sends a SIGTERM to its containers.
- These processes must continue to process connections, as peers may not observe the pod’s removal from a service immediately.
- After the pod’s `terminationGracePeriodSeconds`, the kubelet sends SIGKILL to all remaining processes.
Currently, we initiate process shutdown as soon as SIGTERM is received. In the proxy, we do this by refusing new connections/requests while letting existing connections/requests complete. The proxy terminates once all connections are closed.
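For illustration, here’s a minimal Go sketch of that drain-on-SIGTERM pattern (the proxy itself is written in Rust; this is not Linkerd’s actual code): on SIGTERM the server stops accepting new connections, waits up to a deadline for in-flight requests, and then exits.

```go
// Illustrative sketch only, not Linkerd's implementation: the common
// "drain on SIGTERM" pattern, where new work is refused as soon as the
// signal arrives and the process exits once in-flight work completes.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}

	go func() {
		sigs := make(chan os.Signal, 1)
		signal.Notify(sigs, syscall.SIGTERM)
		<-sigs

		// Stop accepting new connections immediately and wait (up to a
		// deadline) for existing requests to finish before exiting.
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()
		if err := srv.Shutdown(ctx); err != nil {
			log.Printf("shutdown: %v", err)
		}
	}()

	if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		log.Fatal(err)
	}
}
```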
This is incorrect!
Instead, we need to… do nothing.
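Concretely, “do nothing” could look like this Go sketch (again purely illustrative): SIGTERM is ignored, the process keeps accepting and serving connections, and it is eventually ended by the kubelet’s SIGKILL, which cannot be caught or ignored.

```go
// Illustrative sketch only: treat SIGTERM as a no-op and keep serving.
// The process ends when the kubelet sends SIGKILL after
// terminationGracePeriodSeconds.
package main

import (
	"log"
	"net/http"
	"os/signal"
	"syscall"
)

func main() {
	// Keep accepting connections after SIGTERM so that peers with stale
	// discovery data don't see connection errors.
	signal.Ignore(syscall.SIGTERM)

	srv := &http.Server{Addr: ":8080"}
	log.Fatal(srv.ListenAndServe())
}
```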
This also applies to all of our controllers (both Go & Rust).
It is especially annoying that linkerd-proxy also rejects new outgoing connections after receiving SIGTERM. This causes errors when the pod still has to process in-flight requests, which often requires opening new connections to external services such as databases or APIs.
Even worse, when running Linkerd at the edge, e.g. on ingress controllers, persistent HTTP connections from upstream proxies (e.g. a WAF) are a major issue. When the pod starts its shutdown procedure, existing TCP connections are not terminated, and a lot of new HTTP requests keep arriving over them until the ingress controller begins closing idle connections. But because linkerd-proxy refuses new connections immediately after SIGTERM, all of those requests that require a new connection to a backend workload will fail. In my opinion, doing nothing on SIGTERM and waiting for SIGKILL would be the best default behavior for the vast majority of use cases; at the very least, linkerd-proxy should not interfere with outgoing connections.
However, just waiting for SIGKILL might not be a good approach when the pod has a very high `terminationGracePeriodSeconds`. For instance, some stateful services like RabbitMQ use a very long grace period to ensure there is enough time to persist state. Usually the service finishes long before SIGKILL would be triggered and shuts down its main container, which stops the pod. If linkerd-proxy waited for SIGKILL, it might keep the pod running for hours. IMHO, the ideal solution would be an optional configuration that instructs linkerd-proxy to shut down once certain containers are no longer running. Argo Workflows has an implementation for watching the main container that works quite well.
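For the record, a rough client-go sketch of that “watch the main container” idea might look like the following; the container name, environment variables, and polling interval are assumptions for illustration, not anything Linkerd exposes today.

```go
// Hypothetical sketch of "shut down once the main container stops",
// similar in spirit to Argo Workflows' approach. Not a Linkerd feature;
// names and intervals are illustrative assumptions.
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// In a real deployment these would come from the downward API.
	ns, pod, mainContainer := os.Getenv("POD_NAMESPACE"), os.Getenv("POD_NAME"), "app"

	for range time.Tick(5 * time.Second) {
		p, err := client.CoreV1().Pods(ns).Get(context.Background(), pod, metav1.GetOptions{})
		if err != nil {
			log.Printf("get pod: %v", err)
			continue
		}
		for _, cs := range p.Status.ContainerStatuses {
			if cs.Name == mainContainer && cs.State.Terminated != nil {
				log.Printf("container %q terminated; shutting down", mainContainer)
				os.Exit(0) // here the proxy would begin its own shutdown
			}
		}
	}
}
```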
Kubernetes 1.28 brings “API awareness of sidecar containers”. Containers marked as sidecars will then get terminated automatically once all non-sidecar containers have shut down.
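For reference, under the 1.28 API a sidecar is declared as an init container with restartPolicy: Always. A sketch using the k8s.io/api/core/v1 types (the field was added in 1.28; image names here are illustrative):

```go
// Sketch of a native-sidecar pod spec using the Kubernetes 1.28 API.
// Image names are illustrative, not the real Linkerd manifests.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	always := corev1.ContainerRestartPolicyAlways
	spec := corev1.PodSpec{
		// An init container with restartPolicy: Always is treated as a
		// sidecar: the kubelet starts it before the app containers and
		// terminates it automatically after they have all exited.
		InitContainers: []corev1.Container{{
			Name:          "linkerd-proxy",
			Image:         "cr.l5d.io/linkerd/proxy:edge", // illustrative tag
			RestartPolicy: &always,
		}},
		Containers: []corev1.Container{{
			Name:  "app",
			Image: "example/app:latest", // illustrative
		}},
	}
	fmt.Printf("%+v\n", spec)
}
```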
I would appreciate it if the shutdown behavior were as configurable as possible. Different use cases warrant different behaviors to optimize shutdown: in some contexts it may make sense to stop accepting connections immediately upon receiving SIGTERM, while in others you may want to continue accepting new connections.
For instance, one case where I’d prefer accepting new connections after SIGTERM is when using the Ingress NGINX Controller. Since it functions as a proxy, my team has observed that NGINX may continue opening upstream connections while draining during shutdown if it has no existing keepalive connection to that upstream service. In order to gracefully process any in-flight requests, it needs the ability to keep opening new connections. We currently use `config.alpha.linkerd.io/proxy-wait-before-exit-seconds` to work around this. Sadly, it is still in alpha and behind a feature gate.
The problem with this approach is that it essentially requires clients to see connection errors. But these errors are entirely avoidable: clients can get discovery updates on their own and gracefully move load. When we refuse connections, we introduce more errors into the system. Your argument is that these errors are inherently better than the errors that might be encountered if connections are accepted and later terminated, but I don’t agree with that assertion.
If an operator configures `terminationGracePeriodSeconds: 30` so that their clients have 30 seconds to gracefully move load, then I think it’s reasonable to honor that configuration so that clients can properly respond to discovery updates and all changes can be driven safely, without any interruption. I don’t think it’s appropriate for Linkerd’s behavior to be less graceful than the application’s: if the application continues to accept connections, so should we.
In my view, the only benefit of your proposed approach is that it would allow pods to terminate more quickly without potentially running idle during the termination grace period. I.e., if service discovery propagates within 3 seconds, we could spend 27 seconds doing nothing.