kubernetes: Pod loses network connection (connection refused errors) during graceful shutdown period
### What happened?
Hi,
Our backend infrastructure uses Kubernetes pods to perform analysis of data. As the flow of data increases and decreases through the day, the Kubernetes autoscaler scales the pods up and down.
The analysis operation on a single workload can take up to six minutes. To enable successful completion of in-flight scan operations, the pods
- have a terminationGracePeriodSeconds set to 360 seconds
- capture the SIGTERM event and prevent the pod from accepting new requests
```java
SpringApplication springApplication = new SpringApplication(ProductApplication.class);
springApplication.addListeners(new GracefulShutdownListener());

// GracefulShutdownListener -> onApplicationEvent(ContextClosedEvent event)
// 1. Prevent new requests from being sent
// 2. Sleep for 6 minutes
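```

For reference, a minimal sketch of what such a listener might look like (the `acceptingRequests` flag and the listener internals below are assumptions for illustration, not our exact production code):

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicBoolean;
import org.springframework.context.ApplicationListener;
import org.springframework.context.event.ContextClosedEvent;

public class GracefulShutdownListener implements ApplicationListener<ContextClosedEvent> {

    // Hypothetical flag consulted by the request-handling path before accepting new work
    private final AtomicBoolean acceptingRequests = new AtomicBoolean(true);

    @Override
    public void onApplicationEvent(ContextClosedEvent event) {
        // 1. Stop accepting new requests once SIGTERM triggers the context close
        acceptingRequests.set(false);
        try {
            // 2. Hold the shutdown for 6 minutes so in-flight analysis can complete
            //    (matching terminationGracePeriodSeconds = 360)
            Thread.sleep(Duration.ofMinutes(6).toMillis());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```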
The issue is that though the pods wait for 6 minutes to complete in-flight operations, these operations fail in outbound network communication. The pod gets a Connection Refused error when communicating with SNS during the graceful shutdown phase.
Investigation reveals that pods are failing in external network communication with multiple applications during the ‘graceful shutdown’ phase.
I have gone through multiple tickets/documents in this area:
- https://github.com/kubernetes/kubernetes/issues/44956
- https://github.com/ardanlabs/service/issues/189
- https://freecontent.manning.com/handling-client-requests-properly-with-kubernetes/
One comment in the same area that seems relevant: https://github.com/kubernetes/kubernetes/issues/86280#issuecomment-583173036
I am raising this ticket as we require the ability to create new external connections (SNS) during graceful shutdown.
Primarily, I see in the document https://freecontent.manning.com/handling-client-requests-properly-with-kubernetes/ that a "B flow" is started on pod deletion, which removes the pod from iptables. It seems to me that this may result in the pod not being able to create new network connections. If this is the case, we would need a mechanism to delay the B flow until the grace period is over.
I am not sure if sleeping in the preStop hook is the recommended mechanism to prevent pods from losing network connectivity during the graceful shutdown phase.
### What did you expect to happen?
Any Kubernetes pod, during its graceful shutdown period, should have unrestricted access to required resources (network) and the ability to create new connections. This ability need not be the default; it could be enabled via configuration.
### How can we reproduce it (as minimally and precisely as possible)?
- Start a pod that perpetually creates new connections to an external entity and updates it (a minimal client loop is sketched below)
- Terminate the pod manually
Connection refused errors should be seen in attempts to create new connections. A connection refused error may also be seen for the readiness probe.
We have not found this to always be reproducible in our test clusters; however, it can be seen very frequently in our clusters serving continuous data.
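A minimal sketch of the kind of client loop we use to reproduce this (the endpoint URL is a placeholder; a new `HttpClient` is created per iteration to force fresh connections rather than reusing pooled ones):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConnectionLoop {
    public static void main(String[] args) throws InterruptedException {
        // Placeholder endpoint standing in for the external entity (e.g. SNS)
        URI target = URI.create("https://example.com/update");
        while (true) {
            try {
                // New client each time so every iteration opens a new TCP connection
                HttpClient client = HttpClient.newHttpClient();
                HttpRequest request = HttpRequest.newBuilder(target).GET().build();
                HttpResponse<Void> response =
                        client.send(request, HttpResponse.BodyHandlers.discarding());
                System.out.println("status=" + response.statusCode());
            } catch (Exception e) {
                // During graceful shutdown this is where "connection refused" shows up
                System.out.println("request failed: " + e);
            }
            Thread.sleep(5_000);
        }
    }
}
```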
### Anything else we need to know?
The evidence we can see indicating that the pod is losing the ability to communicate with external components:
- Connection refused errors to multiple components during the graceful shutdown phase.
- Pod events display a connection refused event for the readiness probe a minute after shutdown:
```
38m Normal Killing pod/scan-434f242-fr23e Stopping container xyz
38m Normal Killing pod/scan-434f242-fr23e Stopping container pqr
38m Normal Killing pod/scan-434f242-fr23e Stopping container abc
38m Warning FailedPreStopHook pod/s434f242-fr23e Exec lifecycle hook ([]) for Container "abc" in Pod "scan-434f242-fr23e_dss(19451a4f-2w32-4221-we32-4a3b0b169a7c)" failed - error: command '' exited with 126: , message: "OCI runtime exec failed: exec failed: container_linux.go:370: starting container process caused: exec: \"\": executable file not found in $PATH: unknown\r\n"
37m Warning Unhealthy pod/scan-434f242-fr23e Readiness probe failed: Get http://10.111.2.71:8080/actuator/health/readiness: dial tcp 10.203.9.71:8080: connect: connection refused
33m Warning Unhealthy pod/scan-434f242-fr23e Readiness probe failed: Get http://10.111.2.71:4191/ready: dial tcp 10.203.9.71:4191: connect: connection refused
38m Warning Unhealthy pod/scan-434f242-fr23e Liveness probe failed: Get http://10.112.2.71:4191/live: dial tcp 10.111.2.71:4191: connect: connection refused
38m Warning Unhealthy pod/scan-434f242-fr23e Liveness probe failed: Get http://10.111.2.71:8080/actuator/health/liveness: dial tcp 10.203.9.71:8080: connect: connection refused
```
This does not always reproduce.
### Kubernetes version
<details>
```console
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-13T13:23:52Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.17-eks-087e67", GitCommit:"087e67e479962798594218dc6d99923f410c145e", GitTreeState:"clean", BuildDate:"2021-07-31T01:39:55Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
```

</details>
### Cloud provider
### OS version
```console
# cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
```
### Install tools
### Container runtime (CRI) and version (if applicable)
### Related plugins (CNI, CSI, …) and versions (if applicable)
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 3
- Comments: 22 (10 by maintainers)
Hi @nilesh-telang, we’ve resolved the issue on our cluster. We’re using Calico CNI, which had a bug in it. After upgrading our Calico version, our terminating pods retained their network connectivity, though I see you’ve already ruled this out earlier.
Hi @matthewbyrne,
Thanks for sharing these details.
Hi @thockin, at our end I think we may have found the root cause of our issue. Our production systems use linkerd, and we found that the graceful shutdown of linkerd may be causing the pod to lose connectivity to external components.
The linkerd shutdown documentation mentions the following - https://linkerd.io/2.10/tasks/graceful-shutdown/
We have made changes in production to delay the shutdown of the linkerd container so that it stays alive for the duration of the main container’s graceful shutdown period. Based on the documentation, I believe this should address the issue.
I will update and mark this ticket resolved once the verification of the fix is done.
Thank you all for your help and inputs on this issue.
@nilesh-telang what CNI/network plugin are you using in the cluster?
Also, kubelet and docker logs showing the time sequence of when the container sandbox stop request begins and when it finally ends might help. There are also CRI operation timeouts that could be in play here, but kubelet and docker logs will help figure that out.
Thank you @aojea again for the clarification and the prompt response.
I think your last response clarifies the root cause of what we are observing. You mentioned that a pod's endpoints are removed as soon as the pod disappears from the Endpoints object, and that the behavior is that “existing” TCP connections to the Service are not cleared.
I believe this explains what we are observing. During the pod shutdown, the pod is removed from the Endpoints object. At that point the pod does not have an active TCP connection with SNS, so the connection with SNS is not retained. The fact that the pod is removed from the endpoints may be the reason the pod is unable to make further connections to SNS.
Retaining the connection with SNS as a keep-alive TCP connection may be the required route for enabling SNS communication during the shutdown phase. I have also added code to create a new SNS client when the failure is encountered, to verify whether creating a new connection via a new client addresses the issue (sketched below). I will be running the tests in the coming few days and will update the ticket with the results. This will verify the ability to create a new SNS connection during the grace period.
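For illustration, a minimal sketch of the retry-with-a-fresh-client approach, assuming the AWS SDK for Java v2 (the region, topic ARN, and keep-alive settings below are placeholders/assumptions, not our exact configuration):

```java
import java.time.Duration;
import software.amazon.awssdk.core.exception.SdkException;
import software.amazon.awssdk.http.apache.ApacheHttpClient;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.sns.SnsClient;
import software.amazon.awssdk.services.sns.model.PublishRequest;

public class SnsPublisher {

    // Client configured with TCP keep-alive so an idle connection is more likely to survive
    private SnsClient snsClient = newClient();

    private static SnsClient newClient() {
        return SnsClient.builder()
                .region(Region.US_EAST_1) // placeholder region
                .httpClientBuilder(ApacheHttpClient.builder()
                        .tcpKeepAlive(true)                           // keep idle connections alive
                        .connectionTimeToLive(Duration.ofMinutes(10)))
                .build();
    }

    public void publish(String topicArn, String message) {
        PublishRequest request = PublishRequest.builder()
                .topicArn(topicArn)   // placeholder: real topic ARN supplied by the caller
                .message(message)
                .build();
        try {
            snsClient.publish(request);
        } catch (SdkException e) {
            // On failure during the grace period, rebuild the client and retry once,
            // to verify whether a brand-new connection can still be established
            snsClient = newClient();
            snsClient.publish(request);
        }
    }
}
```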