kubernetes: Websocket connection failures
What happened: We are running several services on Azure Kubernetes Service (AKS) that accept long-running websocket connections. These services are exposed externally through an Azure load balancer and handle many concurrent connections from clients. During a connection, the client sends a fairly constant stream of data packets every 50-100 ms, each containing a few KB of data. The service returns data packets every few seconds, each containing a few hundred bytes.
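For reference, here is a minimal sketch of a client that approximates this traffic pattern, assuming gorilla/websocket; the endpoint URL, payload size, and send interval are illustrative placeholders, not our actual service:

```go
// Illustrative only: a client that approximates the traffic pattern above
// (a ~2 KB binary message every 50-100 ms, small replies read concurrently).
// The endpoint URL and payload size are placeholders, not our real service.
package main

import (
	"log"
	"math/rand"
	"time"

	"github.com/gorilla/websocket"
)

func main() {
	conn, _, err := websocket.DefaultDialer.Dial("wss://example.invalid/stream", nil)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	// Reader: the service replies every few seconds with a few hundred bytes.
	go func() {
		for {
			_, msg, err := conn.ReadMessage()
			if err != nil {
				log.Printf("read: %v", err)
				return
			}
			log.Printf("received %d bytes", len(msg))
		}
	}()

	// Writer: a few KB upstream every 50-100 ms, as in production.
	payload := make([]byte, 2048)
	for {
		if err := conn.WriteMessage(websocket.BinaryMessage, payload); err != nil {
			log.Printf("write: %v", err)
			return
		}
		time.Sleep(time.Duration(50+rand.Intn(51)) * time.Millisecond)
	}
}
```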
At some point during the connection, usually after 30-40 minutes, the websocket connection gets into a bad state with the following symptoms:
- The client sends data packets to the service, but the service never receives them. No error is reported on either side.
- When either the client or the service attempts to close the websocket connection, the close handshake never completes and the other side does not detect the disconnect (see the detection sketch after this list).
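Because no error ever surfaces, the only way we have found to notice the dead connection from inside the application is an explicit ping/pong liveness check. Below is a minimal sketch of that kind of check, again assuming gorilla/websocket; the intervals are illustrative:

```go
// Liveness check sketch: a healthy peer answers each ping control frame with a
// pong. If no pong (or data) arrives before pongWait expires, the pending read
// fails, exposing the half-dead state described above even though no I/O error
// was ever reported. Intervals and the endpoint URL are illustrative.
package main

import (
	"log"
	"time"

	"github.com/gorilla/websocket"
)

const (
	pingInterval = 20 * time.Second
	pongWait     = 30 * time.Second
)

func main() {
	conn, _, err := websocket.DefaultDialer.Dial("wss://example.invalid/stream", nil)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	// Every pong (or any incoming frame) must arrive before the read deadline,
	// otherwise the pending ReadMessage fails and the dead connection is exposed.
	conn.SetReadDeadline(time.Now().Add(pongWait))
	conn.SetPongHandler(func(string) error {
		return conn.SetReadDeadline(time.Now().Add(pongWait))
	})

	done := make(chan struct{})
	go func() {
		// Normal read loop; gorilla/websocket only processes pong control frames
		// while the application is reading, so this must keep running.
		defer close(done)
		for {
			if _, _, err := conn.ReadMessage(); err != nil {
				log.Printf("read: %v", err)
				return
			}
		}
	}()

	// Ping loop: on a half-dead connection the ping is silently dropped, no pong
	// comes back, and the read loop errors out once pongWait expires.
	ticker := time.NewTicker(pingInterval)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			deadline := time.Now().Add(5 * time.Second)
			if err := conn.WriteControl(websocket.PingMessage, nil, deadline); err != nil {
				log.Printf("ping: %v", err)
				return
			}
		}
	}
}
```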
We observed these symptoms in at least two different clusters and with two different services.
- In one cluster we use Contour as the ingress controller, and changing the external traffic policy (externalTrafficPolicy) of the Contour service from Cluster to Local solved the issue.
- We have another cluster where setting the external traffic policy to Local did not solve the problem. In this cluster Contour is not being used.
Is this a known issue? Since changing the external traffic policy affects the problem in at least some cases, the issue appears to be related to the extra hop that is added when traffic initially lands on a node that is not running a pod for the service. We found this article that describes the difference between the Cluster and Local external traffic policies: https://www.asykim.com/blog/deep-dive-into-kubernetes-external-traffic-policies
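The change itself is just setting externalTrafficPolicy on the Contour LoadBalancer Service. For illustration only, a client-go sketch of that change might look like the following; the namespace and service name are placeholders and a recent client-go with the context-based API is assumed:

```go
// Hypothetical sketch: set externalTrafficPolicy=Local on the Contour/Envoy
// LoadBalancer Service so external traffic is only delivered to nodes that
// host a backend pod, skipping the extra kube-proxy hop described above.
// Namespace and service name are placeholders; assumes a recent client-go.
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatalf("load kubeconfig: %v", err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("build clientset: %v", err)
	}

	svcs := cs.CoreV1().Services("projectcontour")                      // placeholder namespace
	svc, err := svcs.Get(context.TODO(), "envoy", metav1.GetOptions{}) // placeholder name
	if err != nil {
		log.Fatalf("get service: %v", err)
	}
	svc.Spec.ExternalTrafficPolicy = corev1.ServiceExternalTrafficPolicyTypeLocal
	if _, err := svcs.Update(context.TODO(), svc, metav1.UpdateOptions{}); err != nil {
		log.Fatalf("update service: %v", err)
	}
	log.Printf("externalTrafficPolicy for %s/%s set to Local", svc.Namespace, svc.Name)
}
```

Equivalently, kubectl patch svc envoy -n projectcontour -p '{"spec":{"externalTrafficPolicy":"Local"}}' applies the same change (again, with placeholder names).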
What you expected to happen: Long-running websocket connections should stay in a connected state.
How to reproduce it (as minimally and precisely as possible): There is no easy way to reproduce this. The websocket failures seem to happen more often when the service is handling a higher number of concurrent connections.
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version): 1.14.1
- Cloud provider or hardware configuration: Azure
- OS (e.g. cat /etc/os-release): Ubuntu 16.04.6 LTS (Xenial Xerus)
- Kernel (e.g. uname -a): 4.15.0-1057-azure #62-Ubuntu SMP Thu Sep 5 18:25:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- Install tools:
- Network plugin and version (if this is a network-related bug): azure-vnet CNI 1.0.22
- Others:
About this issue
- State: closed
- Created 5 years ago
- Reactions: 1
- Comments: 18 (14 by maintainers)
@danwinship: The label(s) platform/azure cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.