kubernetes: Websocket connection failures
What happened: We are running several services on Azure Kubernetes Service (AKS) that accept long-running websocket connections. These services are exposed externally through an Azure load balancer and handle many concurrent connections from clients. During a connection, the client sends a fairly constant stream of data packets every 50-100 ms, each containing a few KB of data. The service returns data packets every few seconds, each containing a few hundred bytes.
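For reference, here is a minimal sketch of a client that approximates this traffic pattern, assuming gorilla/websocket; the endpoint URL, payload size, and send interval are illustrative placeholders, not our actual service:

```go
// Illustrative only: a client that approximates the traffic pattern above
// (a ~2 KB binary message every 50-100 ms, small replies read concurrently).
// The endpoint URL and payload size are placeholders, not our real service.
package main

import (
	"log"
	"math/rand"
	"time"

	"github.com/gorilla/websocket"
)

func main() {
	conn, _, err := websocket.DefaultDialer.Dial("wss://example.invalid/stream", nil)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	// Reader: the service replies every few seconds with a few hundred bytes.
	go func() {
		for {
			_, msg, err := conn.ReadMessage()
			if err != nil {
				log.Printf("read: %v", err)
				return
			}
			log.Printf("received %d bytes", len(msg))
		}
	}()

	// Writer: a few KB upstream every 50-100 ms, as in production.
	payload := make([]byte, 2048)
	for {
		if err := conn.WriteMessage(websocket.BinaryMessage, payload); err != nil {
			log.Printf("write: %v", err)
			return
		}
		time.Sleep(time.Duration(50+rand.Intn(51)) * time.Millisecond)
	}
}
```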
At some point during the connection, usually after 30-40 minutes, the websocket connection gets into a bad state with the following symptoms:
- The client sends data packets to the service, but the service never receives them. No error is reported on either side.
- When either the client or the service attempts to close the websocket connection, the close handshake never completes and the other side does not detect the disconnect (see the detection sketch after this list).
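Because no error ever surfaces, the only way we have found to notice the dead connection from inside the application is an explicit ping/pong liveness check. Below is a minimal sketch of that kind of check, again assuming gorilla/websocket; the intervals are illustrative:

```go
// Liveness check sketch: a healthy peer answers each ping control frame with a
// pong. If no pong (or data) arrives before pongWait expires, the pending read
// fails, exposing the half-dead state described above even though no I/O error
// was ever reported. Intervals and the endpoint URL are illustrative.
package main

import (
	"log"
	"time"

	"github.com/gorilla/websocket"
)

const (
	pingInterval = 20 * time.Second
	pongWait     = 30 * time.Second
)

func main() {
	conn, _, err := websocket.DefaultDialer.Dial("wss://example.invalid/stream", nil)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	// Every pong (or any incoming frame) must arrive before the read deadline,
	// otherwise the pending ReadMessage fails and the dead connection is exposed.
	conn.SetReadDeadline(time.Now().Add(pongWait))
	conn.SetPongHandler(func(string) error {
		return conn.SetReadDeadline(time.Now().Add(pongWait))
	})

	done := make(chan struct{})
	go func() {
		// Normal read loop; gorilla/websocket only processes pong control frames
		// while the application is reading, so this must keep running.
		defer close(done)
		for {
			if _, _, err := conn.ReadMessage(); err != nil {
				log.Printf("read: %v", err)
				return
			}
		}
	}()

	// Ping loop: on a half-dead connection the ping is silently dropped, no pong
	// comes back, and the read loop errors out once pongWait expires.
	ticker := time.NewTicker(pingInterval)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			deadline := time.Now().Add(5 * time.Second)
			if err := conn.WriteControl(websocket.PingMessage, nil, deadline); err != nil {
				log.Printf("ping: %v", err)
				return
			}
		}
	}
}
```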
We observed these symptoms in at least two different clusters and with two different services.
- In one cluster we use Contour as the ingress controller, and changing the external traffic policy (externalTrafficPolicy) of the Contour service from Cluster to Local solved the issue.
- We have another cluster where setting the external traffic policy to Local did not solve the problem. In this cluster Contour is not being used.
Is this a known issue? Since changing the external traffic policy affects the problem in at least some cases, the issue appears to be related to the extra hop that is added when traffic initially lands on a node that is not running a pod for the service. We found this article that describes the difference between the Cluster and Local external traffic policies: https://www.asykim.com/blog/deep-dive-into-kubernetes-external-traffic-policies
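The change itself is just setting externalTrafficPolicy on the Contour LoadBalancer Service. For illustration only, a client-go sketch of that change might look like the following; the namespace and service name are placeholders and a recent client-go with the context-based API is assumed:

```go
// Hypothetical sketch: set externalTrafficPolicy=Local on the Contour/Envoy
// LoadBalancer Service so external traffic is only delivered to nodes that
// host a backend pod, skipping the extra kube-proxy hop described above.
// Namespace and service name are placeholders; assumes a recent client-go.
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatalf("load kubeconfig: %v", err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("build clientset: %v", err)
	}

	svcs := cs.CoreV1().Services("projectcontour")                      // placeholder namespace
	svc, err := svcs.Get(context.TODO(), "envoy", metav1.GetOptions{}) // placeholder name
	if err != nil {
		log.Fatalf("get service: %v", err)
	}
	svc.Spec.ExternalTrafficPolicy = corev1.ServiceExternalTrafficPolicyTypeLocal
	if _, err := svcs.Update(context.TODO(), svc, metav1.UpdateOptions{}); err != nil {
		log.Fatalf("update service: %v", err)
	}
	log.Printf("externalTrafficPolicy for %s/%s set to Local", svc.Namespace, svc.Name)
}
```

Equivalently, kubectl patch svc envoy -n projectcontour -p '{"spec":{"externalTrafficPolicy":"Local"}}' applies the same change (again, with placeholder names).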
What you expected to happen: Long-running websocket connections should stay in a connected state.
How to reproduce it (as minimally and precisely as possible): There is no easy way to reproduce this. The websocket failures seem to happen more often when the service is handling a higher number of concurrent connections.
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version): 1.14.1
- Cloud provider or hardware configuration: Azure
- OS (e.g. cat /etc/os-release): Ubuntu 16.04.6 LTS (Xenial Xerus)
- Kernel (e.g. uname -a): 4.15.0-1057-azure #62-Ubuntu SMP Thu Sep 5 18:25:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- Install tools:
- Network plugin and version (if this is a network-related bug): azure-vnet CNI 1.0.22
- Others:
About this issue
- State: closed
- Created 5 years ago
- Reactions: 1
- Comments: 18 (14 by maintainers)
@danwinship: The label(s) platform/azure cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.