rancher: Sporadic driver error during K8s upgrade: listen tcp 127.0.0.1:0: bind: address already in use

What kind of request is this (question/bug/enhancement/feature request): bug

Steps to reproduce (least amount of steps as possible):

Very similar issue to https://github.com/rancher/rancher/issues/19742

On two occasions now I’ve encountered separate customer issues where, during a K8s upgrade through the Rancher UI, the rolling upgrade stalls with this error:

Error starting driver: failed retrieving port for driver: listen tcp 127.0.0.1:0: bind: address already in use

Getting stuck here: https://github.com/rancher/kontainer-engine/blob/65780a8839e471d3842580d5d9075ae9a3b70a02/service/service.go#L400

This issue causes the cluster controller inside Rancher to try to bind to a port that is already in use. Some customers report that it eventually self-heals and the upgrade completes. The quick fix is to cycle the Rancher pods.
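Port 0 in the error means the driver asked the kernel for any free ephemeral port, so a failure here suggests the port range is under pressure rather than one specific port being taken. A rough way to gauge that on the Rancher host is sketched below. It assumes a Linux host with /proc and netstat available; ephemeral_port_count and port_pressure are hypothetical helper names, and the socket count is only a coarse proxy for ports in use.

```shell
# Size of the ephemeral port range, given "LOW HIGH" on stdin
# (the format of /proc/sys/net/ipv4/ip_local_port_range on Linux).
ephemeral_port_count() {
    read -r low high
    echo $((high - low + 1))
}

# Hypothetical helper: print the number of TCP sockets next to the range size.
# Assumption: Linux host with /proc and netstat installed.
port_pressure() {
    total=$(ephemeral_port_count < /proc/sys/net/ipv4/ip_local_port_range)
    used=$(netstat -nat 2>/dev/null | grep -c '^tcp' || true)
    echo "tcp sockets: $used, ephemeral ports in range: $total"
}
```

Running port_pressure inside the Rancher container while the upgrade is stalled gives a quick indication of whether the socket count is approaching the size of the range.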

It’s sporadic in that it doesn’t appear to happen with each downstream cluster. The commonality I’ve seen with customers is this:

Custom cluster
Ubuntu 18.04
K8s version < 1.18.x

Result:

Other details that may be helpful:

Environment information

Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): v2.4.5
Installation option (single install/HA): HA

Cluster information

Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Custom
Machine type (cloud/VM/metal) and specifications (CPU/memory): cloud and VMware
Kubernetes version (use kubectl version):

v1.17.6 --> v1.18.6

Docker version (use docker version):

(paste the output here)

gz#11024

gz#12126

About this issue

State: closed
Created 4 years ago
Comments: 21 (16 by maintainers)

Most upvoted comments

PRs merged into 2.5 and master

rawmind0 on Feb 3, 2021

Confirmed @Oats87's diagnosis.

The telemetry client neither releases nor reuses its connections to the Rancher API on each report() execution, which runs every 6h by default. Depending on the size of the system, the telemetry client opens many connections: 1 for the Rancher API, 1 for every cluster, 1 for every project/cluster,… The number of connections keeps growing until they are exhausted, producing listen tcp 127.0.0.1:0: bind: address already in use errors.
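To get a feel for how quickly this adds up, here is a back-of-the-envelope sketch. It counts only the connection types listed above (so it is a lower bound), reads "1 every project/cluster" as one connection per project, and assumes the default of one report every 6h, i.e. 4 per day. The function names and the example figures are hypothetical.

```shell
# Lower-bound estimate of connections opened by one report() run:
# 1 for the Rancher API + 1 per cluster + 1 per project (anything else is ignored).
conns_per_report() {
    clusters=$1
    projects=$2
    echo $((1 + clusters + projects))
}

# Connections left open after a number of days, at 4 reports per day,
# assuming none of them are ever released.
leaked_after_days() {
    per_report=$(conns_per_report "$1" "$2")
    echo $((per_report * 4 * $3))
}
```

For a hypothetical setup with 10 clusters and 30 projects, conns_per_report 10 30 gives 41, and leaked_after_days 10 30 7 gives 1148 connections after a week.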

As a workaround, the telemetry client can be killed inside the Rancher Docker container/pod to release the connections. Rancher should restart the telemetry client automatically.

Submitted PR https://github.com/rancher/telemetry/pull/47 to address the issue

To test it:

Access the Rancher Docker container/pod.
Force a telemetry report() execution: curl -X POST http://localhost:8114/v1-telemetry/report
Check the number of telemetry connections: netstat -natup | grep telemetry | wc -l
Force a second telemetry report() execution: curl -X POST http://localhost:8114/v1-telemetry/report
Check the number of telemetry connections again: netstat -natup | grep telemetry | wc -l

Telemetry connections should grow on the current version, but shouldn’t grow on the patched version.
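The steps above can be wrapped in a small script. This is only a sketch: count_telemetry_conns and check_telemetry_leak are hypothetical helper names, the REPORT_URL override is an assumption, and it presumes curl and netstat are available inside the Rancher container.

```shell
# Count connections owned by the telemetry process.
# Reads `netstat -natup`-style output on stdin so it can be checked offline.
count_telemetry_conns() {
    grep -c telemetry || true
}

# Hypothetical helper: trigger report() twice and print the connection
# count after each run. REPORT_URL defaults to the endpoint quoted above.
check_telemetry_leak() {
    url="${REPORT_URL:-http://localhost:8114/v1-telemetry/report}"
    for run in 1 2; do
        curl -s -X POST "$url" >/dev/null
        echo "after report $run: $(netstat -natup 2>/dev/null | count_telemetry_conns)"
    done
}
```

Calling check_telemetry_leak inside the container prints two counts. If the second is clearly higher than the first, the connections are leaking; on a patched version the two should be about the same.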

rawmind0 on Feb 1, 2021