rancher: Sporadic driver error during K8s upgrade: listen tcp 127.0.0.1:0: bind: address already in use

What kind of request is this (question/bug/enhancement/feature request): bug

Steps to reproduce (least amount of steps as possible):

Very similar issue to https://github.com/rancher/rancher/issues/19742

On two occasions now I’ve encountered separate customer issues where during a K8s upgrade, through the Rancher UI, the rolling upgrade will stall with this error:

Error starting driver: failed retrieving port for driver: listen tcp 127.0.0.1:0: bind: address already in use

It gets stuck here: https://github.com/rancher/kontainer-engine/blob/65780a8839e471d3842580d5d9075ae9a3b70a02/service/service.go#L400

This error occurs when the cluster controller inside Rancher tries to bind to a port that is already in use. Some customers report that the upgrade eventually self-heals and completes. The quick fix is to cycle the Rancher pods.
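The failing call is essentially Go's `net.Listen` on `127.0.0.1:0`, which asks the kernel for any free ephemeral port; when leaked connections exhaust the ephemeral range, that bind fails with the error above. A minimal sketch of the mechanism (the function name is illustrative, not from the Rancher source):

```go
package main

import (
	"fmt"
	"net"
)

// requestEphemeralPort mimics what the kontainer-engine driver service does:
// binding to 127.0.0.1:0 asks the kernel for any free ephemeral port. When
// leaked connections exhaust the ephemeral range, this is the call that fails
// with "bind: address already in use".
func requestEphemeralPort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	port, err := requestEphemeralPort()
	if err != nil {
		fmt.Println("bind failed:", err)
		return
	}
	fmt.Println("kernel assigned port", port)
}
```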

It’s sporadic in that it doesn’t appear to happen with each downstream cluster. The commonality I’ve seen with customers is this:

  • Custom cluster
  • Ubuntu 18.04
  • K8s version < 1.18.x

Result:

Other details that may be helpful:

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): v2.4.5
  • Installation option (single install/HA): HA

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Custom
  • Machine type (cloud/VM/metal) and specifications (CPU/memory): cloud and VMware
  • Kubernetes version (use kubectl version):
v1.17.6 --> v1.18.6
  • Docker version (use docker version):
(paste the output here)

gz#11024

gz#12126

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 21 (16 by maintainers)

Most upvoted comments

PRs merged into 2.5 and master

Confirmed @Oats87’s diagnosis.

The telemetry client neither releases nor reuses its connections to the Rancher API on each report() execution, which runs every 6 hours by default. Depending on the size of the system, the telemetry client opens a large number of connections: one to the Rancher API, one per cluster, one per project per cluster,… The number of open connections keeps growing until ephemeral ports are exhausted, generating listen tcp 127.0.0.1:0: bind: address already in use errors.

As a workaround, the telemetry client can be killed inside the Rancher docker container/pod to release the connections; Rancher should restart the telemetry client automatically.

Submitted PR https://github.com/rancher/telemetry/pull/47 to address the issue

To test it:

  1. Access the Rancher docker container/pod.
  2. Force a telemetry report() execution: curl -X POST http://localhost:8114/v1-telemetry/report
  3. Check the number of telemetry connections: netstat -natup | grep telemetry | wc -l
  4. Force another telemetry report() execution: curl -X POST http://localhost:8114/v1-telemetry/report
  5. Check the number of telemetry connections again: netstat -natup | grep telemetry | wc -l

The number of telemetry connections should grow on the current version, but should not grow on the patched version.