rancher: Bug: kubectl hangs on 'rollout status' when using Rancher API

What kind of request is this (question/bug/enhancement/feature request): Bug

Steps to reproduce (least amount of steps as possible):

  1. Point kubectl to your Rancher kubeconfig for a cluster running Kubernetes 1.13.5
  2. Use kubectl client 1.14 or 1.15 and check rollout status with a command like:
 kubectl -n cattle-system rollout status deployment rancher
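
A minimal end-to-end reproduction of the two steps above, as a sketch (the kubeconfig path is an assumption; use whatever file you downloaded from the Rancher UI):

# Point kubectl at the Rancher-generated kubeconfig (path is an assumption)
export KUBECONFIG=~/Downloads/rancher-cluster.yaml
# Client should report 1.14.x or 1.15.x to reproduce the hang
kubectl version --short
# This command hangs before eventually completing
kubectl -n cattle-system rollout status deployment rancher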

Result: Rollout status will hang for 8 to 15 minutes before completing.

Other details that may be helpful:

  1. If you use kubectl 1.13 against the Rancher API on a 1.13.5 cluster, it does not hang
  2. If you use kubectl 1.13/1.14/1.15 against the kube-apiserver directly, it does not hang
  3. Bypassing the load balancer with kubectl 1.14/1.15 while still going through the Rancher API still results in the rollout status hanging (see the sketch below)
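
The comparison in items 1–3 boils down to running the same command through two kubeconfigs, one pointing at the Rancher API and one at the kube-apiserver directly; a sketch, with the file names as assumptions:

# Same client, same command, two paths; only the Rancher-proxied path hangs
# with kubectl 1.14/1.15. Kubeconfig file names are assumptions.
KUBECONFIG=rancher-proxy-kubeconfig.yaml kubectl -n cattle-system rollout status deployment rancher
KUBECONFIG=direct-apiserver-kubeconfig.yaml kubectl -n cattle-system rollout status deployment rancher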

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag or shown bottom left in the UI): 2.1.8
  • Installation option (single install/HA): HA

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): local rancher cluster and downstream cluster of unknown provider type.

  • Machine type (cloud/VM/metal) and specifications (CPU/memory): 16 GB of RAM, CPU unknown

  • Kubernetes version (use kubectl version): 1.13.5

  • Docker version (use docker version): 18.09.2; full docker info output:

Containers: 40
 Running: 37
 Paused: 0
 Stopped: 3
Images: 41
Server Version: 18.09.2
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: bb71b10fd8f58240ca47fbb579b9d1028eea7c84
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.15.0-46-generic
Operating System: Ubuntu 16.04.6 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.67GiB
Name: drekar-rancher-na-np-a
ID: SH6R:CDOC:LJ6N:W25N:F36U:B2ZH:X7LB:WFSX:WJLG:MWRH:N23E:EP2G
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

WARNING: No swap limit support

SURE-4958

About this issue

  • State: open
  • Created 5 years ago
  • Reactions: 3
  • Comments: 35

Most upvoted comments

We have identified the root cause, and it turns out to be related to HTTP/2. Disabling HTTP/2 in the nginx ingress resolves the issue.

kubectl edit cm nginx-configuration -n ingress-nginx

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration
  namespace: ingress-nginx
data:
  http2-max-field-size: 32k
  http2-max-header-size: 64k
  proxy-body-size: 1024m
  proxy-read-timeout: "1024"
  use-forwarded-headers: "true"
  # use-http2: "false" is the setting that resolves the hang
  use-http2: "false"
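
A non-interactive equivalent of the edit above, as a sketch (the ConfigMap name and namespace are taken from the kubectl edit command; adjust them if your ingress-nginx deployment differs):

# Merge-patch just the use-http2 key instead of editing the whole ConfigMap
kubectl -n ingress-nginx patch configmap nginx-configuration \
  --type merge -p '{"data":{"use-http2":"false"}}'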

Rancher version: 2.6.9, kubectl version: 1.20.11. I added the annotation nginx.ingress.kubernetes.io/http2-push-preload: 'false' on the Ingress and kubectl no longer hangs. Full annotations:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    meta.helm.sh/release-name: rancher
    meta.helm.sh/release-namespace: cattle-system
    nginx.ingress.kubernetes.io/http2-push-preload: 'false'
    nginx.ingress.kubernetes.io/proxy-connect-timeout: '200'
    nginx.ingress.kubernetes.io/proxy-read-timeout: '1800'
    nginx.ingress.kubernetes.io/proxy-send-timeout: '1800'
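
The same annotation can be applied without opening an editor; a sketch, assuming the Rancher Ingress object is named rancher in the cattle-system namespace (as the Helm release annotations above suggest):

# Add or replace the annotation that stops kubectl from hanging
kubectl -n cattle-system annotate ingress rancher \
  nginx.ingress.kubernetes.io/http2-push-preload='false' --overwrite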

I am having the same problem using Rancher v2.6.4 and Kubernetes v1.22.7. When connecting via an authorized cluster endpoint instead of via the Rancher authentication proxy, this issue does not occur. Also, on our clusters running Rancher v2.5.1 and Kubernetes v1.19.7 with the same configurations/pipelines, we're not experiencing this issue. The difference with the clusters using the newer Rancher version is that, for those, Rancher is hosted on an Azure cluster instead of on our own VPS using Docker.
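
For reference, switching between the Rancher authentication proxy and the authorized cluster endpoint is just a kubeconfig context change; a sketch, with the context name as an assumption (Rancher-generated kubeconfigs list the direct contexts alongside the proxied default when the authorized cluster endpoint is enabled):

# Show both the proxied and the direct (authorized cluster endpoint) contexts
kubectl config get-contexts
# Context name is an assumption; pick one that points at the cluster endpoint directly
kubectl config use-context my-cluster-direct
kubectl -n cattle-system rollout status deployment rancher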

This issue also happens when deleting a pod, even when the kubectl client version matches the cluster version. We've tested with the proxied config directly on the Rancher node so that requests didn't have to go through a load balancer or any other hops except Rancher itself.