kubernetes: Kubelet gets "Timeout: Too large resource version" error from the API server after network outage

What happened: I disconnected a node from the network for a few minutes. After reconnecting it, kubelet on that node keeps logging the following errors, even 15 minutes after the connection was restored:

May 13 19:44:54 k8s-node04 kubelet[598]: I0513 19:44:54.043005     598 trace.go:116] Trace[1747581308]: "Reflector ListAndWatch" name:object-"kube-system"/"default-token-h8dz9" (started: 2020-05-13 19:44:14.938918643 +0000 UTC m=+81978.107654790) (total time: 39.10398118s):
May 13 19:44:54 k8s-node04 kubelet[598]: Trace[1747581308]: [39.10398118s] [39.10398118s] END
May 13 19:44:54 k8s-node04 kubelet[598]: E0513 19:44:54.043090     598 reflector.go:178] object-"kube-system"/"default-token-h8dz9": Failed to list *v1.Secret: Timeout: Too large resource version: 159128021, current: 159127032
May 13 19:45:16 k8s-node04 kubelet[598]: I0513 19:45:16.944515     598 trace.go:116] Trace[527369896]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:135 (started: 2020-05-13 19:44:37.84920601 +0000 UTC m=+82001.017941865) (total time: 39.095209656s):
May 13 19:45:16 k8s-node04 kubelet[598]: Trace[527369896]: [39.095209656s] [39.095209656s] END
May 13 19:45:16 k8s-node04 kubelet[598]: E0513 19:45:16.944595     598 reflector.go:178] k8s.io/client-go/informers/factory.go:135: Failed to list *v1beta1.RuntimeClass: Timeout: Too large resource version: 159128061, current: 159127066
May 13 19:45:23 k8s-node04 kubelet[598]: I0513 19:45:23.959866     598 trace.go:116] Trace[243135295]: "Reflector ListAndWatch" name:k8s.io/kubernetes/pkg/kubelet/kubelet.go:517 (started: 2020-05-13 19:44:44.860565979 +0000 UTC m=+82008.029301834) (total time: 39.099201281s):
May 13 19:45:23 k8s-node04 kubelet[598]: Trace[243135295]: [39.099201281s] [39.099201281s] END
May 13 19:45:23 k8s-node04 kubelet[598]: E0513 19:45:23.959947     598 reflector.go:178] k8s.io/kubernetes/pkg/kubelet/kubelet.go:517: Failed to list *v1.Service: Timeout: Too large resource version: 159128031, current: 159127042
May 13 19:45:32 k8s-node04 kubelet[598]: I0513 19:45:32.752744     598 trace.go:116] Trace[1950236492]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:135 (started: 2020-05-13 19:44:53.65385557 +0000 UTC m=+82016.822591425) (total time: 39.098776276s):
May 13 19:45:32 k8s-node04 kubelet[598]: Trace[1950236492]: [39.098776276s] [39.098776276s] END
May 13 19:45:32 k8s-node04 kubelet[598]: E0513 19:45:32.752831     598 reflector.go:178] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.CSIDriver: Timeout: Too large resource version: 159128079, current: 159127090
May 13 19:45:35 k8s-node04 kubelet[598]: I0513 19:45:35.670924     598 trace.go:116] Trace[1207388769]: "Reflector ListAndWatch" name:object-"kube-system"/"kube-router-token-4px26" (started: 2020-05-13 19:44:56.566459557 +0000 UTC m=+82019.735195412) (total time: 39.104363817s):
May 13 19:45:35 k8s-node04 kubelet[598]: Trace[1207388769]: [39.104363817s] [39.104363817s] END
May 13 19:45:35 k8s-node04 kubelet[598]: E0513 19:45:35.671005     598 reflector.go:178] object-"kube-system"/"kube-router-token-4px26": Failed to list *v1.Secret: Timeout: Too large resource version: 159128021, current: 159127032
May 13 19:46:05 k8s-node04 kubelet[598]: I0513 19:46:05.472918     598 trace.go:116] Trace[308823067]: "Reflector ListAndWatch" name:object-"kube-system"/"default-token-h8dz9" (started: 2020-05-13 19:45:26.359131486 +0000 UTC m=+82049.527867341) (total time: 39.113684635s):
May 13 19:46:05 k8s-node04 kubelet[598]: Trace[308823067]: [39.113684635s] [39.113684635s] END
May 13 19:46:05 k8s-node04 kubelet[598]: E0513 19:46:05.473007     598 reflector.go:178] object-"kube-system"/"default-token-h8dz9": Failed to list *v1.Secret: Timeout: Too large resource version: 159128021, current: 159127032

What you expected to happen: After the network recovers, I expect kubelet to reconnect to the API server as before, without these timeout errors.

How to reproduce it (as minimally and precisely as possible): Take a node that is joined to the cluster, disconnect it from the network for 3-4 minutes, reconnect it, and then observe kubelet's logs. One way to simulate the outage is sketched below.
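One way to approximate the disconnection without physically unplugging the node (assuming iptables is available and the API server listens on the default port 6443; adjust for your setup):

# on the node: block traffic to the API server for ~4 minutes, then restore it
iptables -I OUTPUT -p tcp --dport 6443 -j DROP
sleep 240
iptables -D OUTPUT -p tcp --dport 6443 -j DROP
# watch kubelet's logs after the rule is removed
journalctl -u kubelet -f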

Anything else we need to know?: I have strict TCP keepalive settings in place on the master and worker nodes, but they should not be the cause.

# sysctl -a|grep net.ipv4|grep tcp_keep
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 600
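With these values, keepalive gives up on an unresponsive peer of an otherwise idle connection after roughly tcp_keepalive_time + tcp_keepalive_probes * tcp_keepalive_intvl = 600 + 9 * 10 = 690 seconds, i.e. about 11.5 minutes.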

Restarting kubelet resolves the issue; the error messages disappear.
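On this kubeadm/systemd setup the restart is simply:

systemctl restart kubelet
journalctl -u kubelet -f   # the "Too large resource version" errors no longer appear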

Environment:

  • Kubernetes version (use kubectl version):
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T11:56:40Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T11:48:36Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/arm"}
  • Cloud provider or hardware configuration: bare metal
  • OS (e.g: cat /etc/os-release):
# cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel (e.g. uname -a):
# uname -a
Linux k8s-node04 5.4.40 #2 SMP Sun May 10 13:03:41 UTC 2020 aarch64 GNU/Linux
  • Install tools: kubeadm
  • Network plugin and version (if this is a network-related bug): kube-router
  • Others:

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 29 (11 by maintainers)


Most upvoted comments

I am trying to dig into this. I have written a small Go program that uses the same reflector kubelet uses: it starts a watch for CSIDrivers, and I patched client-go to print the last resourceVersion. Starting the program multiple times (i.e. connecting to different masters) produces the following:

$ ./main 
setLastSyncResourceVersion= 179271356
^C
$ ./main 
setLastSyncResourceVersion= 179271308
^C
$ ./main 
setLastSyncResourceVersion= 179271231
^C
$ ./main 
setLastSyncResourceVersion= 179271356
^C
$ ./main 
setLastSyncResourceVersion= 179271308
^C
$ ./main 
setLastSyncResourceVersion= 179271231
^C
$ ./main 
setLastSyncResourceVersion= 179271356
^C
$ ./main 
setLastSyncResourceVersion= 179271308
^C

Right now I don't know exactly which URL is being fetched and which arguments are passed in; I still have to figure that out. But it now seems that different masters return different resourceVersions. etcd does not report any problems.
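A minimal self-contained sketch of such a probe (not the commenter's exact program; the kubeconfig path and printing via the reflector's LastSyncResourceVersion() method, rather than patching client-go, are assumptions):

package main

import (
	"fmt"
	"time"

	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: reuse the kubelet's kubeconfig; any kubeconfig pointing at the masters works.
	config, err := clientcmd.BuildConfigFromFlags("", "/etc/kubernetes/kubelet.conf")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List/watch CSIDrivers, one of the resources shown in the kubelet errors above.
	lw := cache.NewListWatchFromClient(
		client.StorageV1().RESTClient(), "csidrivers", metav1.NamespaceAll, fields.Everything())
	reflector := cache.NewReflector(lw, &storagev1.CSIDriver{},
		cache.NewStore(cache.MetaNamespaceKeyFunc), 0)

	stopCh := make(chan struct{})
	go reflector.Run(stopCh)

	// Periodically print the resourceVersion the reflector is currently synced to.
	for {
		time.Sleep(2 * time.Second)
		fmt.Println("setLastSyncResourceVersion=", reflector.LastSyncResourceVersion())
	}
}

Running this against different masters (by pointing the kubeconfig at each one in turn) should make a resourceVersion skew between API servers visible, as in the output above.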