kubernetes: range_allocator.go crashes the entire KCM if CIDRs are incorrect for a single node

What happened:

In cases where a single rogue node has the wrong CIDR, the entire KCM can go down. A problem of this kind should not be able to take the KCM down. The clearest example: you have a 50-node Linux cluster, experimentally add a new node, and modify its podCIDR. In that situation you would expect the point modifications you made to the new node not to hurt the other healthy nodes, which all have correct pod CIDRs.

I1028 12:47:44.800292       1 range_allocator.go:82] Sending events to api server.
I1028 12:47:44.800655       1 range_allocator.go:116] No Secondary Service CIDR provided. Skipping filtering out secondary service addresses.
I1028 12:47:44.800673       1 range_allocator.go:125] Node ubuntuk8s has CIDR 100.1.1.0/24, occupying it in CIDR map
I1028 12:47:44.800700       1 range_allocator.go:125] Node win-h0c364gqvjh has CIDR 100.1.2.2/24, occupying it in CIDR map
E1028 12:47:44.800758       1 controllermanager.go:537] Error starting "nodeipam"
F1028 12:47:44.800786       1 controllermanager.go:249] error starting controllers: failed to mark cidr[100.1.2.0/24] at idx [0] as occupied for node: win-h0c364gqvjh: cidr 100.1.2.0/24 is out the range of cluster cidr 100.1.1.0/24
goroutine 157 [running]:
k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0xc00011c001, 0xc000bc8c00, 0xe8, 0x1e9)
        /workspace/anago-v1.19.3-rc.0.69+37babbd0e76c11/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:996 +0xb9
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).output(0x6a620c0, 0xc000000003, 0x0, 0x0, 0xc0002d25b0, 0x68bad49, 0x14, 0xf9, 0x0)
        /workspace/anago-v1.19.3-rc.0.69+37babbd0e76c11/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:945 +0x191
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).printf(0x6a620c0, 0xc000000003, 0x0, 0x0, 0x4477426, 0x1e, 0xc000f1dad0, 0x1, 0x1)
        /workspace/anago-v1.19.3-rc.0.69+37babbd0e76c11/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:733 +0x17a
k8s.io/kubernetes/vendor/k8s.io/klog/v2.Fatalf(...)
        /workspace/anago-v1.19.3-rc.0.69+37babbd0e76c11/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1456
k8s.io/kubernetes/cmd/kube-controller-manager/app.Run.func1(0x4a654c0, 0xc000a5a880)
        /workspace/anago-v1.19.3-rc.0.69+37babbd0e76c11/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-controller-manager/app/controllermanager.go:249 +0x68e
created by k8s.io/kubernetes/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
        /workspace/anago-v1.19.3-rc.0.69+37babbd0e76c11/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:208 +0x113

What you expected to happen:

A rogue Windows node with a bad CIDR would not break the KCM entirely; instead, the controller would log the error and continue. No individual node's metadata should be sufficient to bring down the entire KCM.
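
For illustration, here is a minimal standalone Go sketch of that log-and-continue behaviour. It is not the actual range_allocator.go code: the function name, the simplified containment check (net.IPNet.Contains on the subnet's base address), and the use of the standard log package instead of klog are all assumptions made to keep the example self-contained.

        // Sketch only: occupy a node's podCIDR if it is valid, otherwise log and skip
        // instead of aborting controller startup.
        package main

        import (
                "log"
                "net"
        )

        func occupyIfValid(clusterCIDR *net.IPNet, nodeName, podCIDR string, occupied map[string]bool) bool {
                ip, cidr, err := net.ParseCIDR(podCIDR)
                if err != nil {
                        log.Printf("node %s has unparsable podCIDR %q, skipping: %v", nodeName, podCIDR, err)
                        return false
                }
                if !clusterCIDR.Contains(ip) {
                        // This is the case that currently bubbles up to klog.Fatalf in the KCM.
                        log.Printf("node %s podCIDR %s is outside cluster CIDR %s, skipping", nodeName, cidr, clusterCIDR)
                        return false
                }
                occupied[cidr.String()] = true
                return true
        }

        func main() {
                _, clusterCIDR, _ := net.ParseCIDR("100.1.1.0/24")
                occupied := map[string]bool{}

                // The two nodes from the log above: one valid, one rogue.
                occupyIfValid(clusterCIDR, "ubuntuk8s", "100.1.1.0/24", occupied)
                occupyIfValid(clusterCIDR, "win-h0c364gqvjh", "100.1.2.0/24", occupied)

                log.Printf("occupied CIDRs: %v", occupied) // the process keeps running either way
        }

A real fix would still need to decide what to do with the skipped node (for example, emit an event and retry), but the point is that one bad node should not be fatal at startup.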

How to reproduce it (as minimally and precisely as possible):

Set up a two-node cluster with node CIDR allocation enabled (--allocate-node-cidrs=true). Then set a node's podCIDR to an invalid value, one that mismatches or is outside the range of the --cluster-cidr argument; the entire kube-controller-manager will then go down. A hedged client-go sketch of this step follows below.
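
The following reproduction sketch is illustrative only: the node name (win-h0c364gqvjh, taken from the log above) and kubeconfig path are placeholders, and it assumes the API server accepts the patch (on most versions spec.podCIDR is only settable while still empty, so this is best run against a freshly joined node). A kubectl edit/patch of the node spec achieves the same thing, as the comment below notes.

        // Sketch: patch spec.podCIDR to a value outside --cluster-cidr, then restart the KCM.
        package main

        import (
                "context"
                "fmt"

                metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
                "k8s.io/apimachinery/pkg/types"
                "k8s.io/client-go/kubernetes"
                "k8s.io/client-go/tools/clientcmd"
        )

        func main() {
                cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config") // placeholder path
                if err != nil {
                        panic(err)
                }
                cs, err := kubernetes.NewForConfig(cfg)
                if err != nil {
                        panic(err)
                }

                // 100.1.2.0/24 is outside a --cluster-cidr of 100.1.1.0/24.
                patch := []byte(`{"spec":{"podCIDR":"100.1.2.0/24","podCIDRs":["100.1.2.0/24"]}}`)
                _, err = cs.CoreV1().Nodes().Patch(context.TODO(), "win-h0c364gqvjh",
                        types.MergePatchType, patch, metav1.PatchOptions{})
                fmt.Println("patch result:", err)
                // On the next kube-controller-manager start, the nodeipam controller hits the
                // "out the range of cluster cidr" error and klog.Fatalf exits the process.
        }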

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.19


Most upvoted comments

Yeah, a kubectl edit of the node spec is enough to bring down the entire KCM 😃