longhorn: [BUG] Some taint tolerations can not be set, but Longhorn should run on all nodes

Describe the bug

It is not possible to set all possible taint tolerations. Note that tools like RKE set taints such as node-role.kubernetes.io/etcd=true:NoExecute or node-role.kubernetes.io/controlplane=true:NoSchedule. The description in the Longhorn UI says:

Because kubernetes.io is used as the key of all Kubernetes default tolerations, it should not be used in the toleration settings.

This suggests that such tolerations can be set, but in fact they are not allowed at all. My understanding from other bug reports is that Longhorn should be deployed on all nodes (see e.g. https://github.com/longhorn/longhorn/issues/1633#issuecomment-702967887), because a pod using a Longhorn volume could be scheduled anywhere (and have special tolerations). IMHO this is contradictory.

To Reproduce

Steps to reproduce the behavior:

Go to the Longhorn UI
Set a toleration like node-role.kubernetes.io/controlplane=true:NoSchedule, try to save changes
Validation fails with

fail to set settings with invalid taint-toleration: value node-role.kubernetes.io/controlplane=true:NoSchedule of settings taint-toleration is invalid: the value of taint-toleration is invalid: the key of Longhorn toleration setting cannot contain “kubernetes.io” since this substring is considered as the key of Kubernetes default tolerations
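For readers who want to see the shape of the rule behind this error, here is a minimal sketch of it. It is not the longhorn-manager code (that is Go, linked under Additional context); the function name `parse_toleration`, the `Toleration` tuple, and the constant name are made up for illustration, and the only thing taken from the report is that a key containing the substring `kubernetes.io` is rejected.

```python
from typing import NamedTuple

# Assumption: the substring named in the error message quoted above.
RESERVED_KEY_SUBSTRING = "kubernetes.io"


class Toleration(NamedTuple):
    """Hypothetical container for one parsed `key=value:effect` entry."""
    key: str
    value: str
    effect: str


def parse_toleration(setting: str) -> Toleration:
    """Parse one `key=value:effect` entry and apply the reserved-key check."""
    # Split off the effect at the last ':' so the key may contain '/' or '.'.
    key_value, sep, effect = setting.rpartition(":")
    if not sep:
        raise ValueError(f"missing ':effect' in {setting!r}")
    key, _, value = key_value.partition("=")
    # This is the check the report objects to: any key containing the
    # reserved substring is refused, regardless of value or effect.
    if RESERVED_KEY_SUBSTRING in key:
        raise ValueError(
            f"key {key!r} contains {RESERVED_KEY_SUBSTRING!r} and is rejected"
        )
    return Toleration(key, value, effect)
```

With this sketch, `parse_toleration("example.com/storage=true:NoSchedule")` succeeds, while the toleration from step 2 raises `ValueError`, which is the behavior described in this report.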

Expected behavior

allow taint toleration for all taints that can exist
also allow wildcard tolerations - see https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#concepts (“There are two special cases” box)
(potentially) improve description in Longhorn UI
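To make the wildcard request concrete, the following is a small sketch of how Kubernetes matches a toleration against a taint, including the two special cases from the linked documentation: an empty key with operator `Exists` matches every taint, and an empty effect matches every effect for the given key. The `Taint` and `Toleration` classes and the `tolerates` function are illustrative names, not Kubernetes or Longhorn APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key (with Exists) matches any key
    operator: str = "Equal"  # Kubernetes defaults the operator to Equal
    value: str = ""
    effect: str = ""         # empty effect matches any effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    # Equal requires matching values. Note that the Kubernetes API itself
    # rejects an empty key combined with Equal; this sketch does not validate.
    return tol.operator == "Equal" and tol.value == taint.value
```

For example, `Toleration(operator="Exists")` tolerates the RKE etcd taint `Taint("node-role.kubernetes.io/etcd", "true", "NoExecute")`, which is the kind of catch-all toleration the second bullet asks Longhorn to accept.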

Log

Environment:

Longhorn version: 1.0.2
Kubernetes version: 1.18.9
Node OS type and version: Ubuntu 18.04

Additional context

See https://github.com/longhorn/longhorn-manager/blob/55c8453c22b0f47da54bb6922e0ac97c37154b75/upgrade/v1alpha1/types/setting.go#L439 (validation rule and code for parsing)

About this issue

State: closed
Created 4 years ago
Comments: 15 (13 by maintainers)

Most upvoted comments

@PhanLe1010 Not sure why the upgrade process can fail due to conflict (since there should be only one manager at the time of upgrade). Can you look into it?

@khushboo-rancher can you file an issue so we can track it?

yasker on Nov 30, 2020

OK, it seems we can partly solve this problem by adding one more annotation regarding which toleration is added by Longhorn. Though if both Longhorn and Kubernetes add the same toleration, then we might still accidentally remove it. I guess that would be fine for now.

yasker on Oct 27, 2020
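The annotation idea in the Oct 27 comment could look roughly like the sketch below. This is only an illustration of the approach, not the fix that was merged: the annotation key, the function name, and the representation of tolerations as plain dicts are all assumptions.

```python
import json

# Hypothetical annotation key; the real key, if any, is not given in this thread.
ANNOTATION = "example.longhorn.io/last-applied-tolerations"


def update_tolerations(existing, annotations, desired):
    """Replace only the tolerations that were previously recorded as Longhorn-added.

    existing:    tolerations currently on the workload (list of dicts)
    annotations: current annotations on the workload (dict of str -> str)
    desired:     tolerations from the Longhorn setting (list of dicts)
    Returns (new_tolerations, new_annotations).
    """
    previously_added = json.loads(annotations.get(ANNOTATION, "[]"))
    # Keep everything that is not recorded as added by Longhorn earlier.
    kept = [t for t in existing if t not in previously_added]
    merged = kept + [t for t in desired if t not in kept]
    new_annotations = dict(annotations)
    new_annotations[ANNOTATION] = json.dumps(desired)
    return merged, new_annotations
```

The caveat from the comment shows up directly: if Kubernetes and Longhorn both added an identical toleration, it is recorded in the annotation and removed on the next update, because the sketch cannot tell the two copies apart.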

In fact, workloads shouldn’t be deployed on a node where node-role.kubernetes.io/controlplane=true:NoSchedule is set. Only the worker nodes that can run workloads need to have Longhorn installed. We don’t need to install Longhorn on control plane or etcd nodes, so we don’t expect the user to set those taint tolerations.

yasker on Oct 14, 2020