longhorn: [BUG] Some taint tolerations can not be set, but Longhorn should run on all nodes
Describe the bug
It is not possible to set all possible taint tolerations. Note that tools like RKE set taints, e.g. node-role.kubernetes.io/etcd=true:NoExecute or node-role.kubernetes.io/controlplane=true:NoSchedule.
The description in the Longhorn UI says
Because kubernetes.io is used as the key of all Kubernetes default tolerations, it should not be used in the toleration settings.
This suggests that such tolerations can be set - but in fact, it is not allowed at all. And my understanding from other bug reports is that Longhorn should be deployed on all nodes (see e.g. https://github.com/longhorn/longhorn/issues/1633#issuecomment-702967887) because a pod using a Longhorn volume could be deployed anywhere (and have special tolerations). Imho this is contradictory.
To Reproduce
Steps to reproduce the behavior:
- Go to the Longhorn UI
- Set a toleration like node-role.kubernetes.io/controlplane=true:NoSchedule and try to save the changes
- Validation fails with:
fail to set settings with invalid taint-toleration: value node-role.kubernetes.io/controlplane=true:NoSchedule of settings taint-toleration is invalid: the value of taint-toleration is invalid: the key of Longhorn toleration setting cannot contain “kubernetes.io” since this substring is considered as the key of Kubernetes default tolerations
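The behavior described by this error message can be sketched roughly as follows. This is not Longhorn's actual code (see the link under Additional context for that); `Toleration`, `parseToleration`, and the `dedicated=storage:NoSchedule` example value are hypothetical names chosen for illustration, and the sketch only assumes the `key=value:effect` format used above plus the substring check quoted in the error.

```go
package main

import (
	"fmt"
	"strings"
)

// Toleration is a simplified, hypothetical stand-in for the Kubernetes type.
type Toleration struct {
	Key    string
	Value  string
	Effect string
}

// parseToleration is a hypothetical helper that parses "key=value:effect"
// and rejects keys containing "kubernetes.io", as the error above describes.
func parseToleration(s string) (Toleration, error) {
	parts := strings.Split(s, ":")
	if len(parts) != 2 {
		return Toleration{}, fmt.Errorf("expected key=value:effect, got %q", s)
	}
	kv := strings.Split(parts[0], "=")
	if len(kv) != 2 {
		return Toleration{}, fmt.Errorf("expected key=value before the colon, got %q", parts[0])
	}
	// Only the three taint effects defined by Kubernetes are accepted.
	switch parts[1] {
	case "NoSchedule", "NoExecute", "PreferNoSchedule":
	default:
		return Toleration{}, fmt.Errorf("unknown effect %q", parts[1])
	}
	// Substring check: this is what blocks taints such as the ones RKE sets.
	if strings.Contains(kv[0], "kubernetes.io") {
		return Toleration{}, fmt.Errorf("key %q contains \"kubernetes.io\"", kv[0])
	}
	return Toleration{Key: kv[0], Value: kv[1], Effect: parts[1]}, nil
}

func main() {
	for _, s := range []string{
		"dedicated=storage:NoSchedule",
		"node-role.kubernetes.io/controlplane=true:NoSchedule",
	} {
		tol, err := parseToleration(s)
		fmt.Printf("%s -> %+v, err=%v\n", s, tol, err)
	}
}
```

With a check like this, any key under a `kubernetes.io` domain is rejected regardless of who set the taint, which is the limitation this report is about.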
Expected behavior
- allow taint toleration for all taints that can exist
- also allow wildcard tolerations - see https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#concepts (“There are two special cases” box)
- (potentially) improve description in Longhorn UI
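To make the wildcard request concrete, here is a minimal sketch of the matching rules from the linked Kubernetes documentation: an empty key with operator Exists matches every taint, and an empty effect matches every effect for the given key. The `Taint` and `Toleration` structs and the `tolerates` function are simplified stand-ins written for this illustration, not the Kubernetes or Longhorn types.

```go
package main

import "fmt"

// Taint and Toleration are simplified, hypothetical stand-ins for the
// Kubernetes types; only the fields needed for matching are included.
type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key, Operator, Value, Effect string
}

// tolerates reports whether tol matches taint under the documented rules.
func tolerates(tol Toleration, taint Taint) bool {
	// An empty effect on the toleration matches every effect.
	if tol.Effect != "" && tol.Effect != taint.Effect {
		return false
	}
	// An empty key (used together with Exists) matches every key.
	if tol.Key != "" && tol.Key != taint.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "", "Equal": // Equal is the default operator
		return tol.Value == taint.Value
	default:
		return false
	}
}

func main() {
	etcd := Taint{Key: "node-role.kubernetes.io/etcd", Value: "true", Effect: "NoExecute"}
	controlplane := Taint{Key: "node-role.kubernetes.io/controlplane", Value: "true", Effect: "NoSchedule"}

	wildcard := Toleration{Operator: "Exists"}
	anyEffect := Toleration{Key: "node-role.kubernetes.io/controlplane", Operator: "Exists"}

	fmt.Println(tolerates(wildcard, etcd))
	fmt.Println(tolerates(anyEffect, controlplane))
	fmt.Println(tolerates(anyEffect, etcd))
}
```

Neither special case can be expressed in the `key=value:effect` setting format, since both rely on leaving the key or the effect empty.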
Log
Environment:
- Longhorn version: 1.0.2
- Kubernetes version: 1.18.9
- Node OS type and version: Ubuntu 18.04
Additional context
See https://github.com/longhorn/longhorn-manager/blob/55c8453c22b0f47da54bb6922e0ac97c37154b75/upgrade/v1alpha1/types/setting.go#L439 (validation rule and code for parsing)
About this issue
- State: closed
- Created 4 years ago
- Comments: 15 (13 by maintainers)
@PhanLe1010 Not sure why the upgrade process can fail due to conflict (since there should be only one manager at the time of upgrade). Can you look into it?
@khushboo-rancher can you file an issue so we can track it?
OK, it seems we can partly solve this problem by adding one more annotation regarding which toleration is added by Longhorn. Though if both Longhorn and Kubernetes add the same toleration, then we might still accidentally remove it. I guess that would be fine for now.
In fact, the workload shouldn’t be deployed when node-role.kubernetes.io/controlplane=true:NoSchedule is set. Only the worker nodes that can run workloads need to have Longhorn installed. We don’t need to install Longhorn on control/etcd nodes, so we don’t expect the user to set those taint tolerations.