cluster-api: MHC shouldn't remediate reported terminal failures
User Story
As an Operator I would like MHC to skip remediation for specific failureReason and failureMessage values that represent terminal states, to avoid spamming the infrastructure and confusing users.
Detailed Description
Some infrastructure providers have terminal states where deleting and re-creating the Machine won't change the outcome (e.g. not enough quota, insufficient resources, etc.). A solution to this would be to allow matching against failureReason and/or failureMessage (the same mechanism as taints/tolerations) to select which failures we want to remediate, as sketched below.
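To make the idea concrete, here is a minimal sketch in Go of what a toleration-style matcher could look like. It is purely illustrative: the FailureMatcher type, the shouldRemediate helper, and the example reason strings are assumptions for discussion, not part of the actual cluster-api API.

```go
// Hypothetical sketch: none of these types exist in cluster-api today.
// The idea mirrors taints/tolerations: an MHC would only remediate a failed
// Machine when its failureReason/failureMessage matches one of the selectors.
package main

import (
	"fmt"
	"strings"
)

// FailureMatcher selects failures by reason and/or message substring,
// analogous to how a toleration matches a taint by key/value/effect.
type FailureMatcher struct {
	// Reason matches the Machine's failureReason exactly; empty matches any reason.
	Reason string
	// MessageContains matches a substring of the Machine's failureMessage;
	// empty matches any message.
	MessageContains string
}

// Matches reports whether a failure (reason, message) is selected by this matcher.
func (m FailureMatcher) Matches(reason, message string) bool {
	if m.Reason != "" && m.Reason != reason {
		return false
	}
	if m.MessageContains != "" && !strings.Contains(message, m.MessageContains) {
		return false
	}
	return true
}

// shouldRemediate decides whether MHC should remediate, given a hypothetical
// "remediate only these failures" list on the MHC spec.
func shouldRemediate(matchers []FailureMatcher, reason, message string) bool {
	// An empty list preserves today's behavior: remediate every failure.
	if len(matchers) == 0 {
		return true
	}
	for _, m := range matchers {
		if m.Matches(reason, message) {
			return true
		}
	}
	return false
}

func main() {
	// Only remediate creation errors; treat quota exhaustion as terminal.
	matchers := []FailureMatcher{
		{Reason: "CreateError"},
	}
	fmt.Println(shouldRemediate(matchers, "CreateError", "instance failed to launch"))  // true
	fmt.Println(shouldRemediate(matchers, "InsufficientResources", "not enough quota")) // false: terminal, skip remediation
}
```

Leaving the matcher list empty keeps the current behavior, so existing MachineHealthCheck objects would be unaffected by such an addition.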
Anything else you would like to add:
cc @vincepri @JoelSpeed @fabriziopandini
/kind feature
About this issue
- Original URL
- State: open
- Created 4 years ago
- Comments: 25 (20 by maintainers)
Just read the issue to catch up on context. @JoelSpeed's proposal seems useful for the specific use case he describes above, i.e. "if a machine is backed by an AWS spot instance but the maximum spot price is too low for the current price, it will fail to launch", but it doesn't seem to solve the original use case described in the issue, i.e. the idea of a "global terminal failure" where no amount of retrying is going to fix it. For example, the OS image provided doesn't exist, or the kubeadm config has an invalid value. In these cases we wouldn't want to retry at all, even periodically with a backoff, since the failure won't magically fix itself unless the user intervenes.
That being said, can you expand on why nodeStartupTimeout doesn't cover the same use case for machines that fail to come up? Wouldn't the nodeStartupTimeout also apply to Machines that "failed" to provision?

@yastij @CecileRobertMichon This seems like a breaking behavioral change that we should try to get into v0.4.0 if possible
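For context on that question, the sketch below (illustrative only, in the same hedged style as above; neither helper exists in cluster-api) contrasts the two mechanisms: nodeStartupTimeout is time-based and would keep triggering remediation on each re-created Machine if the underlying cause is terminal, while a reason-based check could short-circuit the retry loop immediately.

```go
// Illustrative comparison of a time-based startup check versus a
// reason-based terminal-failure check. Both helpers are hypothetical.
package main

import (
	"fmt"
	"time"
)

// nodeStartupTimedOut reports whether a Machine that has no Node yet has
// exceeded the MHC's nodeStartupTimeout since creation (the time-based path).
func nodeStartupTimedOut(createdAt time.Time, hasNode bool, timeout time.Duration, now time.Time) bool {
	return !hasNode && now.Sub(createdAt) > timeout
}

// isTerminalFailure reports whether a reported failureReason is one that
// retrying cannot fix (the reason-based path proposed in this issue).
// The set of reasons here is an example, not an authoritative list.
func isTerminalFailure(failureReason string) bool {
	switch failureReason {
	case "InvalidConfiguration", "InsufficientResources":
		return true
	}
	return false
}

func main() {
	created := time.Now().Add(-30 * time.Minute)
	// Time-based: remediation fires only after the timeout elapses, and it
	// would fire again on every re-created Machine if the cause is terminal.
	fmt.Println(nodeStartupTimedOut(created, false, 10*time.Minute, time.Now())) // true
	// Reason-based: the failure is recognized as terminal right away, with no retry loop.
	fmt.Println(isTerminalFailure("InvalidConfiguration")) // true
}
```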