cluster-api: MHC shouldn't remediate the reported terminal failures

User Story

As an Operator, I would like MHC to skip remediation for specific failureReason and failureMessage values that represent terminal states, to avoid spamming the infrastructure and confusing users.

Detailed Description

Some infrastructure providers have terminal states where deleting and re-creating the Machine won't change the outcome (e.g. not enough quota, insufficient resources, etc.). One solution would be to allow matching against failureReason and/or failureMessage (similar to the taint/toleration mechanism) to select which failures we want to remediate; a rough sketch of what this could look like follows below.
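A minimal sketch of the idea, assuming a hypothetical failureSelector field on the MachineHealthCheck spec. No such field exists today; the field name, regex semantics, and apiVersion are illustrative only:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: md-0-mhc
spec:
  clusterName: my-cluster
  selector:
    matchLabels:
      cluster.x-k8s.io/deployment-name: md-0
  unhealthyConditions:
    - type: Ready
      status: "False"
      timeout: 300s
  # Hypothetical field: remediation only happens for failures matching a rule
  # with remediate: true; terminal failures are surfaced but left alone.
  failureSelector:
    - failureReason: "CreateError"
      remediate: true
    - failureMessage: ".*InsufficientInstanceCapacity.*"  # terminal: retrying won't help
      remediate: false
```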

Anything else you would like to add:

cc @vincepri @JoelSpeed @fabriziopandini


/kind feature

About this issue

  • State: open
  • Created 4 years ago
  • Comments: 25 (20 by maintainers)

Most upvoted comments

Just read the issue to catch up on context. @JoelSpeed’s proposal seems useful for the specific use case he describes above, i.e. “if a machine is backed by an AWS spot instance and the maximum spot price is below the current price, it will fail to launch”. However, it doesn’t seem to solve the original use case described in the issue: a “global terminal failure” that no amount of retrying is going to fix. For example, the OS image provided doesn’t exist, or the kubeadm config has an invalid value. In these cases we wouldn’t want to retry at all, even periodically with backoff, since the failure won’t resolve itself unless the user intervenes.

That being said, can you expand on why nodeStartupTimeout doesn’t cover the same use case for Machines that fail while coming up? Wouldn’t nodeStartupTimeout also apply to Machines that “failed” to provision?
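For context, nodeStartupTimeout is an existing field on the MachineHealthCheck spec that marks a Machine unhealthy if its Node never joins the cluster within the given window. A minimal example with illustrative values (the apiVersion depends on the Cluster API release in use):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: md-0-startup-mhc
spec:
  clusterName: my-cluster
  maxUnhealthy: 40%
  # If a Machine's Node has not joined the cluster within this window,
  # the Machine is considered unhealthy and remediated.
  nodeStartupTimeout: 10m
  selector:
    matchLabels:
      cluster.x-k8s.io/deployment-name: md-0
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
```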

@yastij @CecileRobertMichon This seems like a behavioral breaking change that we should try to get into v0.4.0 if possible.