machine-api-operator: MachineHealthCheck fights with MachineSet on invalid configuration
The enhancement doc for the Machine lifecycle states that a Machine will move to the Failed phase if there is an error in the Machine configuration that precludes even attempting to create an instance with the provider.
The MachineHealthCheck controller will immediately delete any Machine in the Failed phase. On platforms that put Machines into the Failed phase due to invalid configuration, the two controllers will fight: the MachineHealthCheck deletes the Machine, the MachineSet controller recreates it from the same invalid template, and the cycle repeats indefinitely.
The following actuators are affected:
- AWS
- Azure
- GCP
- OpenStack
- libvirt
- oVirt
- kubevirt
- kubemark
The vSphere actuator won’t display this behaviour until #735 is merged. The baremetal (Metal³) actuator doesn’t currently return InvalidConfiguration errors, but this is planned for the future.
One possible solution is for the MachineHealthCheck not to queue failed Machines for immediate deletion when the ErrorReason is InvalidConfiguration, and instead to delete them only after a timeout. One obstacle is that the ErrorReason is not currently recorded by the Machine controller; that will be fixed by #701.
About this issue
- State: closed
- Created 4 years ago
- Comments: 20 (20 by maintainers)
Looks like this will be addressed by #814 (openshift/enhancements#673).
@mshitrit This is relevant to the conversation we had last week
The CAPBM actuator currently returns an invalid configuration error only if the fields are missing, so it matches your description.