machine-api-operator: MachineHealthCheck fights with MachineSet on invalid configuration

The enhancement doc for the Machine lifecycle states that a Machine will move to the Failed phase if there is an error in the Machine configuration that precludes trying to create a provider.

The MachineHealthCheck controller immediately deletes any Machine in the Failed phase. On platforms that put Machines into the Failed phase because of invalid configuration, the MachineSet controller then creates a replacement from the same invalid configuration, which fails and is deleted in turn. The result is a fight between the two controllers, which constantly create and delete Machines.

The following actuators are affected:

  • AWS
  • Azure
  • GCP
  • OpenStack
  • libvirt
  • oVirt
  • kubevirt
  • kubemark

The vSphere actuator won’t display this behaviour until #735 is merged. The baremetal (Metal³) actuator doesn’t currently return InvalidConfiguration errors, but this is planned for the future.

One possible solution is for the MachineHealthCheck controller not to queue failed Machines for immediate deletion when the ErrorReason is InvalidConfiguration, and instead to delete them only after a timeout. One obstacle is that the ErrorReason is not currently recorded by the Machine controller. That will be fixed by #701.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 20 (20 by maintainers)

Most upvoted comments

Looks like this will be addressed by #814 (openshift/enhancements#673).

@mshitrit This is relevant to the conversation we had last week.

I’m wondering whether the ProviderSpec can be invalid at one point in time but become valid later. For example, in the baremetal case, the image URL might not be reachable when the Machine is created, but become reachable afterwards.

In that case I would say the actuator should not return an invalid configuration error: if you consider this a non-terminal error, you should instead requeue after some time period. Is this something you’d expect to happen often?

The CAPBM actuator currently returns an invalid configuration error only if fields are missing, so it matches your description.