machine-api-operator: MachineHealthCheck fights with MachineSet on invalid configuration
The enhancement doc for the Machine lifecycle states that a Machine will move to the Failed phase if there is an error in the Machine configuration that precludes even attempting to create an instance with the provider.
The MachineHealthCheck controller will immediately delete any Machine in the Failed phase. On platforms that put Machines into the Failed phase due to invalid configuration, the two controllers will fight: the MachineHealthCheck deletes the Machine, the MachineSet controller recreates it from the same invalid template, and the cycle repeats indefinitely.
The following actuators are affected:
- AWS
- Azure
- GCP
- OpenStack
- libvirt
- oVirt
- kubevirt
- kubemark
The vSphere actuator won’t display this behaviour until #735 is merged. The baremetal (Metal³) actuator doesn’t currently return InvalidConfiguration errors, but this is planned for the future.
One possible solution is for the MachineHealthCheck not to queue failed Machines for immediate deletion when the ErrorReason is InvalidConfiguration, and instead to delete them only after a timeout. One obstacle is that the ErrorReason is not currently recorded by the Machine controller; that will be fixed by #701.
About this issue
- State: closed
- Created 4 years ago
- Comments: 20 (20 by maintainers)
Looks like this will be addressed by #814 (openshift/enhancements#673).
@mshitrit This is relevant to the conversation we had last week
The CAPBM actuator currently returns an invalid configuration error only if the fields are missing, so it matches your description.