elemental: Upgrade does not trigger (sometimes)
This issue was spotted during the analysis of flaky CI tests (https://github.com/rancher/elemental/issues/383).
First of all, the issue is sporadic: sometimes the tests pass and sometimes they do not.
What steps did you take and what happened:
I have one Elemental cluster node installed through the Rancher UI.
The purpose of the test is to validate the upgrade of my cluster to this image: quay.io/costoolkit/elemental-ci:latest
To update my node, I create an OS Images Upgrades resource like the one sketched below.
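For reference, a minimal sketch of what that resource looks like (field names taken from the elemental-operator ManagedOSImage CRD; the resource name and cluster name below are placeholders, not the exact values used in the test):

```sh
# Sketch of the OS Images Upgrades (ManagedOSImage) resource; names are placeholders
kubectl apply -f - <<'EOF'
apiVersion: elemental.cattle.io/v1beta1
kind: ManagedOSImage
metadata:
  name: upgrade-to-latest
  namespace: fleet-default
spec:
  # Image the nodes should be upgraded to
  osImage: quay.io/costoolkit/elemental-ci:latest
  # Target the downstream cluster created through the Rancher UI
  clusterTargets:
    - clusterName: my-elemental-cluster
EOF
```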
Once done, I check that the cluster status changes from the Active to the Updating state:
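(Outside the UI, a roughly equivalent check can be run against the Rancher local cluster; a sketch, where the cluster name and namespace are placeholders:)

```sh
# Watch the provisioning cluster object that backs the cluster shown in the Rancher UI
kubectl get clusters.provisioning.cattle.io -n fleet-default my-elemental-cluster -w
```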
And this is where the issue happens: sometimes the upgrade is not triggered, the status stays Active, and we hit the timeout in the test.
Issue in the CI https://github.com/rancher/elemental/actions/runs/3225014880
Video https://user-images.githubusercontent.com/6025636/195086990-4a0224b8-048c-45f4-a0cc-6d6ded377d0c.mp4
What did you expect to happen: The upgrade should be triggered each time.
Anything else you would like to add: I tried increasing the timeouts in the Cypress code and I also migrated to the latest Cypress version.
If I remember correctly, I was able to reproduce it manually once; in the end the upgrade was triggered, but something like 20 minutes after the OS Images Upgrades resource got created…
We do not see this issue in the backup e2e test, even though we also use a ManagedOSImage there.
Environment:
- Elemental release version (use `cat /etc/os-release`):
rancher-26437:~ # cat /etc/os-release
NAME="SLE Micro"
VERSION="5.3"
VERSION_ID="5.3"
PRETTY_NAME="SUSE Linux Enterprise Micro for Rancher 5.3"
ID="sle-micro-rancher"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sle-micro-rancher:5.3"
IMAGE_REPO="registry.opensuse.org/isv/rancher/elemental/teal53/15.4/rancher/elemental-node-image/5.3"
IMAGE_TAG="20.14"
IMAGE="registry.opensuse.org/isv/rancher/elemental/teal53/15.4/rancher/elemental-node-image/5.3:20.14"
TIMESTAMP=20221004064711
GRUB_ENTRY_NAME="Elemental"
- Rancher version: 2.6.8
- Kubernetes version (use `kubectl version`): v1.23.6+k3s1
- Cloud provider or hardware configuration: Libvirt
About this issue
- State: closed
- Created 2 years ago
- Comments: 60 (60 by maintainers)
Goddammit, found this in the downstream cluster fleet-agent logs:
time="2022-11-21T13:14:52Z" level=info msg="purge requested for mcc-myelementalcluster-managed-system-upgrade-c-d6312"
So…something is purging it?
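In case someone wants to check the same on their setup, a sketch of how that log line can be pulled (namespace and deployment name are what Fleet usually uses on downstream clusters; adjust if they differ):

```sh
# fleet-agent runs on the downstream cluster in cattle-fleet-system
kubectl logs -n cattle-fleet-system deployment/fleet-agent | grep -i purge
```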
After heavy testing, this seems to come from the Jobs somehow not selecting any nodes to schedule the pods on.
So when triggering an OS upgrade we create a Bundle, which has several things inside: accounts, bindings, secrets, and a Plan.
A Chart is installed to deploy all those jobs (I think) and the Plan gets deployed along with the rest of the stuff. Then a Job is created to run the upgrade pod, BUT this Job seems to fail as it cannot schedule any pods anywhere. Now I'm not sure why things do get scheduled after 30 minutes; still looking at that. Maybe there is an `activeDeadlineSeconds` that syncs with this, so the Job is killed by that time and then queued again? Need to check further.

Ok, let me add something in the code to upload the elemental-operator log at the end of every test.