elemental: Upgrade is not triggered (sometimes)

This issue was spotted during the analysis of flaky CI tests (https://github.com/rancher/elemental/issues/383).

First of all, the issue is sporadic: sometimes the tests pass and sometimes they do not.

What steps did you take and what happened: I have one Elemental cluster node installed through the Rancher UI. The purpose of the test is to validate the upgrade of my cluster to this image: quay.io/costoolkit/elemental-ci:latest

To update my node, I create an OS Images Upgrades resource like this (see attached screenshots):
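For reference, the UI step above boils down to creating a ManagedOSImage resource. A minimal sketch of what that could look like, assuming the elemental.cattle.io/v1beta1 schema; the resource name and target cluster name are placeholders, not taken from the issue:

```yaml
# Hypothetical ManagedOSImage; names below are placeholders.
apiVersion: elemental.cattle.io/v1beta1
kind: ManagedOSImage
metadata:
  name: my-upgrade
  namespace: fleet-default
spec:
  # Image the nodes should be upgraded to (from the issue)
  osImage: quay.io/costoolkit/elemental-ci:latest
  # Target the downstream cluster by name (placeholder)
  clusterTargets:
    - clusterName: myelementalcluster
```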

Once done, I check that the cluster status changes from Active to Updating (see attached screenshots):

And this is where the issue happens: sometimes the upgrade is not triggered, the status stays Active, and we hit the timeout in the test.

Issue in the CI https://github.com/rancher/elemental/actions/runs/3225014880

Video https://user-images.githubusercontent.com/6025636/195086990-4a0224b8-048c-45f4-a0cc-6d6ded377d0c.mp4

What did you expect to happen: The upgrade should be triggered each time.

Anything else you would like to add: I tried increasing the timeouts in the Cypress code, and I also migrated to the latest Cypress version.

If I remember correctly, I was able to reproduce it manually once; in the end the upgrade was triggered, but something like 20 minutes after the OS Images Upgrades resource was created…

We do not see this issue in the backup e2e test, which also uses ManagedOSImage.

Environment:

  • Elemental release version (use cat /etc/os-release):
rancher-26437:~ # cat /etc/os-release 
NAME="SLE Micro"
VERSION="5.3"
VERSION_ID="5.3"
PRETTY_NAME="SUSE Linux Enterprise Micro for Rancher 5.3"
ID="sle-micro-rancher"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sle-micro-rancher:5.3"
IMAGE_REPO="registry.opensuse.org/isv/rancher/elemental/teal53/15.4/rancher/elemental-node-image/5.3"
IMAGE_TAG="20.14"
IMAGE="registry.opensuse.org/isv/rancher/elemental/teal53/15.4/rancher/elemental-node-image/5.3:20.14"
TIMESTAMP=20221004064711
GRUB_ENTRY_NAME="Elemental"
  • Rancher version: 2.6.8
  • Kubernetes version (use kubectl version): v1.23.6+k3s1
  • Cloud provider or hardware configuration: Libvirt

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 60 (60 by maintainers)

Most upvoted comments

Goddammit, found this in the downstream cluster's fleet-agent logs:

time="2022-11-21T13:14:52Z" level=info msg="purge requested for mcc-myelementalcluster-managed-system-upgrade-c-d6312"

So…something is purging it?
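To look for these purge events yourself, a sketch of the log check on the downstream cluster; the namespace is an assumption (cattle-fleet-system on recent Rancher releases, cattle-system on older ones):

```shell
# Hypothetical check: grep the fleet-agent logs for purge events.
kubectl -n cattle-fleet-system logs deploy/fleet-agent --since=1h \
  | grep "purge requested"
```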

After heavy testing, this seems to come from the Jobs somehow not selecting any nodes on which to schedule the pods.

So when triggering an OS upgrade we create a Bundle, which contains several things: service accounts, bindings, secrets, and a Plan.

A Chart is installed to deploy all those jobs (I think), and the Plan gets deployed along with the rest. Then a Job is created to run the upgrade pod, BUT this Job seems to fail as it cannot schedule any pods anywhere. Now I'm not sure why, after 30 minutes, things do get scheduled; still looking at that. Maybe there is an activeDeadlineSeconds that coincides with this timing and the Job is killed by then, then queued again? Need to check further.
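To make the scheduling hypothesis concrete, a sketch of the relevant part of a system-upgrade-controller Plan; the name, namespace, and selector values are assumptions, not from the issue. Note that activeDeadlineSeconds is not a Plan field: the controller applies it to the Jobs it generates, which is why a ~30 minute deadline could explain the Job being killed and re-queued:

```yaml
# Hypothetical upgrade.cattle.io/v1 Plan fragment.
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: os-upgrader        # placeholder
  namespace: cattle-system # placeholder
spec:
  concurrency: 1
  # If this label selector matches no nodes, the generated Job's pod
  # stays Pending and the upgrade never appears to start.
  nodeSelector:
    matchExpressions:
      - key: kubernetes.io/os
        operator: In
        values: ["linux"]
  upgrade:
    image: quay.io/costoolkit/elemental-ci:latest
```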

OK, let me add something to the code to upload the elemental-operator log at the end of every test.
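A sketch of what that CI step could run; the label selector and output file are assumptions (the operator does ship in the cattle-elemental-system namespace):

```shell
# Hypothetical log-collection step for the e2e teardown.
kubectl -n cattle-elemental-system logs -l app=elemental-operator \
  --tail=-1 > elemental-operator.log
```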