jx: "Bootjob" task doesn't work in remote cluster
Hi!
I think there is an error in my remote cluster configuration and jx operations. I followed the Multicluster documentation and created two Kubernetes clusters in AWS. The process works with some repositories in the first (main) cluster, but when I release one of these repositories or modify the jx-requirements in the remote cluster, the bootjob tasks in the remote cluster never finish (they are killed after 30 min by timeout).
The final log of this build:
+ echo 'viewing the git operator boot job log for commit sha: 937a6d0d45dd02d7bbe6bd944e58672e97ebeef5'
--
+ jx admin log --verbose --commit-sha 937a6d0d45dd02d7bbe6bd944e58672e97ebeef5
viewing the git operator boot job log for commit sha: 937a6d0d45dd02d7bbe6bd944e58672e97ebeef5
waiting for the Git Operator to be ready in namespace jx-git-operator...
pod jx-git-operator-75d5b6b469-2hkzp has status Ready
the Git Operator is running in pod jx-git-operator-75d5b6b469-2hkzp
--
waiting for boot Job pod with selector app=jx-boot in namespace jx-git-operator for commit SHA 937a6d0d45dd02d7bbe6bd944e58672e97ebeef5...
error: failed to wait for active Job in namespace jx-git-operator with selector app=jx-boot: timed out after waiting for duration 30m0s
--
Pipeline failed on stage 'from-build-pack' : container 'step-admin-log'. The execution of the pipeline has stopped.
- The `jx admin log` succeeds
- The `jx health status --all-namespaces` is OK
- My `jx version` is 3.2.127
- My Kubernetes cluster version is 1.19
- My `jx gitops lint` is OK in the two cluster repositories
- My `jx helmfile resolve` doesn't flag anything
Apparently, the boot job process doesn't see the pod in the remote cluster. Do you have any ideas? Is it a bug? Thank you.
Maybe it is related to: https://github.com/jenkins-x/jx/issues/7761
I spent some time looking at this and I believe it boils down to two issues. I've been able to hack together a fix in my own environment (EKS). For simplicity, I'll refer to the cluster running jx/tekton/lighthouse as the "dev cluster" and the remote cluster as the "prod cluster":
`jx admin log` uses the k8s service account presented to the pod to look for the jx-boot pod. It therefore ends up using credentials for the local k8s cluster, but the jx-boot pod it's looking for actually runs on the remote cluster, so it obviously never finds it and times out. To fix this, I created a custom step in my release pipeline that installs the jx, aws, and kubectl CLIs, pulls in the kubeconfig from AWS (`aws eks --region XXXX update-kubeconfig --name XXXX`), and runs `jx admin log` against the commit SHA (as in the previous admin-log step), except it now correctly targets the right cluster.
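A rough sketch of what such a step could look like (a guess, not the author's actual snippet): it assumes an EKS prod cluster named prod-cluster in us-east-1, an image with the aws and jx CLIs available, and that Lighthouse's PULL_BASE_SHA environment variable is set; all names and the image tag are illustrative.

```yaml
- name: admin-log-remote
  image: ghcr.io/jenkins-x/jx-boot:latest   # illustrative image
  script: |
    #!/usr/bin/env sh
    # point the kubeconfig at the remote (prod) cluster instead of the
    # local cluster credentials presented by the pod's service account
    aws eks --region us-east-1 update-kubeconfig --name prod-cluster
    # the boot job lookup now runs against the right cluster
    jx admin log --verbose --commit-sha "$PULL_BASE_SHA"
```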
After making these changes, I've been able to merge changes to our production cluster and view the `jx admin log` output in the build logs as expected. I'm sure there's a better way to do this, but I figured I'd share it in case anyone else finds it useful.

Yes, I have the same error with my multi-cluster setup: `bootjobs` triggered by the production environment repo start on the dev cluster and never finish, timing out after 30m. I believe the issue stems from the trigger.yaml file here (sketched below).
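For reference, a minimal sketch of what a cluster repo's .lighthouse/jenkins-x/triggers.yaml typically contains (structure from memory, so treat it as illustrative); the postsubmit entry is what fires the release pipeline, and with it the admin-log step, on the dev cluster:

```yaml
apiVersion: config.lighthouse.jenkins-x.io/v1alpha1
kind: TriggerConfig
spec:
  postsubmits:
    # fires on merges to the main branch and runs release.yaml,
    # which includes the admin-log step
    - name: release
      source: release.yaml
      branches:
        - ^main$
```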
These failing builds then cause a second issue when the GC job starts: `jx gitops gc activities` throws an error as it tries to delete PipelineRun objects related to the "production" build.

Either way, it seems like `bootjobs` should not run for environments with `remoteCluster: true` enabled.
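For context, `remoteCluster: true` is set per environment in the dev cluster repo's jx-requirements.yml; a minimal sketch, with owner/repository values as illustrative placeholders:

```yaml
apiVersion: core.jenkins-x.io/v4beta1
kind: Requirements
spec:
  environments:
    - key: dev
    - key: staging
    - key: production
      owner: my-org          # illustrative
      repository: jx3-prod   # illustrative
      remoteCluster: true    # reconciled by the prod cluster's own git operator
```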
The problem here is not that the boot log does not work: the deployments are updated and everything else works (on my machines), but the "admin-log" step never finishes. Besides, no logs are returned to the dashboard anyway, so it just sits and waits.
So, what you should do to "fix" this is remove the admin-log step from the release pipeline in the cluster repo, i.e. from the file .lighthouse/jenkins-x/release.yaml, sketched below.
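A minimal sketch of that file, assuming the steps are inlined rather than pulled from the pipeline catalog via a `uses:` image; step contents and image are illustrative. The fix is simply to delete the admin-log step:

```yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: release
spec:
  pipelineSpec:
    tasks:
      - name: from-build-pack
        taskSpec:
          steps:
            - name: release
              image: ghcr.io/jenkins-x/jx-boot:latest   # illustrative
              script: |
                #!/usr/bin/env sh
                make release
            # delete this whole step: it is the one that waits on the
            # (remote) boot job and never finishes
            - name: admin-log
              image: ghcr.io/jenkins-x/jx-boot:latest   # illustrative
              script: |
                #!/usr/bin/env sh
                jx admin log --verbose --commit-sha "$PULL_BASE_SHA"
```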
There's a third piece of the pie here: you have to add an IAM trust policy allowing the tekton-bot role to assume prod-cluster-role. Per discussion on the jx Slack, I'm planning to add some docs around all the steps needed for this: https://kubernetes.slack.com/archives/C9MBGQJRH/p1637685695225600
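One way to express that trust policy, sketched as a CloudFormation snippet (the account ID and role names are illustrative):

```yaml
ProdClusterRole:
  Type: AWS::IAM::Role
  Properties:
    RoleName: prod-cluster-role
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Action: sts:AssumeRole
          Principal:
            # the dev cluster's tekton-bot IAM role
            AWS: arn:aws:iam::111111111111:role/tekton-bot
```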
Hi, I have the same issue on my multi-cluster setup when I merge a PR on the remote repository; on the dev/staging cluster everything works as expected. I have applied the solution proposed in the Multi-Cluster Example:
"⚠️ For cluster auto-update support both the Lighthouse and jxboot-helmfile-resources charts must be removed."
But unfortunately it didn't help.