jx: "Bootjob" task doesn't work in remote cluster

Hi!

I think there is an error in the remote cluster configuration and jx operations. I followed the Multicluster documentation and created two Kubernetes clusters in AWS. The process works with some repositories in the first (main) cluster, but when I release one of these repositories or modify the jx-requirements in the remote cluster, the boot job task in the remote cluster never finishes (it ends after 30 min by timeout 😉).


The final log from this build:

+ echo 'viewing the git operator boot job log for commit sha: 937a6d0d45dd02d7bbe6bd944e58672e97ebeef5'
+ jx admin log --verbose --commit-sha 937a6d0d45dd02d7bbe6bd944e58672e97ebeef5
viewing the git operator boot job log for commit sha: 937a6d0d45dd02d7bbe6bd944e58672e97ebeef5
waiting for the Git Operator to be ready in namespace jx-git-operator...
pod jx-git-operator-75d5b6b469-2hkzp has status Ready
the Git Operator is running in pod jx-git-operator-75d5b6b469-2hkzp
waiting for boot Job pod with selector app=jx-boot in namespace jx-git-operator for commit SHA 937a6d0d45dd02d7bbe6bd944e58672e97ebeef5...
error: failed to wait for active Job in namespace jx-git-operator with selector app=jx-boot: timed out after waiting for duration 30m0s
Pipeline failed on stage 'from-build-pack' : container 'step-admin-log'. The execution of the pipeline has stopped.
  1. jx admin log succeeds
  2. jx health status --all-namespaces is OK
  3. My jx version is 3.2.127
  4. My Kubernetes cluster version is 1.19
  5. jx gitops lint is OK in both cluster repositories
  6. jx helmfile resolve doesn't flag anything

Apparently, the boot job process doesn't see the pod in the remote cluster. Do you have any ideas? Is it a bug? Thank you!

It may be related to: https://github.com/jenkins-x/jx/issues/7761

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 15 (1 by maintainers)

Most upvoted comments

I spent some time looking at this and I believe it boils down to two issues. I've been able to hack together a fix for this in my own environment (EKS). For simplicity, I'll refer to the cluster running jx/tekton/lighthouse as the “dev cluster” and the remote cluster as the “prod cluster”:

  1. jx admin log uses the Kubernetes service account presented to the pod to look for the jx-boot pod. It ends up using credentials for the local cluster, but the jx-boot pod it's looking for actually runs on the remote cluster, so it obviously never finds it and times out. To fix this, I created a custom step in my release pipeline that installs the jx, aws, and kubectl CLIs, pulls in the kubeconfig from AWS (aws eks --region XXXX update-kubeconfig --name XXXX), and runs jx admin log against the commit sha (as the admin-log step did previously), which now correctly targets the right cluster (a sketch of such a step is shown at the end of this comment).
  2. The aws-auth ConfigMap in the prod cluster has its mapRoles field populated with the role created by the jx Terraform configuration. However, for a request from one EKS cluster to reach the other, the prod cluster's ConfigMap must also have the dev cluster's tekton-bot role mapped.

old:

- "groups":
  - "system:bootstrappers"
  - "system:nodes"
  "rolearn": "arn:aws:iam::*:role/prod-cluster-role"

new:

- "groups":
  - "system:bootstrappers"
  - "system:nodes"
  "rolearn": "arn:aws:iam::*:role/prod-cluster-role"
- "groups":
  - "system:masters"
  "rolearn": "arn:aws:iam::*:role/dev-cluster-tekton-bot"
  "username": "tekton-bot"

After making these changes, I've been able to merge changes to our production cluster and view the jx admin log in the build logs as expected. I'm sure there's a better way to do this, but I figured I would share it in case anyone else finds it useful.
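For reference, a minimal sketch of the custom step from point 1 above, assuming an EKS setup; the region, cluster name, and the PULL_BASE_SHA variable are placeholders to adapt to your own pipeline:

#!/bin/sh
# Hypothetical replacement for the stock admin-log step: point the kubeconfig
# at the remote (prod) cluster before tailing the boot job log there.
aws eks --region eu-west-1 update-kubeconfig --name prod-cluster
# jx now talks to the prod cluster, where the jx-boot pod actually runs.
jx admin log --verbose --commit-sha "$PULL_BASE_SHA"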

Yes, I have the same error with my Multi-cluster setup.

~ jx version
version: 3.2.202

Boot jobs triggered by the production environment repo start on the dev cluster and never finish, timing out after 30m. I believe the issue stems from the trigger.yaml file here.
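For context, the release pipeline (and with it the admin-log step) is wired up by the Lighthouse trigger file; a sketch of what the stock jx3 layout looks like, with illustrative names:

apiVersion: config.lighthouse.jenkins-x.io/v1alpha1
kind: TriggerConfig
spec:
  postsubmits:
  - name: release
    context: "release"
    source: "release.yaml"   # the pipeline that contains the admin-log step
    branches:
    - ^main$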

These failing builds then cause a second issue when the GC job starts: jx gitops gc activities throws an error as it tries to delete PipelineRun objects related to the ‘production’ build.

Either way, it seems like boot jobs should not run for environments with remoteCluster: true enabled.

apiVersion: core.jenkins-x.io/v4beta1
kind: Requirements
spec:
  environments:
  - key: jx3-prod-cluster
    gitKind: github
    gitServer: https://github.com
    owner: company
    promotionStrategy: Manual
    remoteCluster: true
    repository: jx3-prod-cluster

timed-out ‘production’ bootjob log example:

Showing logs for build "company/jx3-prod-cluster/main #6" stage "from-build-pack" and container "place-tools"
2021/10/11 23:12:53 Copied /ko-app/entrypoint to /tekton/tools/entrypoint

Showing logs for build "company/jx3-prod-cluster/main #6" stage "from-build-pack" and container "place-scripts"
2021/10/11 23:12:54 Decoded script /tekton/scripts/script-0-4b987
2021/10/11 23:12:54 Decoded script /tekton/scripts/script-1-hlq8t

Showing logs for build "company/jx3-prod-cluster/main #6" stage "from-build-pack" and container "working-dir-initializer"

Showing logs for build "company/jx3-prod-cluster/main #6" stage "from-build-pack" and container "step-git-clone"
git cloning url: https://github.com/company/jx3-prod-cluster.git version main@dbf314683069dbe797a3ef2915afe179d90bb0a3 to dir: source
Cloning into 'source'...
HEAD is now at dbf3146 chore: regenerated
checked out revision: main@dbf314683069dbe797a3ef2915afe179d90bb0a3 to dir: source

Showing logs for build "company/jx3-prod-cluster/main #6" stage "from-build-pack" and container "step-admin-log"
viewing the git operator boot job log for commit sha: dbf314683069dbe797a3ef2915afe179d90bb0a3
waiting for the Git Operator to be ready in namespace jx-git-operator...

WARNING: the git operator pod has failed but will restart
to view the log of the failed git operator pod run: kubectl logs -n jx-git-operator jx-boot-b3a7cf68-6d83-4735-a0e9-d16a66b5f1d2-sqc6c


WARNING: the git operator pod has failed but will restart
to view the log of the failed git operator pod run: kubectl logs -n jx-git-operator jx-boot-b3a7cf68-6d83-4735-a0e9-d16a66b5f1d2-sm5v4


WARNING: the git operator pod has failed but will restart
to view the log of the failed git operator pod run: kubectl logs -n jx-git-operator jx-boot-b3a7cf68-6d83-4735-a0e9-d16a66b5f1d2-gv8vb

pod jx-git-operator-864cbb4d9-r7rwt has status Ready
the Git Operator is running in pod jx-git-operator-864cbb4d9-r7rwt

waiting for boot Job pod with selector app=jx-boot in namespace jx-git-operator for commit SHA dbf314683069dbe797a3ef2915afe179d90bb0a3...
error: failed to wait for active Job in namespace jx-git-operator with selector app=jx-boot: timed out after waiting for duration 30m0s
"
Pipeline failed on stage 'from-build-pack' : container 'step-admin-log'. The execution of the pipeline has stopped."

The problem here is not that the boot log does not work; the deployments are updated and everything else works (on my machines). But the “admin-log” step never finishes, and no logs are returned to the dashboard anyway, so it just sits and waits.

So, what you should do to “fix” this is to remove the admin-log step from the release pipeline in the cluster repo. The file .lighthouse/jenkins-x/release.yaml looks like this:

        steps:
        - image: uses:jenkins-x/jx3-pipeline-catalog/tasks/git-clone/git-clone.yaml@versionStream
          name: ""
          resources: {}
        - name: admin-log
          resources:

change it to

        steps:
        - image: uses:jenkins-x/jx3-pipeline-catalog/tasks/git-clone/git-clone.yaml@versionStream
          name: ""
          resources: {}

There’s a third piece of the pie here: you have to update IAM so that the tekton-bot role is allowed to assume prod-cluster-role. Per discussion on the jx Slack, I'm planning to add some docs around all the steps needed for this: https://kubernetes.slack.com/archives/C9MBGQJRH/p1637685695225600
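A minimal sketch of that IAM piece, assuming EKS and using placeholder account IDs and role names: attach an inline policy to the dev cluster's tekton-bot role granting sts:AssumeRole on the prod role:

# Hypothetical inline policy; role names and account ID are placeholders.
aws iam put-role-policy \
  --role-name dev-cluster-tekton-bot \
  --policy-name assume-prod-cluster-role \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::111111111111:role/prod-cluster-role"
    }]
  }'

The prod-cluster-role's trust policy must likewise list the tekton-bot role as a principal allowed to assume it.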

Hi, I have the same issue in my multi-cluster setup when I merge a PR on the remote repository; on the dev/staging cluster everything works as expected. I have applied the solution proposed in the Multi-Cluster Example:

“⚠️ For cluster auto-update support both the Lighthouse and jxboot-helmfile-resources charts must be removed.”

But unfortunately it didn’t help.
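In case it helps anyone reproduce that step: removing those two charts means deleting their release entries from the remote cluster repo's helmfile. A sketch, assuming the stock jx3 cluster template layout (chart names and file path may differ in your repo):

# helmfiles/jx/helmfile.yaml in the remote cluster repo
releases:
# Per the Multi-Cluster docs, these two release entries are removed
# for remote clusters with auto-update support:
# - chart: jxgh/jxboot-helmfile-resources
# - chart: jxgh/lighthouse
- chart: jxgh/jx-kh-check   # remaining releases stay untouched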
