system-upgrade-controller: Spawned jobs use broken (truncated) image names
Version: system-upgrade-controller v0.9.1 (79da9f0)
Platform/Architecture: linux-amd64
Describe the bug
The jobs spawned by SUC use an image name truncated at 64 characters. As a result, Kubernetes cannot run the jobs, since the truncated images simply do not exist.
To Reproduce
We experienced the bug with the following Plan. Note that I cannot post company-internal image and plan names, so I have replaced them; the images are not publicly available in any case. Note also that we have configured a registry mirror so that we do not have to specify an explicit registry. The key point for reproduction is that the image name must be longer than 64 characters. In our case, the image names are that long because we use a deeply nested group structure in GitLab to organize our code base.
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: xxxxxx
spec:
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/master
        operator: In
        values:
          - "true"
  upgrade:
    image: aa/bbbb/ccccccccccccccccc/dddddddddddddddddd/eeeeeeeeee/xxxxxx/master
    command: ["/usr/local/bin/bash"]
    args: ["/root/install.sh"]
  concurrency: 1
  version: "981a1f90cfcad63ea864957f996cde9b2e99e587"
SUC picks up this plan and spawns a job to execute it. Kubernetes then creates pods accordingly, which fail because the truncated image does not exist. I was not able to describe the job or its pods because SUC seems to constantly replace the job. Using the configmap, however, I was able to enable debug mode (SYSTEM_UPGRADE_CONTROLLER_DEBUG: true), which logs the job spec as JSON. This reveals the truncated image name in spec.template.spec.containers[0].image: aa/bbbb/ccccccccccccccccc/dddddddddddddddddd/eeeeeeeeee/xxxxxx/m
Expected behavior
I expect the spawned jobs to use the correct image name. In this example, this would be aa/bbbb/ccccccccccccccccc/dddddddddddddddddd/eeeeeeeeee/xxxxxx/master:981a1f90cfcad63ea864957f996cde9b2e99e587.
Actual behavior
The jobs use an image name truncated at 64 characters: aa/bbbb/ccccccccccccccccc/dddddddddddddddddd/eeeeeeeeee/xxxxxx/m.
Additional context
None, but I'm happy to provide more information if required.
About this issue
- State: closed
- Created 2 years ago
- Reactions: 2
- Comments: 15 (6 by maintainers)
Seems to be stuck in a loop and the jobs never get created.
The "not found" errors for both the plan and job resources seem to indicate that some of the informer caches aren't getting started properly? If I delete the SUC pod, the next controller pod that comes up creates the job successfully: