fleet: Helm installation fails because of capability check
Hello,
Trying to install traefik on rancher v2.5.8 (fleet v0.3.5) fails, with the bundle stuck in a Processed
state, when the supplied values require a capability check.
The fleet.yaml looks like this:
defaultNamespace: traefik-experimental
helm:
  releaseName: traefik-experimental
  chart: https://internal-registry.example.com/repository/helm-proxy-traefik/traefik-9.18.2.tgz
  values:
    deployment:
      replicas: 2
    podDisruptionBudget:
      enabled: true
      minAvailable: 1
    globalArguments: []
    service:
      type: NodePort
    logs:
      general:
        level: INFO
      access:
        enabled: true
    ingressRoute:
      dashboard:
        annotations:
          kubernetes.io/ingress.class: experimental
    ingressClass:
      enabled: true
The error message when looking at the bundle in the UI is this:
template: traefik/templates/ingressclass.yaml:7:8: executing "traefik/templates/ingressclass.yaml" at <fail "\n\n ERROR: You must have atleast networking.k8s.io/v1beta1 to use ingressClass">: error calling fail: ERROR: You must have atleast networking.k8s.io/v1beta1 to use ingressClass
I have verified that I can install it manually using helm install,
so it is not a problem with the version of the underlying Kubernetes cluster.
The problematic part is this:
ingressClass:
  enabled: true
If I remove that part, it works with fleet as well. What actually happens when it is set can be seen in the chart template (https://github.com/traefik/traefik-helm-chart/blob/v9.18.2/traefik/templates/ingressclass.yaml), which starts like this:
{{- if and .Values.ingressClass.enabled (semverCompare ">=2.3.0" (default .Chart.AppVersion .Values.image.tag)) -}}
{{- if .Capabilities.APIVersions.Has "networking.k8s.io/v1/IngressClass" }}
apiVersion: networking.k8s.io/v1
{{- else if .Capabilities.APIVersions.Has "networking.k8s.io/v1beta1/IngressClass" }}
apiVersion: networking.k8s.io/v1beta1
{{- else }}
{{- fail "\n\n ERROR: You must have atleast networking.k8s.io/v1beta1 to use ingressClass" }}
{{- end }}
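To make the branch logic easier to follow, here is a minimal Go sketch that renders a simplified copy of the check above with the standard library's text/template package. It is not the chart or Helm itself: versionSet, render and the fail helper are hypothetical stand-ins, and the semverCompare guard is left out.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"text/template"
)

// versionSet is a hypothetical stand-in for the set of API versions
// that the template sees through .Capabilities.APIVersions.
type versionSet []string

// Has reports whether the given API version or kind is in the set.
func (v versionSet) Has(api string) bool {
	for _, known := range v {
		if known == api {
			return true
		}
	}
	return false
}

// ingressClassTmpl is a simplified copy of the chart's branch logic.
const ingressClassTmpl = `{{- if .APIVersions.Has "networking.k8s.io/v1/IngressClass" -}}
apiVersion: networking.k8s.io/v1
{{- else if .APIVersions.Has "networking.k8s.io/v1beta1/IngressClass" -}}
apiVersion: networking.k8s.io/v1beta1
{{- else -}}
{{- fail "You must have atleast networking.k8s.io/v1beta1 to use ingressClass" -}}
{{- end -}}`

// render executes the simplified template against the given API versions.
func render(apis versionSet) (string, error) {
	funcs := template.FuncMap{
		// fail aborts rendering with the given message (assumed helper).
		"fail": func(msg string) (string, error) { return "", errors.New(msg) },
	}
	t, err := template.New("ingressclass").Funcs(funcs).Parse(ingressClassTmpl)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := t.Execute(&out, struct{ APIVersions versionSet }{apis}); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	// Only core "v1" is known, which is what I suspect fleet passes in.
	_, err := render(versionSet{"v1"})
	fmt.Println("defaults only:", err)

	// A set that includes the IngressClass kind renders normally.
	out, err := render(versionSet{"v1", "networking.k8s.io/v1/IngressClass"})
	fmt.Println(out, err)
}
```

With only "v1" in the set, neither Has call matches and rendering ends in the fail branch, which is the same shape as the error shown in the UI.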
Looking at the helm code in fleet I have noticed the use of DefaultCapabilities here: https://github.com/rancher/fleet/blob/d21bb02b29ef15bdc689d65a2b2ac918d0d0cae9/pkg/helmdeployer/template.go#L27
If I am reading the go.mod file in fleet v0.3.5 correctly, this results in the following data, which, based on the comment, appears to include only Core V1 (“v1”): https://github.com/rancher/helm/blob/v3.3.3-fleet1/pkg/chartutil/capabilities.go#L27-L41
IngressClass is not part of core v1 (https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.19/#ingressclass-v1-networking-k8s-io) and I am wondering if this might be the reason the check fails.
If I understand it correctly there is support in the helm library to fetch capabilities when calling Run() at https://github.com/rancher/fleet/blob/d21bb02b29ef15bdc689d65a2b2ac918d0d0cae9/pkg/helmdeployer/deployer.go#L336 but I am not sure if this is involved at the time the error above is thrown.
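A minimal sketch of what I suspect is going on, assuming the capabilities lookup returns a preset value whenever one is already configured. All types and names here (capabilities, config, discover, clusterDiscovery) are hypothetical stand-ins written for illustration, not the actual Helm or fleet code.

```go
package main

import "fmt"

// capabilities is a hypothetical, trimmed-down stand-in for the
// capabilities object handed to chart templates.
type capabilities struct {
	APIVersions []string
}

// has reports whether an API version or kind is known.
func (c *capabilities) has(api string) bool {
	for _, known := range c.APIVersions {
		if known == api {
			return true
		}
	}
	return false
}

// config is a hypothetical stand-in for the helm action configuration.
type config struct {
	Capabilities *capabilities
	// discover is an assumed hook that would query the real cluster.
	discover func() *capabilities
}

// getCapabilities returns preset capabilities when they exist and only
// falls back to discovery when nothing was configured beforehand.
func (c *config) getCapabilities() *capabilities {
	if c.Capabilities != nil {
		return c.Capabilities
	}
	c.Capabilities = c.discover()
	return c.Capabilities
}

// clusterDiscovery pretends to be a cluster that serves IngressClass.
func clusterDiscovery() *capabilities {
	return &capabilities{APIVersions: []string{"v1", "networking.k8s.io/v1/IngressClass"}}
}

func main() {
	const kind = "networking.k8s.io/v1/IngressClass"

	// Capabilities preset to a static default (assumed to hold only core "v1").
	preset := &config{
		Capabilities: &capabilities{APIVersions: []string{"v1"}},
		discover:     clusterDiscovery,
	}
	fmt.Println("preset defaults:", preset.getCapabilities().has(kind))

	// No preset, so the cluster is actually asked.
	empty := &config{discover: clusterDiscovery}
	fmt.Println("discovered:", empty.getCapabilities().has(kind))
}
```

If the real code behaves like this, the cluster is never consulted once the defaults have been assigned, no matter what the cluster supports.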
About this issue
- State: closed
- Created 3 years ago
- Comments: 42 (15 by maintainers)
Digging around some more, I searched the code base for “Processed”, since this is the state the bundle is stuck in, and found a reference in https://github.com/rancher/fleet/blob/v0.3.5/pkg/controllers/bundle/controller.go#L51-L59, which sounds reasonable given that it is a bundle that is stuck.
Looking at the BundleGeneratingHandler h.OnBundleChange led me to https://github.com/rancher/fleet/blob/v0.3.5/pkg/controllers/bundle/controller.go#L118-L135, where setResourceKey() is called, which ends up calling the Template() function where I found the chartutil.DefaultCapabilities usage above: https://github.com/rancher/fleet/blob/v0.3.5/pkg/controllers/bundle/controller.go#L157
So inspecting what Template() actually does (https://github.com/rancher/fleet/blob/v0.3.5/pkg/helmdeployer/template.go), it sets a few interesting things:
- useGlobalCfg: true in the fleet custom helm struct.
- h.globalCfg.Capabilities = chartutil.DefaultCapabilities
It then calls h.Deploy(), which eventually calls h.install(), and this in turn creates an actual helm action cfg variable using h.getCfg(namespace, options.ServiceAccount): https://github.com/rancher/fleet/blob/v0.3.5/pkg/helmdeployer/deployer.go#L275
What does getCfg do? Well, it returns the previously configured h.globalCfg if h.useGlobalCfg is true, which was set to true by Template(): https://github.com/rancher/fleet/blob/v0.3.5/pkg/helmdeployer/deployer.go#L246-L248
We then create a helm install object from that cfg: https://github.com/rancher/fleet/blob/v0.3.5/pkg/helmdeployer/deployer.go#L316
… and now looking at the call to caps, err := i.cfg.getCapabilities() in u.Run(): https://github.com/rancher/helm/blob/v3.3.3-fleet1/pkg/action/action.go#L236-L240
… since Capabilities is not nil (it was configured in Template()), it just returns those Capabilities with no further lookups. So now I feel fairly confident that we do not investigate any actual cluster capabilities, and this should be the reason fleet fails. It is of course possible I am missing something; unfortunately I have not been able to test my theory at this point.
Thanks for the replies.
I updated my fleet helm chart from 0.5.1 to 0.6.0 (that updated my fleet-controller as well; previously I had only updated the fleet-agents) and now it works as expected.
And my fleet-controller configmap looks like this:
I accidentally set the milestone to 2023-Q2-v2.6x instead of 2023-Q2-v2.7x. Fixed.
@manno the fleet-agents are using 0.7.0-rc1, take a look at the configmap I provided.
I can confirm as well that installing cert-manager v1.11.0 with rancher/fleet-agent:v0.7.0-AGENT-rc.1 is working as expected
Thanks @janosmiko for looking into this. I will double-check with the team to understand why the issue still occurs.
@manno where exactly did #1287 land? In 0.5.x or in 0.6.x? Or both?
Hit the same issue with cert-manager 1.11.0 through fleet 0.5.1 in rancher 2.7.1…
@jdloft I agree. Previously the capabilities were checked on the upstream cluster instead of the downstream cluster and #1101 turned that into a warning.
However, additionally fleet seems to be using the “default capabilities” from the Helm SDK, which are hardcoded to “1.20”.
Note: we can cherry pick the code from https://github.com/rancher/fleet/pull/985/files
Longhorn 1.4.0 Helm, fleet 0.5.0, downstream cluster version is 1.24.8:
Chart requires kubeVersion: >=1.21.0-0 which is incompatible with Kubernetes v1.20.0
…since they changed kubeVersion:
I suspect it will be fixed with some future rancher 2.7.x release?
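For anyone wondering why a 1.24.8 cluster is reported as v1.20.0: if the kubeVersion constraint is evaluated against a hardcoded default instead of the downstream cluster's version, the check fails regardless of the real cluster. A minimal Go sketch of that comparison follows. It only compares major.minor and is not the semver constraint logic Helm actually uses; parseMajorMinor and checkKubeVersion are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMajorMinor extracts the major and minor numbers from strings
// such as "v1.24.8" or "1.21.0-0".
func parseMajorMinor(v string) (int, int, error) {
	parts := strings.SplitN(strings.TrimPrefix(v, "v"), ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("invalid version %q", v)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, err
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

// checkKubeVersion is a simplified stand-in for a ">=minimum" kubeVersion
// constraint. It ignores patch and pre-release parts.
func checkKubeVersion(minimum, actual string) error {
	minMajor, minMinor, err := parseMajorMinor(minimum)
	if err != nil {
		return err
	}
	major, minor, err := parseMajorMinor(actual)
	if err != nil {
		return err
	}
	if major < minMajor || (major == minMajor && minor < minMinor) {
		return fmt.Errorf("chart requires kubeVersion: >=%s which is incompatible with Kubernetes %s", minimum, actual)
	}
	return nil
}

func main() {
	// Checked against the hardcoded default instead of the real cluster.
	fmt.Println(checkKubeVersion("1.21.0-0", "v1.20.0"))
	// Checked against the actual downstream cluster version.
	fmt.Println(checkKubeVersion("1.21.0-0", "v1.24.8"))
}
```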