fleet: Helm installation fails because of capability check

Hello,

Trying to install traefik on rancher v2.5.8 (fleet v0.3.5) fails, with the bundle stuck in a Processed state, when the supplied values require a capability check.

The fleet.yaml looks like this:

defaultNamespace: traefik-experimental
helm:
  releaseName: traefik-experimental
  chart: https://internal-registry.example.com/repository/helm-proxy-traefik/traefik-9.18.2.tgz
  values:
    deployment:
      replicas: 2
    podDisruptionBudget:
      enabled: true
      minAvailable: 1
    globalArguments: []
    service:
      type: NodePort
    logs:
      general:
        level: INFO
      access:
        enabled: true
    ingressRoute:
      dashboard:
        annotations:
          kubernetes.io/ingress.class: experimental
    ingressClass:
      enabled: true

The error message when looking at the bundle in the UI is this:

template: traefik/templates/ingressclass.yaml:7:8: executing "traefik/templates/ingressclass.yaml" at <fail "\n\n ERROR: You must have atleast networking.k8s.io/v1beta1 to use ingressClass">: error calling fail: ERROR: You must have atleast networking.k8s.io/v1beta1 to use ingressClass

I have verified that I can install it manually using helm install, so it is not a problem with the version of the underlying Kubernetes cluster.

The problematic part is this:

    ingressClass:
      enabled: true

If I remove that part it works with fleet as well. What actually happens when it is set can be seen here: https://github.com/traefik/traefik-helm-chart/blob/v9.18.2/traefik/templates/ingressclass.yaml:

{{- if and .Values.ingressClass.enabled (semverCompare ">=2.3.0" (default .Chart.AppVersion .Values.image.tag)) -}}
 {{- if .Capabilities.APIVersions.Has "networking.k8s.io/v1/IngressClass" }}
apiVersion: networking.k8s.io/v1
 {{- else if .Capabilities.APIVersions.Has "networking.k8s.io/v1beta1/IngressClass" }}
apiVersion: networking.k8s.io/v1beta1
 {{- else }}
   {{- fail "\n\n ERROR: You must have atleast networking.k8s.io/v1beta1 to use ingressClass" }}
 {{- end }}

Looking at the helm code in fleet I have noticed the use of DefaultCapabilities here: https://github.com/rancher/fleet/blob/d21bb02b29ef15bdc689d65a2b2ac918d0d0cae9/pkg/helmdeployer/template.go#L27

Which, if I am reading the go.mod file of fleet v0.3.5 correctly, results in capability data that, judging by the comment, only includes core v1 (“v1”): https://github.com/rancher/helm/blob/v3.3.3-fleet1/pkg/chartutil/capabilities.go#L27-L41

IngressClass is not part of core v1 (https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.19/#ingressclass-v1-networking-k8s-io) and I am wondering if this might be the reason the check fails.
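
If that is the case, the failing check can be reproduced outside of fleet with a few lines of Go against the Helm SDK. This is a minimal standalone sketch (not fleet code) that simply asks the default capability set the same two questions the chart template asks:

package main

import (
	"fmt"

	"helm.sh/helm/v3/pkg/chartutil"
)

func main() {
	caps := chartutil.DefaultCapabilities

	// The two lookups made by traefik/templates/ingressclass.yaml.
	fmt.Println(caps.APIVersions.Has("networking.k8s.io/v1/IngressClass"))      // false
	fmt.Println(caps.APIVersions.Has("networking.k8s.io/v1beta1/IngressClass")) // false
}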

If I understand it correctly, there is support in the helm library to fetch capabilities when calling Run() at https://github.com/rancher/fleet/blob/d21bb02b29ef15bdc689d65a2b2ac918d0d0cae9/pkg/helmdeployer/deployer.go#L336, but I am not sure whether this is involved at the time the error above is thrown.
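
For comparison, here is a rough sketch of what a cluster-aware capability lookup looks like, roughly what the Helm library does when no Capabilities have been pre-set. It uses the exported action.GetVersionSet helper and a discovery client built from a local kubeconfig; the kubeconfig path and printed values are assumptions about the environment, and this is not fleet code:

package main

import (
	"fmt"

	"helm.sh/helm/v3/pkg/action"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a discovery client from the local kubeconfig (assumed path).
	restCfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(restCfg)
	if err != nil {
		panic(err)
	}

	// action.GetVersionSet asks the cluster which API versions (and, in recent
	// Helm versions, which group/version/kind combinations) it actually serves,
	// instead of relying on hardcoded defaults.
	versions, err := action.GetVersionSet(dc)
	if err != nil {
		panic(err)
	}
	// Both should be true on a cluster that serves networking.k8s.io/v1.
	fmt.Println(versions.Has("networking.k8s.io/v1"))
	fmt.Println(versions.Has("networking.k8s.io/v1/IngressClass"))
}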

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 42 (15 by maintainers)

Most upvoted comments

Digging around some more, I searched the code base for “Processed”, since this is the state the bundle is stuck in, and found a reference in https://github.com/rancher/fleet/blob/v0.3.5/pkg/controllers/bundle/controller.go#L51-L59, which sounds reasonable given that it is a bundle that is stuck.

Looking at the BundleGeneratingHandler h.OnBundleChange led me to https://github.com/rancher/fleet/blob/v0.3.5/pkg/controllers/bundle/controller.go#L118-L135

…where setResourceKey() is called, which ends up calling the Template() function where I found the chartutil.DefaultCapabilities usage above. https://github.com/rancher/fleet/blob/v0.3.5/pkg/controllers/bundle/controller.go#L157

So, inspecting what Template() actually does (https://github.com/rancher/fleet/blob/v0.3.5/pkg/helmdeployer/template.go), it does a few interesting things:

  • Sets useGlobalCfg: true in the fleet custom helm struct.
  • h.globalCfg.Capabilities = chartutil.DefaultCapabilities

It then calls h.Deploy() which eventually calls h.install(), and this in turn creates an actual helm action cfg variable using h.getCfg(namespace, options.ServiceAccount): https://github.com/rancher/fleet/blob/v0.3.5/pkg/helmdeployer/deployer.go#L275

What does getCfg do? Well, it returns the previously configured h.globalCfg if h.useGlobalCfg is true, which was set to true by Template(): https://github.com/rancher/fleet/blob/v0.3.5/pkg/helmdeployer/deployer.go#L246-L248

We then create a helm install object from that cfg: https://github.com/rancher/fleet/blob/v0.3.5/pkg/helmdeployer/deployer.go#L316

… and now looking at the call to caps, err := i.cfg.getCapabilities() in i.Run(): https://github.com/rancher/helm/blob/v3.3.3-fleet1/pkg/action/action.go#L236-L240

… since Capabilities is not nil (it was configured in Template()), it just returns those Capabilities with no further lookups. So now I feel fairly confident that no actual cluster capabilities are ever investigated, and that this is the reason fleet fails. It is of course possible I am missing something; unfortunately, I have not been able to test my theory at this point.
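
To make that short-circuit explicit, here is a toy, runnable restatement of the code path as I read it. The configuration type and getCapabilities method below only mirror the shape of the real fleet/helm code, they are not the actual sources:

package main

import (
	"fmt"

	"helm.sh/helm/v3/pkg/chartutil"
)

// configuration mirrors the relevant part of helm's action.Configuration:
// when Capabilities is already set, no cluster lookup happens.
type configuration struct {
	Capabilities *chartutil.Capabilities
}

func (c *configuration) getCapabilities() *chartutil.Capabilities {
	if c.Capabilities != nil {
		return c.Capabilities // the early return that skips discovery entirely
	}
	// ... the real code would build a discovery client and query the cluster here ...
	return chartutil.DefaultCapabilities
}

func main() {
	// Template() effectively does this to the global config it later reuses:
	cfg := &configuration{Capabilities: chartutil.DefaultCapabilities}

	caps := cfg.getCapabilities()
	// The chart's capability check is then evaluated against the defaults only.
	fmt.Println(caps.APIVersions.Has("networking.k8s.io/v1/IngressClass")) // false
}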

Thanks for the replies.

I updated my fleet helm chart from 0.5.1 to 0.6.0 (that updated my fleet-controller as well; previously I had only updated the fleet-agents) and now it works as expected.

helm -n cattle-fleet-system upgrade --create-namespace --wait --version 102.0.0+up0.6.0 fleet-crd https://github.com/rancher/fleet/releases/download/v0.6.0/fleet-crd-0.6.0.tgz

helm -n cattle-fleet-system upgrade --create-namespace --wait --version 102.0.0+up0.6.0 fleet https://github.com/rancher/fleet/releases/download/v0.6.0/fleet-0.6.0.tgz

And my fleet-controller configmap looks like this:

{
  "systemDefaultRegistry": "",
  "agentImage": "rancher/fleet-agent:v0.7.0-AGENT-rc.1",
  "agentImagePullPolicy": "IfNotPresent",
  "apiServerURL": "https://example.com",
  "apiServerCA": "",
  "agentCheckinInterval": "15m",
  "ignoreClusterRegistrationLabels": false,
  "bootstrap": {
    "paths": "",
    "repo": "",
    "secret": "",
    "branch":  "master",
    "namespace": "fleet-local",
    "agentNamespace": "cattle-fleet-local-system",
  },
  "webhookReceiverURL": "",
  "githubURLPrefix": ""
}

@manno where exactly did #1287 land ? In 0.5.x or in 0.6.x ? Or both ?

I accidentally set the milestone to 2023-Q2-v2.6x instead of 2023-Q2-v2.7x. Fixed.

@manno the fleet-agents are using 0.7.0-rc1; take a look at the configmap I provided.

I can confirm as well that installing cert-manager v1.11.0 with rancher/fleet-agent:v0.7.0-AGENT-rc.1 is working as expected

Thanks @janosmiko for looking into this. I will double-check with the team to understand why the issue still occurs.

Hit the same issue with cert-manager 1.11.0 through fleet 0.5.1 in rancher 2.7.1…

@jdloft I agree. Previously the capabilities were checked on the upstream cluster instead of the downstream cluster and #1101 turned that into a warning.

However, fleet additionally seems to be using the “default capabilities” from the Helm SDK, which are hardcoded to “1.20”.
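
For reference, a standalone sketch (not fleet code) that prints what that hardcoded default actually is for whatever helm.sh/helm/v3 release is vendored:

package main

import (
	"fmt"

	"helm.sh/helm/v3/pkg/chartutil"
)

func main() {
	// Prints the hardcoded default KubeVersion of the vendored Helm SDK,
	// e.g. "v1.20.0" for the releases discussed in this issue.
	fmt.Println(chartutil.DefaultCapabilities.KubeVersion.Version)
}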

Note: we can cherry pick the code from https://github.com/rancher/fleet/pull/985/files

I believe this is fixed by #1101

Longhorn 1.4.0 Helm, fleet 0.5.0, downstream cluster version is 1.24.8:

Chart requires kubeVersion: >=1.21.0-0 which is incompatible with Kubernetes v1.20.0

…since they changed kubeVersion:

[screenshot: chart diff showing the kubeVersion change]

I suspect it will be fixed with some future rancher 2.7.x release?
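
Assuming it really is the hardcoded default KubeVersion that the chart is checked against, the incompatibility is easy to reproduce with the same semver library Helm uses (github.com/Masterminds/semver/v3); this is only an illustrative sketch, not fleet or Helm code:

package main

import (
	"fmt"

	"github.com/Masterminds/semver/v3"
)

func main() {
	// Longhorn 1.4.0's kubeVersion constraint from Chart.yaml.
	constraint, err := semver.NewConstraint(">=1.21.0-0")
	if err != nil {
		panic(err)
	}
	// The Helm SDK's hardcoded default KubeVersion, without the "v" prefix.
	version, err := semver.NewVersion("1.20.0")
	if err != nil {
		panic(err)
	}
	fmt.Println(constraint.Check(version)) // false -> "incompatible with Kubernetes v1.20.0"
}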