argo-cd: Certificate resources from cert-manager v1 fail / syncing becomes stuck

If you are trying to resolve an environment-specific issue, or have a one-off question about an edge case that does not require a feature, please consider asking in the Argo CD Slack channel.

Checklist:

  • I’ve searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I’ve included steps to reproduce the bug.
  • I’ve pasted the output of argocd version.

Describe the bug

When creating a Certificate from cert-manager (v1) and using a sync-wave or hook, the sync immediately fails with Issuing certificate as Secret does not exist. This causes Argo CD to retry the sync 7 hours later (?):

Running a few seconds ago (Sat Sep 05 2020 03:30:53 GMT-0700)
one or more synchronization tasks completed unsuccessfully. Retrying attempt #1 at 10:28AM.

We’re not able to resync when using PreSync. When using sync-wave, terminating the sync and syncing again works.

Do we need to create a custom healthcheck or something?
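For reference, the retry schedule seen above can be tuned on the Application itself. A minimal sketch, assuming the syncPolicy.retry fields of the Application spec (the application name and backoff values here are illustrative):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app  # hypothetical name
spec:
  syncPolicy:
    automated: {}
    retry:
      limit: 5            # stop retrying after 5 failed attempts
      backoff:
        duration: 30s     # wait 30s before the first retry
        factor: 2         # double the delay after each attempt
        maxDuration: 10m  # never wait longer than 10 minutes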

To Reproduce

Create a Certificate with the hook or sync-wave annotation:

# fails immediately; unable to sync even when retrying
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  annotations:
    argocd.argoproj.io/hook: PreSync
...

# sync remains stuck
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "-1"
...

Expected behavior

Argo CD should ignore the error that the secretName Secret does not exist yet and continue syncing when using hook or sync-wave annotations.

Version

argocd: v1.7.4+f8cbd6b
  BuildDate: 2020-09-05T02:46:53Z
  GitCommit: f8cbd6bf432327cc3b0f70d23b66511bb906a178
  GitTreeState: clean
  GoVersion: go1.14.1
  Compiler: gc
  Platform: darwin/amd64
argocd-server: v1.7.4+f8cbd6b
  BuildDate: 2020-09-05T02:45:44Z
  GitCommit: f8cbd6bf432327cc3b0f70d23b66511bb906a178
  GitTreeState: clean
  GoVersion: go1.14.1
  Compiler: gc
  Platform: linux/amd64
  Ksonnet Version: v0.13.1
  Kustomize Version: {Version:kustomize/v3.6.1 GitCommit:c97fa946d576eb6ed559f17f2ac43b3b5a8d5dbd BuildDate:2020-05-27T20:47:35Z GoOs:linux GoArch:amd64}
  Helm Version: version.BuildInfo{Version:"v3.3.1", GitCommit:"249e5215cde0c3fa72e27eb7a30e8d55c9696144", GitTreeState:"clean", GoVersion:"go1.14.7"}
  Kubectl Version: v1.17.8

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 19 (1 by maintainers)

Most upvoted comments

Guys, I use this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/name: argocd-cm
    app.kubernetes.io/part-of: argocd
data:
  resource.customizations.health.cert-manager.io_Certificate: |
    hs = {}
    if obj.status ~= nil then
      if obj.status.conditions ~= nil then
        for i, condition in ipairs(obj.status.conditions) do
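          -- a Ready=False condition with reason "DoesNotExist" (secret not issued yet) matches neither branch and falls through to Progressing below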
          if condition.type == "Ready" and condition.status == "False" and condition.reason ~= "DoesNotExist" then
            hs.status = "Degraded"
            hs.message = condition.message
            return hs
          end
          if condition.type == "Ready" and condition.status == "True" and condition.reason ~= "DoesNotExist" then
            hs.status = "Healthy"
            hs.message = condition.message
            return hs
          end
        end
      end
    end
    hs.status = "Progressing"
    hs.message = "Waiting for certificate"
    return hs
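For context, while the Secret is still being issued, the Certificate status this script evaluates looks roughly like the following (a hypothetical excerpt based on the cert-manager v1 API; the "DoesNotExist" reason is exactly what the script treats as still progressing):

status:
  conditions:
    - type: Ready
      status: "False"
      reason: DoesNotExist
      message: Issuing certificate as Secret does not exist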

but the Certificate still shows as OutOfSync:

Name:               devops
Project:            chania-project
Server:             https://kubernetes.default.svc
Namespace:          devops
URL:                http://localhost:34295/applications/devops
Repo:               https://github.com/hlacik/gitops-chania.git
Target:             chania-2
Path:               devops
SyncWindow:         Sync Allowed
Sync Policy:        Automated (Prune)
Sync Status:        OutOfSync from chania-2 (45ddafb)
Health Status:      Healthy

GROUP                KIND           NAMESPACE     NAME                 STATUS     HEALTH   HOOK  MESSAGE
                     Secret         cert-manager  vault-approle        Synced                    
                     Secret         devops        registry-pcr-docker  Synced                    
                     Service        kube-system   traefik-apn          Synced     Healthy        
cert-manager.io      Certificate    devops        carpc-tls            OutOfSync  Healthy        
cert-manager.io      ClusterIssuer                vault-issuer         Synced                    
traefik.containo.us  TLSOption      devops        default              Synced                    
traefik.containo.us  TLSStore       devops        default              Synced      

The Certificate is OutOfSync, using the latest cert-manager v1.10.0.

@hlacikd @CryptoTr4der @L-U-C-K-Y I had a thrashing issue, and I solved it with the below; try it and let me know.

Under configs.cm.resource.customizations, set the value:

cert-manager.io/Certificate:
  health.lua: |
    hs = {}
    if obj.status ~= nil then
      if obj.status.conditions ~= nil then
        for i, condition in ipairs(obj.status.conditions) do
          if condition.type == "Ready" and condition.status == "False" then
            hs.status = "Degraded"
            hs.message = condition.message
            return hs
          end
          if condition.type == "Ready" and condition.status == "True" then
            hs.status = "Healthy"
            hs.message = condition.message
            return hs
          end
        end
      end
    end

    hs.status = "Progressing"
    hs.message = "Waiting for certificate"
    return hs
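In the argo-cd Helm chart, this lands in values.yaml roughly as below (a sketch, assuming the chart's configs.cm block is passed through into the argocd-cm ConfigMap):

configs:
  cm:
    resource.customizations: |
      cert-manager.io/Certificate:
        health.lua: |
          -- paste the health.lua script shown above here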

I was getting the same thing, can confirm this worked for me, Argo CD v2.7.4, cert-manager v1.12.2

Experiencing the same thing as @hlacikd

@aliusmiles sad to see you gone. @jessesuen would you be able to help? Any help will be much appreciated.

Argo version: "1.8.7" (tried both). My Certificate config has Helm hooks configured:

  annotations:
    helm.sh/hook: pre-install
    helm.sh/hook-weight: '-5'

Argo CD overrides these with its own sync hooks, as per the documentation. Provisioning fails with the error "Issuing certificate as Secret does not exist". All the other Deployment resources are blocked on this, since Certificate provisioning ends up Failed or Degraded.
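For reference, Argo CD's documented Helm hook mapping treats helm.sh/hook: pre-install as a PreSync hook and helm.sh/hook-weight as a sync wave, so the annotations above behave roughly as if they were:

metadata:
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "-5"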

Certificate config

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  annotations:
    helm.sh/hook: pre-install
    helm.sh/hook-weight: '-5'
  labels:
    app.kubernetes.io/instance: storage
  name: storage-cert
  namespace: atom-storage
spec:
  commonName: storage.atom-storage
  issuerRef:
    kind: ClusterIssuer
    name: commoncaissuer
  privateKey:
    algorithm: RSA
    encoding: PKCS1
    rotationPolicy: Always
    size: 2048
  renewBefore: 360h0m0s
  secretName: storage-atom-storage-tls
  subject:
    organizations:
      - xxxxx

Deployment config

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: storage
  name: storage
  namespace: atom-storage
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: storage
      app.kubernetes.io/name: storage
  template:
    metadata:
      labels:
        app.kubernetes.io/instance: storage
        app.kubernetes.io/name: storage
    spec:
      containers:
        - args:
            - '--config=/etc/storage/config/config.json'
            - '--level=info'
          image: >-
            storage:1.2.0-g1d2e72a864
          imagePullPolicy: IfNotPresent
          name: storage
          ports:
            - containerPort: 8080
              name: https
              protocol: TCP
          volumeMounts:
            - mountPath: /etc/storage/config/certs
              name: tls-certs
            - mountPath: /etc/storage/config
              name: config
      volumes:
        - name: tls-certs
          secret:
            secretName: storage-atom-storage-tls
        - configMap:
            name: storage-config
          name: config

Health check applied for certificate

  resource.customizations: |
    cert-manager.io/Certificate:
      health.lua: |
        hs = {}
        if obj.status ~= nil then
          if obj.status.conditions ~= nil then
            for i, condition in ipairs(obj.status.conditions) do
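              -- note: the extra namespace guard below means Certificates in "external-dns" are never marked Degraded here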
              if condition.type == "Ready" and condition.status == "False" and obj.metadata.namespace ~= "external-dns" then
                hs.status = "Degraded"
                hs.message = condition.message
                return hs
              end
              if condition.type == "Ready" and condition.status == "True" then
                hs.status = "Healthy"
                hs.message = condition.message
                return hs
              end
            end
          end
        end

        hs.status = "Progressing"
        hs.message = "Waiting for certificate"
        return hs

I also noticed a similar error discussed in issue #1826. It seems to have been marked as fixed, but I still see the issue. Am I missing something?

The issue was resolved by moving to Argo CD 2.0.

Seems to work for me with this: https://argoproj.github.io/argo-cd/operator-manual/health/#way-1-define-a-custom-health-check-in-argocd-cm-configmap, but I need to do some deployments to confirm; not sure if the custom health check helped or me terminating the sync.