argo-cd: waiting for completion of hook and hook never succeeds

Hi,

We are seeing this issue quite often: an app sync gets stuck in "waiting for completion of hook" and the hook never completes.

As you can see below, the application got stuck on the Secret creation phase, and somehow that Secret never got created.

(screenshot: the application sync stuck waiting on the Secret hook)

I have stripped out all unnecessary details. This is how the Secret is created and used by the Job:

apiVersion: v1
kind: Secret
metadata:
  name: {{ include "xxx.fullname" . }}-migrations-{{ .Chart.AppVersion }}
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
    helm.sh/hook-weight: "-5"
type: Opaque
data:
  xxxx

apiVersion: batch/v1
kind: Job
metadata:
  # name stripped out
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: before-hook-creation
    helm.sh/hook-weight: "-4"
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      # containers stripped out
      volumes:
        - name: app-settings
          configMap:
            name: {{ include "xxx.fullname" . }}-migrations-{{ .Chart.AppVersion }}
        - name: app-secrets
          secret:
            secretName: {{ include "xxx.fullname" . }}-migrations-{{ .Chart.AppVersion }}

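Note that Argo CD translates these Helm hook annotations into its own hook phases (per the Argo CD Helm docs): pre-install and pre-upgrade map to PreSync, and the delete policies map to BeforeHookCreation and HookSucceeded. So the Secret above is effectively treated as if it carried the native annotations sketched below, which is why the controller tracks it as a PreSync hook:

metadata:
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation,HookSucceeded
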
kubectl -n argocd logs argocd-server-768f46f469-j98h6 | grep xxx-migrations        - No matching logs
kubectl -n argocd logs argocd-repo-server-57bdbf899c-9lxhr | grep xxx-migrations   - No matching logs
kubectl -n argocd logs argocd-repo-server-57bdbf899c-7xvs7 | grep xxx-migrations   - No matching logs
kubectl -n argocd logs argocd-server-768f46f469-tqp8p | grep xxx-migrations        - No matching logs

[testadmin@server0 ~]$ kubectl -n argocd logs argocd-application-controller-0 | grep orchestrator-migrations
time="2021-08-02T02:16:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:16:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:19:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:19:26Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:22:17Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:22:17Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:22:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:25:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:25:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:28:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:28:26Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:31:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:31:26Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
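
A wedged operation like this can be terminated from the CLI and the sync retried (assuming the argocd CLI is logged in and the Application is named xxx), although, as later comments show, this does not always unstick the hook:

argocd app terminate-op xxx
argocd app sync xxx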

Environment:

  • 3-node RKE2 cluster
  • OS: RHEL 8.4
  • K8s set up on Azure VMs

ArgoCD Version: 2.0.1

Please let me know if any other info is required.

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Reactions: 24
  • Comments: 42 (4 by maintainers)

Most upvoted comments

I have the same problem in version 2.2.0.

Hello Argo community 😃

I am fairly familiar with the ArgoCD codebase and API, and I'd happily try to repay you for building such an awesome project by taking a stab at this issue, if there are no objections?

I just figured out what was causing Argo to freeze on the hook. In my case the specific hook had ttlSecondsAfterFinished: 0 defined in the spec. Through Kustomize I removed this field:

# kustomization.yaml
patches:
  - target:
      name: pre-hook
      kind: Job
    path: patches/hook.yaml
# patches/hook.yaml
- path: "/spec/ttlSecondsAfterFinished"
  op: remove

Afterwards the chart finally went through! It’s still a bug that should be addressed, I’m just sharing this for others to work around it.
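
For illustration, a minimal sketch of the pattern described above (the name, image, and command are hypothetical): with ttlSecondsAfterFinished: 0, Kubernetes deletes the Job as soon as it finishes, which is presumably why Argo CD never observes the hook as completed.

apiVersion: batch/v1
kind: Job
metadata:
  name: pre-hook                          # matches the Kustomize target above
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: before-hook-creation
spec:
  ttlSecondsAfterFinished: 0              # finished Job is deleted immediately by the TTL controller
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: busybox                  # hypothetical image/command
          command: ["sh", "-c", "echo migrating"]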

I’m seeing this issue with v2.6.1+3f143c9

We started experiencing this issue after upgrading to 2.3.3. Before that we were on 2.2.3. I am not 100% sure, but I do not recall having any issues with 2.2.3.

I can confirm the error was fixed on 2.0.3. We recently upgraded to 2.3.3 and we are experiencing the error again.

(quoting @boedy's ttlSecondsAfterFinished workaround above)

@boedy You’re a Saint. I’ve been staring at envoyproxy/gateway for two weeks.

Looking at what appears to be the same issue on ArgoCD v2.6.7. I've killed all controller, server, and repo-server pods at the same time, hoping that ArgoCD would start behaving, but to no avail. I believe the reason it started behaving like this in the first place was an ImagePullBackOff on the Job image.

We had to completely exclude all Jobs from Argo CD via the global resource exclusion config: https://argo-cd.readthedocs.io/en/stable/operator-manual/declarative-setup/#resource-exclusioninclusion
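
A minimal sketch of that global exclusion, assuming the standard argocd-cm ConfigMap in the argocd namespace (adjust apiGroups/kinds/clusters as needed):

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.exclusions: |
    - apiGroups:
        - batch
      kinds:
        - Job
      clusters:
        - "*"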

And we migrated all Jobs in the repo to CronJobs with suspend: true (sketched below). BUT fair warning: due to a k8s bug, CronJobs may sometimes be triggered when their spec changes, INCLUDING changing suspend: false to suspend: true - yes, it's stupid like that…

I think it’s this one, but there are others as well… https://github.com/kubernetes/kubernetes/issues/63371
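
A minimal sketch of that suspended-CronJob pattern (the name, image, and command are hypothetical); with suspend: true the schedule never fires, and the job is triggered on demand instead:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: xxx-migrations                    # hypothetical name
spec:
  suspend: true                           # never runs on schedule
  schedule: "0 0 1 1 *"                   # required field, irrelevant while suspended
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: migrations
              image: busybox              # hypothetical image/command
              command: ["sh", "-c", "echo run migrations"]

It can then be run manually with, for example, kubectl create job xxx-migrations-manual --from=cronjob/xxx-migrations.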

Same issue in v1.3.3.7 and also in version v6.9… This issue was opened on Aug 2, 2021, we are now at 2023, please bump this comment via emoji so I can see it in my inbox in 2042

in all seriousness, still happens at version 2.7.7

We also had this issue and it was resolved once we set ARGOCD_CONTROLLER_REPLICAS.

Instructions here: https://argo-cd.readthedocs.io/en/stable/operator-manual/high_availability/#argocd-application-controller

If the controller is managing too many clusters and uses too much memory, you can shard clusters across multiple controller replicas. To enable sharding, increase the number of replicas in the argocd-application-controller StatefulSet and repeat the number of replicas in the ARGOCD_CONTROLLER_REPLICAS environment variable. The strategic merge patch below demonstrates the changes required to configure two controller replicas.
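
The patch the docs refer to looks roughly like this (a sketch reconstructed from the linked HA documentation, for two replicas):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
spec:
  replicas: 2
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            - name: ARGOCD_CONTROLLER_REPLICAS
              value: "2"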

We're seeing a similar issue with the SyncFail hook, which means we can't actually terminate the sync action.

The Job doesn't exist in the target namespace, and we've tried to trick Argo by creating a Job with the same name, namespace, and annotations as we'd expect to see, with a simple echo "done" action, but nothing is helping.

(screenshot: sync stuck waiting on the hook Job)

ArgoCD Version:

{"Version":"v2.3.4+ac8b7df","BuildDate":"2022-05-18T11:41:37Z","GitCommit":"ac8b7df9467ffcc0920b826c62c4b603a7bfed24","GitTreeState":"clean","GoVersion":"go1.17.10","Compiler":"gc","Platform":"linux/amd64","KsonnetVersion":"v0.13.1","KustomizeVersion":"v4.4.1 2021-11-11T23:36:27Z","HelmVersion":"v3.8.0+gd141386","KubectlVersion":"v0.23.1","JsonnetVersion":"v0.18.0"}

@alexmt - We are using the below version of ArgoCD and seeing the same issue with the Contour Helm chart. The Application is waiting for the PreSync Job to complete, whereas on the cluster I can see the Job has completed.

{ "Version": "v2.1.3+d855831", "BuildDate": "2021-09-29T21:51:21Z", "GitCommit": "d855831540e51d8a90b1006d2eb9f49ab1b088af", "GitTreeState": "clean", "GoVersion": "go1.16.5", "Compiler": "gc", "Platform": "linux/amd64", "KsonnetVersion": "v0.13.1", "KustomizeVersion": "v4.2.0 2021-06-30T22:49:26Z", "HelmVersion": "v3.6.0+g7f2df64", "KubectlVersion": "v0.21.0", "JsonnetVersion": "v0.17.0" }

I suspect this is fixed by https://github.com/argoproj/argo-cd/pull/6294. The fix is available in https://github.com/argoproj/argo-cd/releases/tag/v2.0.3. Can you try upgrading, please?