argo-cd: Failing to deploy services due to "helm dependency build" failure

Checklist:

  • I’ve searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I’ve included steps to reproduce the bug.
  • I’ve pasted the output of argocd version.

Describe the bug

Some Applications based on Helm fail to deploy due to some kind of internal filesystem issue.

For example, one of the apps that is stuck in an Unknown state:

Name:               redis-rule-hit-count-buffer-test-2
Project:            default
Server:             https://kubernetes.default.svc
Namespace:          test-2
URL:                https://35.232.222.64/applications/redis-rule-hit-count-buffer-test-2
Repo:               git@github.com:gc-org/gc-saas-prod.git
Target:             HEAD
Path:               prod/common/infra/redis
SyncWindow:         Sync Allowed
Sync Policy:        Automated (Prune)
Sync Status:        Unknown
Health Status:      Healthy

CONDITION        MESSAGE                                                                                                                                                                                                                                                  LAST TRANSITION
ComparisonError  rpc error: code = Unknown desc = Manifest generation error (cached): `helm dependency build` failed exit status 1: Error: unable to move current charts to tmp dir: link error: cannot rename charts to tmpcharts: rename charts tmpcharts: file exists  2020-12-22 10:14:56 +0200 IST

This doesn’t eventually resolve itself; it stays this way…

To Reproduce

I’m not sure how to reproduce this; it happens from time to time and causes a complete deadlock.

My Chart.yaml

name: redis
version: 0.1.0
apiVersion: v2
dependencies:
  - name: redis
    version: 11.2.2
    repository: https://charts.bitnami.com/bitnami

My app-of-apps

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: redis-servers-test-2
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: git@github.com:gc-org/gc-saas-prod.git
    targetRevision: HEAD
    path: prod/cluster_1/customers/customer_2/redis
    helm:
      releaseName: redis-servers-test-2
  destination:
    server: https://kubernetes.default.svc
    namespace: test-2
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: true

My template

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: redis-dashboard-top-widgets-cache-test-2
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
  annotations:
    argocd.argoproj.io/sync-wave: "3"
spec:
  project: default
  source:
    repoURL: git@github.com:gc-org/gc-saas-prod.git
    targetRevision: HEAD
    path: prod/common/infra/redis
    helm:
      values: |
        redis:
          fullnameOverride: redis-dashboard-top-widgets-cache
          redisPort: 6413
          master:
            nodeSelector:
              cus: test-2
            service:
              port: 6413
          image:
            tag: 6.0.6
          cluster:
            enabled: false
          existingSecret: "redis"
          existingSecretPasswordKey: redis-password
      releaseName: redis-dashboard-top-widgets-cache
  destination:
    server: https://kubernetes.default.svc
    namespace: test-2
  syncPolicy:
    syncOptions:
    - CreateNamespace=true
    retry:
      limit: -1 # unlimited
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 5m
    automated:
      prune: true
      selfHeal: true
      allowEmpty: true

Expected behavior

The Application should be deployed successfully

Screenshots

Screen Shot 2020-12-22 at 10 37 37

Version

argocd: v1.7.10+bcb05b0.dirty
  BuildDate: 2020-11-21T00:34:29Z
  GitCommit: bcb05b0c2e0f8006aa2d2abaf780e73c9e73c945
  GitTreeState: dirty
  GoVersion: go1.15.5
  Compiler: gc
  Platform: darwin/amd64
argocd-server: v1.8.1+c2547dc
  BuildDate: 2020-12-10T02:59:21Z
  GitCommit: c2547dca95437fdbb4d1e984b0592e6b9110d37f
  GitTreeState: clean
  GoVersion: go1.14.12
  Compiler: gc
  Platform: linux/amd64
  Ksonnet Version: v0.13.1
  Kustomize Version: v3.8.1 2020-07-16T00:58:46Z
  Helm Version: v3.4.1+gc4e7485
  Kubectl Version: v1.17.8

Logs

Logs from the argocd-application-controller

time="2020-12-22T08:14:58Z" level=info msg="Normalized app spec: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2020-12-22T08:14:56Z\",\"message\":\"rpc error: code = Unknown desc = Manifest generation error (cached): `helm dependency build` failed exit status 1: Error: unable to move current charts to tmp dir: link error: cannot rename charts to tmpcharts: rename charts tmpcharts: file exists\",\"type\":\"ComparisonError\"}]}}" application=redis-rule-hit-count-buffer-test-2

Logs from argocd-repo-server

time="2020-12-22T08:40:44Z" level=error msg="finished unary call with code Unknown" error="Manifest generation error (cached): `helm dependency build` failed exit status 1: Error: unable to move current charts to tmp dir: link error: cannot rename charts to tmpcharts: rename charts tmpcharts: file exists" grpc.code=Unknown grpc.method=GenerateManifest grpc.request.deadline="2020-12-22T08:41:43Z" grpc.service=repository.RepoServerService grpc.start_time="2020-12-22T08:40:44Z" grpc.time_ms=474.433 span.kind=server system=grpc
time="2020-12-22T08:40:44Z" level=info msg="manifest cache hit: &ApplicationSource{RepoURL:git@github.com:gc-org/gc-saas-prod.git,Path:prod/common/infra/redis,TargetRevision:HEAD,Helm:&ApplicationSourceHelm{ValueFiles:[],Parameters:[]HelmParameter{},ReleaseName:,Values:redis:\n  redisPort: 6380\n  master:\n    service:\n      port: 6380\n  image:\n    tag: 6.0.6\n  nodeSelector:\n    cus: test-1\n  cluster:\n    enabled: false\n  existingSecret: \"redis\"\n  existingSecretPasswordKey: redis-password\n,FileParameters:[]HelmFileParameter{},Version:,},Kustomize:nil,Ksonnet:nil,Directory:nil,Plugin:nil,Chart:,}/cead2aa7818699b6c3ef04fc7e35390ee0fcbee0"

About this issue

  • State: open
  • Created 4 years ago
  • Reactions: 26
  • Comments: 73 (5 by maintainers)

Most upvoted comments

I believe another fix for this may be to update your stable repo in the argocd-cm ConfigMap as follows:

data:
  helm.repositories: |
    - url: https://charts.helm.sh/stable
      name: stable

After this, refresh your Argo applications.
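
For reference, that snippet goes under data in the argocd-cm ConfigMap. A minimal sketch of the full object, assuming a default install where the ConfigMap is named argocd-cm in the argocd namespace:

# Sketch only: merge the data key into your existing argocd-cm rather than replacing it.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  helm.repositories: |
    - url: https://charts.helm.sh/stable
      name: stable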

We too are experiencing this quite often. Our workaround is to just run `redis-cli flushall` in the Redis pod.

We finally resolved this by increasing

  • --repo-server-timeout-seconds on the application-controller
  • ARGOCD_EXEC_TIMEOUT on the repo-server
argocd:
  controller:
    extraArgs:
      - --repo-server-timeout-seconds
      - "500" 
  repoServer:
    env:
      - name: "ARGOCD_EXEC_TIMEOUT"
        value: "5m"

Doesn’t make me happy, but until helm dep update is thread-safe (https://github.com/helm/helm/pull/8846#issuecomment-768479847), it’ll have to do.
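
For installs that are not managed through the community Helm chart, the same two settings can be applied directly to the workloads. A rough sketch, assuming the default argocd-application-controller and argocd-repo-server workload names from the upstream manifests (fragments only, not complete specs):

# argocd-application-controller: pass a longer repo-server timeout
      containers:
        - name: argocd-application-controller
          command:
            - argocd-application-controller
            - --repo-server-timeout-seconds
            - "500"
# argocd-repo-server: raise the timeout applied when the repo-server shells out to Helm
      containers:
        - name: argocd-repo-server
          env:
            - name: ARGOCD_EXEC_TIMEOUT
              value: "5m"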

Quick note on this one:

ComparisonError: rpc error: code = DeadlineExceeded desc = context deadline exceeded

That’s a generic error message from the golang context package, and it just means “a timeout happened somewhere.”

We’ve made efforts recently to always wrap all error messages to provide more context. Hopefully in future versions the reason for the timeout will be much more clear.

These are two terrible problems after working with Argo CD for two years, from version 1.7.x ~ 2.3.x, especially in an urgent deploy case:

  • ComparisonError: rpc error: code = Unknown desc = Manifest generation error (cached): helm dependency build failed timeout after 1m30s
  • ComparisonError: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Most of the time I know it is not a performance issue, since I followed everything from the official docs [high availability]; it is just a cache problem. During these two years there has still been only one workaround to fix it: recreate argocd-repo-server and flush all data in the Redis cluster. Thank God!

Seeing as this issue is “resolved” temporarily - by killing repo-server pods, so they get recreated - it’s clearly a caching problem.

repo-server and redis pods …

After this week’s security upgrade of Argo CD to v2.2.5 (from v2.2.2 in my case), the repo-server started to give this “helm dependency build” failure, even though v2.2.2 already includes Helm 3.7. I’ve increased the timeouts and added a 2nd, 3rd and 4th replica, but the repo-server starts to eat all CPU. When rolling back to v2.2.2 this issue disappears. It works for a while, even when manually forcing a sync of 220 apps, but after some hours it starts failing with the “helm dependency build” failure.

I’m running v1.7.6 as is - I’m going to give updating to 1.7.11 the old college try and report back.

I was able to upgrade to 1.7.11 - but it had no effect.

I ended up applying the following configuration through a custom values.yaml file:

argocd:
  server:
    config:
      repositories: |
        [...]
        - type: helm
          name: stable
          url: https://charts.helm.sh/stable
        [...]

After this, I had to `helm template [...] | kubectl apply [...]` for it to kick in, since Argo CD itself was non-functional.

Yes, I’m using the Helm chart to deploy Argo CD. The latest chart I saw is 3.33.5 (5 Feb 2022). Using this chart version and rolling back to v2.2.2 also fixes the issue. The previous chart version I was using was 3.29.5.

Just try out 3.29.5 and specify image tag 2.2.5; it will work fine. We have been trying to identify what in the chart is causing this, and we have some clues, but no confirmation so far.
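
If it helps, a minimal values sketch for that combination, assuming the argo-helm chart’s global.image.tag override is used to pin the image (the v prefix on the tag is my assumption about the published image name, not confirmed in this thread):

# Sketch: pin the Argo CD image while staying on chart version 3.29.5
global:
  image:
    tag: v2.2.5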

The example below, AFAIK, is Helm v3, correct? apiVersion: v2 is for Helm 3, while apiVersion: v1 is for Helm 2 - correct?

apiVersion: v2
name: secrets-app
description: App that contains secrets
type: application
version: 0.1.0

Helm 3.7.1 is merged into Argo CD master and should be part of the next release (2.1.7): https://github.com/argoproj/argo-cd/commit/2770c690a5597fcbab344cd2ad494c918472bdd1

Seeing as this issue is “resolved” temporarily - by killing repo-server pods, so they get recreated - it’s clearly a caching problem. My guess is that the repo-server runs Helm in parallel on the SAME pod - which Helm does not actually support - and hence once you hit 2 simultaneous Helm runs on the same pod, Helm breaks and never recovers from that (due to caching). A simple “lock” around Helm - ensuring it is not run simultaneously - should confirm this as the issue (and resolve it) 😃

Just hit this in v2.0.1 as well… when doing a recovery test… so syncing many (10+) applications (all Helm applications) reproduces it.

@gzur unfortunately I am not certain. If you are using Helm 2 and have suddenly started seeing apps in an Unknown state with a `helm dependency build` failure, then it may be worth attempting the solution I proposed. It worked for us on Argo CD v1.6.1 and it’s only a minor change.

Might be because of this: https://helm.sh/blog/new-location-stable-incubator-charts/ - the address of the stable repo has changed.

@alexmt maybe it would be good to have a 1.7.12 with Helm 2.17 included, which has the new address of stable updated in the binary; otherwise, upgrading to 1.8 is mandatory.

@zonnie this is related to the fact that Helm recently changed the location of the stable repo, as mentioned by @lcostea above. I believe the Helm 2 binary is now failing to resolve/access the old stable repo. By updating the ConfigMap, you override the stable repo to point at the new location. We were having the same issue in our cluster. We made this change and noted that it solved the issue when we refreshed our apps.

Thanks @JasP19, I will give it a shot as soon as I can.