helm: Helm is inconsistent about failed deployments

This is basically a generalized version of bug #2257, possibly of other filed issues, and of a failure mode that I encountered today.

It seems that Helm does not care when a deployment fails for some unknown class of reasons. Helm will report the failure but then happily "re-deploy" the same chart as a false success, without changing the actual cluster state while still registering a new revision in a history config map.

There are two ways to put Helm into this inconsistent state:

  1. #2257
     1a. Install a chart that contains a property not yet known to Helm, e.g. envFrom before Helm v2.4.0: a genuine success.
     1b. Upgrade the release with a change in its chart that may or may not be related to that property, thus triggering the patch code path: failure.
     1c. Upgrade the release again without any change: the cluster is in an undefined state and Helm lies about it. 😦

  2. 2a. Install a chart that is both valid and fully supported by Helm. My case includes a ClusterRoleBinding: deployment is successful, obviously.
     2b. Change the ClusterRoleBinding to use an invalid subject, e.g. {kind: ServiceAccount} with an empty subjects.name. Deployment fails, as it should.
     2c. Deploy the same chart again without any changes since 2b. This deployment falsely succeeds and Helm ends up in the same inconsistent state as in case 1. (A rough reproduction of this case is sketched below.)
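
For concreteness, a minimal shell session reproducing the second case might look like the sketch below (the chart path ./mychart and release name test are made up for illustration; step 2b assumes the chart's ClusterRoleBinding has been edited to carry a ServiceAccount subject with an empty name):

    # 2a. Install a valid chart; this deploys successfully.
    helm install --name test ./mychart

    # 2b. Break the chart's ClusterRoleBinding (ServiceAccount subject with an
    #     empty name) and upgrade. The API server rejects the change and the
    #     upgrade fails, as expected.
    helm upgrade test ./mychart

    # 2c. Upgrade again without touching the chart. Helm reports success even
    #     though nothing on the cluster has changed.
    helm upgrade test ./mychart

    # The history now contains a new DEPLOYED revision that was never applied.
    helm history test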

/cc @tback @jayme-github

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 7
  • Comments: 21 (9 by maintainers)

Most upvoted comments

@technosophos this seems to repeatably demonstrate the out-of-sync issue. https://github.com/rsanders/helm-2437-test-case

Okay. Thanks. I’m currently looking into the following two options:

  1. If a release fails, don’t allow an upgrade without first doing a rollback to a known-good version
  2. If a validation fails, see if we can avoid storing the new manifest (e.g. force a diff against the last known valid version). This one is harder than it sounds, and may just introduce another set of bugs.

And perhaps there are some other options that I haven’t thought of.
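
For illustration, option 1 roughly corresponds to forcing something like the following manual workflow today (a sketch; the release name and revision number are placeholders):

    # Inspect the history; suppose revision 46 was the last genuinely good one
    # and later revisions are the failed / falsely-successful ones.
    helm history RELEASENAME

    # Roll back to the known-good revision first...
    helm rollback RELEASENAME 46

    # ...and only then attempt the upgrade again with the corrected chart.
    helm upgrade RELEASENAME ./chart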

We are also getting hit by something similar to this. Our helm upgrade fails with an error like this:

    Deployment.extensions "efritin-efritin-image-service-thrift" is invalid: spec.template.metadata.labels: Invalid value: {"app":"efritin-efritin","chart":"efritin-0.1.0-rc6","component":"image_service","release":"efritin","source":"kviberg","stage":"production","tier":"thrift"}: `selector` does not match template `labels`

Running helm upgrade again with the same chart and release "succeeds", but nothing changes on the K8s resources themselves. In our case all subsequent upgrades are reported incorrectly, so once an upgrade fails it seems the release can never be upgraded correctly again.
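
One way to confirm the drift in a case like this is to compare what Tiller believes is deployed with what is actually running. A sketch, using the release and deployment names from the error above and assuming the right kubectl context and namespace:

    # The manifest Helm/Tiller recorded as the current state of the release.
    helm get manifest efritin > /tmp/helm-manifest.yaml

    # The object that is actually running in the cluster.
    kubectl get deployment efritin-efritin-image-service-thrift -o yaml > /tmp/live-deployment.yaml

    # After a falsely "successful" upgrade, fields such as labels, image and
    # env in the stored manifest no longer match the live object.
    diff /tmp/helm-manifest.yaml /tmp/live-deployment.yaml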

IIUC, in some variant of 2b we see Helm/Tiller get out of sync with the actual K8s state. It's not envFrom, and it's not silently immutable fields in a Deployment or DaemonSet. It has occurred in several different cases where the K8s API server rejects a patch due to failed validation. Helm seems to barrel on, assuming that the patch succeeded and recording the desired state as the actual state.

For example, from the Tiller log:

    2017/05/16 01:22:37 client.go:251: error updating the resource "service-proxy":
         Deployment.apps "service-proxy" is invalid: spec.template.spec.containers[0].ports[1].protocol: Unsupported value: "HTTP": supported values: TCP, UDP
    ...
    2017/05/16 01:22:38 release_server.go:329: warning: Upgrade "RELEASENAME" failed: Deployment.apps "service-proxy" is invalid: spec.template.spec.containers[0].ports[1].protocol: Unsupported value: "HTTP": supported values: TCP, UDP
    2017/05/16 01:22:38 storage.go:59: Updating "RELEASENAME" (v47) in storage
    2017/05/16 01:22:38 storage.go:51: Create release "RELEASENAME" (v48) in storage

After that, some other changes (in this case, I believe, to the "image" and "env" of one of the containers) were never applied to the cluster, but were apparently recorded as successful. Future upgrades after correcting the protocol to 'TCP' did not update the image and env contents to the new values.

I don’t have a minimal test case put together yet, but I do have the full Tiller log and the “helm get” of the Tiller release configmaps from before and after the failure.
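
For anyone gathering the same evidence, something like the following pulls out the relevant records (a sketch, assuming Tiller's default configmap storage backend in kube-system, its OWNER/NAME labels, and that helm get accepts --revision as in Helm v2):

    # One configmap per stored revision of the release.
    kubectl get configmap -n kube-system -l "OWNER=TILLER,NAME=RELEASENAME"

    # The manifests Tiller recorded for the revision just before the failure
    # (v47) and for the falsely-successful one (v48).
    helm get manifest RELEASENAME --revision 47
    helm get manifest RELEASENAME --revision 48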

This occurred with Tiller v2.4.1, Helm client v2.3.1 on K8S 1.6.3.