strimzi-kafka-operator: [Bug] unexpected removal of all Kafka resources when upgrading using Helm 3

Describe the bug I’m using Strimzi operator v0.19.0 and tried to upgrade to 0.20.0. When I ran the helm upgrade procedure, all my resources (users, topics, clusters) were removed. I tried to reproduce the problem with a freshly installed cluster and the situation occurred again.

To Reproduce Steps to reproduce the behavior:

1. helm install strimzi-kafka strimzi/strimzi-kafka-operator --namespace kafka --set watchNamespaces="{kafka,test-kafka}" --version=0.19.0
2. create cluster, users and topics from manifests (apiVersion: v1beta1)
3. helm upgrade strimzi-kafka strimzi/strimzi-kafka-operator --namespace kafka --set watchNamespaces="{kafka,test-kafka}"
kubectl get crd | grep kafka | wc -l
0
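
If you try to reproduce this, it may be worth dumping the custom resources before step 3 so they can be re-applied afterwards (a sketch, assuming they were created in the two watched namespaces):

# Back up the Strimzi custom resources from both watched namespaces.
kubectl get kafkas,kafkatopics,kafkausers -n kafka -o yaml > kafka-backup.yaml
kubectl get kafkas,kafkatopics,kafkausers -n test-kafka -o yaml > test-kafka-backup.yaml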

After the steps above, my cluster and users/topics were removed. The operator pod tries to start and crashes with the following error:

2020-10-26 14:35:47 WARN  WatchConnectionManager:198 - Exec Failure: HTTP 404, Status: 404 - 404 page not found

java.net.ProtocolException: Expected HTTP 101 response but was '404 Not Found'
	at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229) [com.squareup.okhttp3.okhttp-3.12.6.jar:?]
	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196) [com.squareup.okhttp3.okhttp-3.12.6.jar:?]
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203) [com.squareup.okhttp3.okhttp-3.12.6.jar:?]
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [com.squareup.okhttp3.okhttp-3.12.6.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
2020-10-26 14:35:47 WARN  WatchConnectionManager:198 - Exec Failure: HTTP 404, Status: 404 - 404 page not found

java.net.ProtocolException: Expected HTTP 101 response but was '404 Not Found'
	at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229) [com.squareup.okhttp3.okhttp-3.12.6.jar:?]
	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196) [com.squareup.okhttp3.okhttp-3.12.6.jar:?]
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203) [com.squareup.okhttp3.okhttp-3.12.6.jar:?]
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [com.squareup.okhttp3.okhttp-3.12.6.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]

Expected behavior The operator should be updated without removing resources.

Environment (please complete the following information):

  • Strimzi version: 0.19.0
  • Installation method: Helm chart
  • Kubernetes cluster: v1.18.8
  • Infrastructure: Rancher2 on Amazon EC2 instances

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 4
  • Comments: 33 (10 by maintainers)

Most upvoted comments

A quick workaround we found with our team (commands sketched below):

  1. Back up and delete all secrets related to the Strimzi Helm release sh.helm.release.v1.strimzi
  2. Redeploy with version 0.20.0; the CRDs will not be deleted and the version upgrade completes successfully
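
A minimal sketch of step 1, assuming the release is called strimzi-kafka and lives in the kafka namespace (Helm 3 labels its release secrets with owner=helm and name=<release>):

# Back up the Helm release secrets, then delete them so Helm forgets the old manifest that still contained the CRDs.
kubectl get secrets -n kafka -l owner=helm,name=strimzi-kafka -o yaml > strimzi-helm-release-secrets.yaml
kubectl delete secrets -n kafka -l owner=helm,name=strimzi-kafka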

There is also the option to edit the data in the Helm release secret instead of deleting it:

# Helm 3 stores the release manifest gzip-compressed and base64-encoded inside the secret, and kubectl's JSON output base64-encodes it again, hence the double decode.
kubectl get secrets -n NAMESPACE sh.helm.release.v1.DEPLOYNAME -o json | jq .data.release -r | base64 --decode | base64 --decode | gunzip - > /var/tmp/manifest.json

Then remove the CRD data inside the templates and manifest sections of /var/tmp/manifest.json and upload the secret again:

# Re-encode the edited manifest. With GNU coreutils, -w0 stops base64 from wrapping lines; otherwise $DATA would contain newlines and break the JSON patch (BSD/macOS base64 does not wrap by default).
DATA=`cat /var/tmp/manifest.json | gzip -c | base64 -w0 | base64 -w0`
kubectl patch secret -n NAMESPACE sh.helm.release.v1.DEPLOYNAME --type='json' -p="[ {\"op\":\"replace\",\"path\":\"/data/release\",\"value\":\"$DATA\"}]"

Then the upgrade to 0.20.0 will leave the CRDs alone…
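
As a sanity check before upgrading (assuming the release is named strimzi-kafka in the kafka namespace), you can confirm the manifest Helm has stored no longer references the CRDs:

# Should print 0 once the CRD entries have been stripped from the release secret.
helm get manifest strimzi-kafka -n kafka | grep -c 'kind: CustomResourceDefinition'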

Yes, that’s the problem with removing the CRDs from the chart’s YAML templates. Helm no longer controls what to do with them.
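
A general Helm feature that should also protect against this (my assumption, not something the chart configures): resources annotated with helm.sh/resource-policy: keep are skipped when a Helm operation would otherwise delete them. Annotating the CRDs before upgrading might look like this:

# Mark every Strimzi CRD so Helm keeps it even if it disappears from the chart templates.
for crd in $(kubectl get crd -o name | grep kafka.strimzi.io); do
  kubectl annotate "$crd" helm.sh/resource-policy=keep --overwrite
done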

I’ve just hit the same issue on a k3s single-node deployment. The CRDs seem to be managed by the Helm chart, so I guess there’s something off there. Below are the upgrade logs from the helm-operator:

ts=2020-10-27T13:32:19.36621305Z caller=helm.go:69 component=helm version=v3 info="performing update for strimzi-kafka-operator" targetNamespace=strimzi release=strimzi-kafka-operator
ts=2020-10-27T13:32:19.614669544Z caller=helm.go:69 component=helm version=v3 info="dry run for strimzi-kafka-operator" targetNamespace=strimzi release=strimzi-kafka-operator
ts=2020-10-27T13:32:20.324093381Z caller=helm.go:69 component=helm version=v3 info="performing update for cert-manager" targetNamespace=cert-manager release=cert-manager
ts=2020-10-27T13:32:20.649044703Z caller=helm.go:69 component=helm version=v3 info="dry run for cert-manager" targetNamespace=cert-manager release=cert-manager
ts=2020-10-27T13:32:22.166658114Z caller=release.go:311 component=release release=strimzi-kafka-operator targetNamespace=strimzi resource=strimzi:helmrelease/strimzi-kafka-operator helmVersion=v3 info="no changes" phase=dry-run-compare
ts=2020-10-27T13:32:27.105220407Z caller=release.go:311 component=release release=cert-manager targetNamespace=cert-manager resource=cert-manager:helmrelease/cert-manager helmVersion=v3 info="no changes" phase=dry-run-compare
ts=2020-10-27T13:32:28.273798311Z caller=release.go:311 component=release release=prometheus-operator targetNamespace=monitoring resource=monitoring:helmrelease/prometheus-operator helmVersion=v3 info="no changes" phase=dry-run-compare
ts=2020-10-27T13:35:08.512402419Z caller=release.go:79 component=release release=strimzi-kafka-operator targetNamespace=strimzi resource=strimzi:helmrelease/strimzi-kafka-operator helmVersion=v3 info="starting sync run"
ts=2020-10-27T13:35:11.184987118Z caller=release.go:353 component=release release=strimzi-kafka-operator targetNamespace=strimzi resource=strimzi:helmrelease/strimzi-kafka-operator helmVersion=v3 info="running upgrade" action=upgrade
ts=2020-10-27T13:35:11.218170987Z caller=helm.go:69 component=helm version=v3 info="preparing upgrade for strimzi-kafka-operator" targetNamespace=strimzi release=strimzi-kafka-operator
ts=2020-10-27T13:35:11.265187003Z caller=helm.go:69 component=helm version=v3 info="resetting values to the chart's original version" targetNamespace=strimzi release=strimzi-kafka-operator
ts=2020-10-27T13:35:11.549234831Z caller=helm.go:69 component=helm version=v3 info="performing update for strimzi-kafka-operator" targetNamespace=strimzi release=strimzi-kafka-operator
ts=2020-10-27T13:35:11.677293827Z caller=helm.go:69 component=helm version=v3 info="creating upgraded release for strimzi-kafka-operator" targetNamespace=strimzi release=strimzi-kafka-operator
ts=2020-10-27T13:35:11.873954568Z caller=helm.go:69 component=helm version=v3 info="checking 13 resources for changes" targetNamespace=strimzi release=strimzi-kafka-operator
ts=2020-10-27T13:35:11.889156124Z caller=helm.go:69 component=helm version=v3 info="Created a new ConfigMap called \"strimzi-cluster-operator\" in strimzi\n" targetNamespace=strimzi release=strimzi-kafka-operator
ts=2020-10-27T13:35:11.970217468Z caller=helm.go:69 component=helm version=v3 info="Deleting \"kafkas.kafka.strimzi.io\" in ..." targetNamespace=strimzi release=strimzi-kafka-operator
ts=2020-10-27T13:35:12.076143097Z caller=helm.go:69 component=helm version=v3 info="Deleting \"kafkaconnects.kafka.strimzi.io\" in ..." targetNamespace=strimzi release=strimzi-kafka-operator
ts=2020-10-27T13:35:12.118525008Z caller=helm.go:69 component=helm version=v3 info="Deleting \"kafkaconnects2is.kafka.strimzi.io\" in ..." targetNamespace=strimzi release=strimzi-kafka-operator
ts=2020-10-27T13:35:12.138900256Z caller=helm.go:69 component=helm version=v3 info="Deleting \"kafkatopics.kafka.strimzi.io\" in ..." targetNamespace=strimzi release=strimzi-kafka-operator
ts=2020-10-27T13:35:12.144809518Z caller=helm.go:69 component=helm version=v3 info="Deleting \"kafkausers.kafka.strimzi.io\" in ..." targetNamespace=strimzi release=strimzi-kafka-operator
ts=2020-10-27T13:35:12.152097065Z caller=helm.go:69 component=helm version=v3 info="Deleting \"kafkamirrormakers.kafka.strimzi.io\" in ..." targetNamespace=strimzi release=strimzi-kafka-operator
ts=2020-10-27T13:35:12.16910867Z caller=helm.go:69 component=helm version=v3 info="Deleting \"kafkabridges.kafka.strimzi.io\" in ..." targetNamespace=strimzi release=strimzi-kafka-operator
ts=2020-10-27T13:35:12.189448458Z caller=helm.go:69 component=helm version=v3 info="Deleting \"kafkaconnectors.kafka.strimzi.io\" in ..." targetNamespace=strimzi release=strimzi-kafka-operator
ts=2020-10-27T13:35:12.19472487Z caller=helm.go:69 component=helm version=v3 info="Deleting \"kafkamirrormaker2s.kafka.strimzi.io\" in ..." targetNamespace=strimzi release=strimzi-kafka-operator
ts=2020-10-27T13:35:12.216209334Z caller=helm.go:69 component=helm version=v3 info="Deleting \"kafkarebalances.kafka.strimzi.io\" in ..." targetNamespace=strimzi release=strimzi-kafka-operator
ts=2020-10-27T13:35:12.425014421Z caller=helm.go:69 component=helm version=v3 info="updating status for upgraded release for strimzi-kafka-operator" targetNamespace=strimzi release=strimzi-kafka-operator
ts=2020-10-27T13:35:12.606613657Z caller=release.go:364 component=release release=strimzi-kafka-operator targetNamespace=strimzi resource=strimzi:helmrelease/strimzi-kafka-operator helmVersion=v3 info="upgrade succeeded" revision=0.20.0

Afterwards there are no CRDs to be found, but it’s weird that Helm doesn’t throw any errors and straight away starts removing all Kafka components. So I think this is more for the Helm chart maintainers than anything else, and these upgrades have to be thoroughly tested, as no one wants to inadvertently kill their entire Kafka cluster when upgrading the operator.
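
One way to catch this before upgrading (a sketch, assuming the strimzi repo is already added): helm template renders only the chart’s templates/ directory by default, not crds/, so comparing the rendered output of both versions shows the CRDs dropping out:

# Count the CRDs rendered from each chart version's templates.
helm template strimzi/strimzi-kafka-operator --version 0.19.0 | grep -c 'kind: CustomResourceDefinition'
helm template strimzi/strimzi-kafka-operator --version 0.20.0 | grep -c 'kind: CustomResourceDefinition'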

I just encountered the same situation. I’m running AWS EKS with Kubernetes 1.18, and when running the upgrade using Helm, all the Strimzi CRDs were removed and not installed back.
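
If you end up in this state, one way to get the CRDs back (assuming the 0.20.0 chart ships them in its crds/ directory) is to pull the chart and apply that directory, then re-apply any backed-up Kafka, KafkaTopic and KafkaUser resources:

# Pull the chart locally and re-create the CRDs it ships.
helm pull strimzi/strimzi-kafka-operator --version 0.20.0 --untar
kubectl apply -f strimzi-kafka-operator/crds/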