external-secrets: Upgrading from v1alpha1 to v1beta1 via Flux Kustomization causes dry-run failure warnings
Upgrading ExternalSecret objects from apiVersion v1alpha1 to v1beta1 via a Flux2 Kustomization causes the Flux kustomize-controller to report repeated warnings, which in turn make Kustomization reconciliation fail at random:
Warning ReconciliationFailed 5m16s (x14 over 85m) kustomize-controller ExternalSecret/some-namespace/some-name dry-run failed, error: failed to prune fields: failed add back owned items: failed to convert pruned object at version external-secrets.io/v1beta1: conversion webhook for external-secrets.io/v1alpha1, Kind=ExternalSecret returned invalid metadata: invalid metadata of type <nil> in input object
If you wait long enough, Kustomization reconciliation eventually succeeds and the ExternalSecret objects are upgraded to v1beta1, but the Kustomization keeps receiving those warnings indefinitely and randomly becomes degraded because of them.
If a v1beta1 ExternalSecret is deployed from scratch, there are no problems.
Reproduction steps:
- Deploy external-secrets v0.3.10
- Deploy ExternalSecret object with apiVersion v1alpha1 using Flux2 Kustomization (must contain the wait: true field in spec to enable health checks)
- Upgrade external-secrets to v0.5.3
- Change ExternalSecret object manifest in Flux source to v1beta1
- Wait for some time (exact time may vary) and check events of Kustomization object
Observations (Constraints, Context, etc):
- Kubernetes 1.22 (EKS)
- external-secrets v0.5.3
- flux2 v0.30.2
Note: Flux2 kustomize-controller performs server-side apply when handling objects.
About this issue
- State: closed
- Created 2 years ago
- Reactions: 3
- Comments: 20 (6 by maintainers)
Hi All,
For anyone else experiencing this issue, I thought I’d add some information on how I was able to solve it. The issue I found was that, after upgrading, the managed fields on the objects recorded ownership of fields for both v1alpha1 and v1beta1. Removing the v1alpha1 field ownership fixed the problem.
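To check whether an object is in this mixed-ownership state, something like the following might help. It is only a sketch: it assumes you have the object as parsed JSON (for example from kubectl get ... -o json --show-managed-fields), and the helper names and sample data are made up for illustration.

```python
def managed_api_versions(obj):
    """Collect the apiVersion recorded on each .metadata.managedFields entry."""
    entries = obj.get("metadata", {}).get("managedFields") or []
    return {e["apiVersion"] for e in entries if e.get("apiVersion")}


def has_mixed_ownership(obj, group="external-secrets.io"):
    """True when managedFields reference more than one version of the API group."""
    versions = {v for v in managed_api_versions(obj) if v.startswith(group + "/")}
    return len(versions) > 1


# Illustrative object only; the manager names and operations are assumptions.
sample = {
    "apiVersion": "external-secrets.io/v1beta1",
    "kind": "ExternalSecret",
    "metadata": {
        "name": "some-name",
        "namespace": "some-namespace",
        "managedFields": [
            {"manager": "kustomize-controller", "operation": "Apply",
             "apiVersion": "external-secrets.io/v1alpha1"},
            {"manager": "kustomize-controller", "operation": "Apply",
             "apiVersion": "external-secrets.io/v1beta1"},
        ],
    },
}

print(has_mixed_ownership(sample))
```

An object that only ever existed as v1beta1 should report a single version here.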
To fix it, I had to pull out the objects with their managed fields via kubectl, remove the offending block, then reapply. The block to remove is the entry under .metadata.managedFields that records the v1alpha1 ownership.
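A rough sketch of the "remove the offending block" step, not the exact commands used here: it assumes the object was exported as JSON with its managed fields included, and it only prepares the cleaned object. How you write it back to the cluster is up to you. The function name and sample data are hypothetical.

```python
import copy

STALE_API_VERSION = "external-secrets.io/v1alpha1"


def drop_stale_managed_fields(obj, stale_api_version=STALE_API_VERSION):
    """Return a copy of obj without managedFields entries recorded at stale_api_version."""
    cleaned = copy.deepcopy(obj)  # leave the caller's object untouched
    metadata = cleaned.setdefault("metadata", {})
    entries = metadata.get("managedFields") or []
    metadata["managedFields"] = [
        entry for entry in entries if entry.get("apiVersion") != stale_api_version
    ]
    return cleaned


# Illustrative object only; the manager names and operations are assumptions.
exported = {
    "apiVersion": "external-secrets.io/v1beta1",
    "kind": "ExternalSecret",
    "metadata": {
        "name": "some-name",
        "namespace": "some-namespace",
        "managedFields": [
            {"manager": "kustomize-controller", "operation": "Apply",
             "apiVersion": "external-secrets.io/v1alpha1"},
            {"manager": "kustomize-controller", "operation": "Apply",
             "apiVersion": "external-secrets.io/v1beta1"},
        ],
    },
}

cleaned = drop_stale_managed_fields(exported)
print([entry["apiVersion"] for entry in cleaned["metadata"]["managedFields"]])
```

Filtering on apiVersion rather than on manager name keeps the v1beta1 ownership of the same manager intact.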
The cluster is 1.23.6. But I am not sure it is directly related to the external-secrets conversion webhook: I have added some logging of request/response to the webhook and so far I have found nothing out of the ordinary there. When running the kubectl command above, it occurs in about 1 out of 4 or 5 runs. I am considering setting up a kind-based cluster to see if it is reproducible there, as a lot of calls go to the webhook in our test cluster, so it can be hard to find the correct logs.
The error originates here: https://github.com/kubernetes/apiextensions-apiserver/blob/master/pkg/apiserver/conversion/webhook_converter.go#L388 It is reached through the logic here: https://github.com/kubernetes-sigs/structured-merge-diff/blob/master/merge/update.go#L233
But since I cannot see anything suspicious in the logged responses from the external-secrets webhook, I am leaning towards this being something more general: maybe “dry-runs” are not done that often in combination with a conversion. Of course we will eventually convert the resources to v1beta1, but the migration will take some time, and since this seems “random”, Flux will probably get it reconciled eventually. But it would be nice to know the real cause of this 😃
FYI: https://gist.github.com/langecode/1d42bf97cac71c0c664b530c30902251 Changed webhook.go around lines 96-108 and again at 147-152. Not for production use, of course, but it got the request/response payload out. I also added logs in the actual conversion functions, similar to the image you (@moolen) have built.
Still looking into this, but just for the record: I found a way to mimic the dry-run from the Flux Kustomization controller, thus reproducing the problem. Doing a dry-run apply with an explicit field-manager will, at least in my environment, periodically give the same error as seen by Flux.
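As a rough sketch of such a dry-run loop (not the exact command from this comment): it assumes kubectl's --server-side, --dry-run=server and --field-manager flags, that Flux's field manager is named kustomize-controller, and a hypothetical manifest path. Since the error only shows up in some runs, the helper repeats the apply and counts the failures.

```python
import subprocess


def build_dry_run_command(manifest_path, field_manager="kustomize-controller"):
    """Build a server-side dry-run apply that imitates the Flux field manager."""
    return [
        "kubectl", "apply",
        "--server-side",
        "--dry-run=server",
        f"--field-manager={field_manager}",
        "-f", manifest_path,
    ]


def count_failures(manifest_path, attempts=10, runner=subprocess.run):
    """Run the dry-run `attempts` times and return how many runs exited non-zero.

    `runner` is injectable so the loop can be exercised without a cluster.
    """
    failures = 0
    for _ in range(attempts):
        result = runner(
            build_dry_run_command(manifest_path),
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            failures += 1
    return failures
```

Against a real cluster, count_failures("externalsecret.yaml", attempts=20) would call kubectl each time; the file name is just a placeholder.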
(The field-manager is specified only to simulate the Flux kustomize controller, of course.)
We also have this issue. All of our manifests are v1beta1 and we have deleted all ExternalSecret resources in our clusters (so Flux could recreate them from scratch). Deleting the resources helped for a while, but this error keeps coming back.
Using v0.5.6