kapp: Conflict on weird fields

Hi, I don't really understand some conflict errors, maybe someone can help.

Here's an example; these fields appear in the diff:

  • kapp.k14s.io/nonce: sounds legit
  • image: legit, as it's a new version
  • initialDelaySeconds and cpu: I guess they've been “rewritten” by the kube API

These changes look legitimate, but they make kapp fail. Any idea how to prevent this?

    Updating resource deployment/app-strapi (apps/v1) namespace: env-1000jours-sre-kube-workflow-4y3w36:
      API server says:
        Operation cannot be fulfilled on deployments.apps "app-strapi": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict):
          Recalculated diff:
 11, 10 -     kapp.k14s.io/nonce: "1660057353002414185"
 12, 10 +     kapp.k14s.io/nonce: "1660062422261409721"
223,222 -   progressDeadlineSeconds: 600
225,223 -   revisionHistoryLimit: 10
230,227 -   strategy:
231,227 -     rollingUpdate:
232,227 -       maxSurge: 25%
233,227 -       maxUnavailable: 25%
234,227 -     type: RollingUpdate
237,229 -       creationTimestamp: null
269,260 -         image: something/strapi:sha-3977fb22378f2debdcacf4eeb6dd6f26dab24377
270,260 -         imagePullPolicy: IfNotPresent
271,260 +         image: something/strapi:sha-4ed2921f2fac053671f80fa02b72d124a23fa8c0
276,266 -             scheme: HTTP
279,268 -           successThreshold: 1
285,273 -           protocol: TCP
291,278 -             scheme: HTTP
292,278 +           initialDelaySeconds: 0
297,284 -             cpu: "1"
298,284 +             cpu: 1
300,287 -             cpu: 500m
301,287 +             cpu: 0.5
307,294 -             scheme: HTTP
309,295 -           successThreshold: 1
310,295 -           timeoutSeconds: 1
311,295 -         terminationMessagePath: /dev/termination-log
312,295 -         terminationMessagePolicy: File
316,298 -       dnsPolicy: ClusterFirst
317,298 -       restartPolicy: Always
318,298 -       schedulerName: default-scheduler
319,298 -       securityContext: {}
320,298 -       terminationGracePeriodSeconds: 30

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 58 (28 by maintainers)

Most upvoted comments

Hey!!

I finally resolved this issue, which was caused by many factors (but in the end only one was decisive).

Hypothesis #1: The Bad

First, Rancher was adding metadata.annotations."field.cattle.io/publicEndpoints", and the fix you gave us, using a rebase rule, works for this issue. This is now patched in kube-workflow (legacy) and kontinuous. @revolunet, here is the fix (you could also put this content in the file created here https://github.com/SocialGouv/1000jours/commit/a81b816b71dc995690b64012d5bad9be02108983; the format I use is meant to be consumed by the CLI, the other by the kapp kube controller, which we don't use).
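
For reference, a rebase rule for an operator-managed annotation like this one generally looks like the sketch below (not necessarily the exact rule that was shared; the allMatcher scope is an assumption):

apiVersion: kapp.k14s.io/v1alpha1
kind: Config

rebaseRules:
# Keep the Rancher-managed annotation from the live (existing) resource so
# kapp does not keep trying to remove it on every deploy.
- path: [metadata, annotations, "field.cattle.io/publicEndpoints"]
  type: copy
  sources: [existing]
  resourceMatchers:
  - allMatcher: {}  # assumption: apply to all resources; a narrower matcher also works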

Hypothesis #2: The Ugly

#kapp+sealed-secret+reloader: the other thing that was breaking everything was the combination of sealed-secrets and reloader. These tools are compatible with each other, but their behavior combined with kapp is not. Here is the process that breaks things:

  • kapp creates/updates the sealed-secret resources on the cluster
  • the sealed-secrets operator unseals the secret(s) and creates/updates the Secret on the cluster
  • the reloader operator detects the new secret and restarts the deployment, making an update; now the deployment is no longer the same version as before

I don't know what the better approach is to solve this, or whether it has to be solved at the reloader, sealed-secrets, or kapp level; at this time I don't see an option on any of these tools to resolve the conflict. For now, the only workaround is to not use reloader and to implement kapp versioned resources to ensure that the latest version of the unsealed secret is used by the deployment (sketched below). (Finally, I'm not sure there is really an issue here, but I'm sharing it to get your feedback, in case you see something that I don't.)
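
As a plain illustration of the versioned-resources mechanism (shown on a regular Secret for simplicity; names and values are placeholders, and wiring it through a SealedSecret may need extra kapp configuration):

apiVersion: v1
kind: Secret
metadata:
  name: app-secret  # placeholder name
  annotations:
    # kapp creates app-secret-ver-1, app-secret-ver-2, ... and updates known
    # references to it (e.g. secretRef in a Deployment's pod template), so a
    # new secret version triggers a rollout handled by kapp itself.
    kapp.k14s.io/versioned: ""
stringData:
  DATABASE_URL: "postgres://user:pass@db:5432/app"  # placeholder value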

Hypothesis #3: The Good One

Finally, there was one thing I hadn't understood: the link between the command in the job and the deployment. When the job ran pg_restore it was failing, but when we replaced it with sleep 240 (matching the time pg_restore takes to run) it worked. I first thought it was related to the resources used, so I reserved large resources for the job. But the problem affected even the Rancher annotations (maybe the network usage had a side effect on the operator, modifying the global behavior; very weird, I thought). Then, after disabling reloader, the deployment didn't seem to reboot anymore, so I thought it was resolved, but a few tries later the deployment started to reboot on kapp deploy before the job ended (the job is in a change group that is required by a change rule on the deployment).

Sorry for the unbearable suspense (it took me tens of hours)… It was the pod that was crashing. I didn't really know how this service was supposed to work, but it polls the database every few seconds, and while pg_restore was running, inconsistent data made it crash and restart. That restart, performed by kube-controller-manager, was changing the manifests. I don't know if this is an issue that can (and should) be treated at the kapp level. But for now we can resolve this on our side.
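
For context, the change-group / change-rule wiring mentioned above looks roughly like the following sketch (names, image, and command are made up):

apiVersion: batch/v1
kind: Job
metadata:
  name: db-restore  # hypothetical name
  annotations:
    # The restore Job is placed in a change group...
    kapp.k14s.io/change-group: "example.org/restore"
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: restore
        image: postgres:15  # hypothetical image
        command: ["pg_restore", "-d", "app", "/backups/app.dump"]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-strapi
  annotations:
    # ...and the Deployment is only upserted after everything in that
    # change group has been upserted.
    kapp.k14s.io/change-rule: "upsert after upserting example.org/restore"
spec:
  selector:
    matchLabels:
      app: app-strapi
  template:
    metadata:
      labels:
        app: app-strapi
    spec:
      containers:
      - name: app
        image: something/strapi:latest  # hypothetical tag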

Sorry for the big mess (and excuse my poor English). Thanks for your help and patience. And big up for developing this great tool that is kapp; we use it every day!

Thank you! Trimming the extra stuff and keeping the necessary details. Just like the first comment, it's the identity annotation that is causing the issue. As @100mik mentioned previously, we definitely need to find out and fix the issue with this annotation. I will bump the priority for this. Meanwhile, I will also try to look for a short-term solution.

Target cluster 'https://rancher.******'
@@ update deployment/simulateur (apps/v1) namespace: egapro-feat-add-index-subrouting-for-declatation-djn8zr @@
  ...
 11     -     kapp.k14s.io/identity: v1;egapro-feat-add-index-subrouting-for-declatation-djn8zr/apps/Deployment/simulateur;apps/v1
 12     -     kapp.k14s.io/nonce: "1664877613047615413"
     11 +     kapp.k14s.io/nonce: "1664880936933517787"
200     -   progressDeadlineSeconds: 600
202     -   revisionHistoryLimit: 10
207     -   strategy:
208     -     rollingUpdate:
209     -       maxSurge: 25%
210     -       maxUnavailable: 25%
211     -     type: RollingUpdate
214     -       creationTimestamp: null
241     -       - image: harbor.fabrique.social.gouv.fr/egapro/egapro/simulateur:sha-dd68d2376c6a3bc3896578fba4fdf652046a17ad
242     -         imagePullPolicy: IfNotPresent
    232 +       - image: harbor.fabrique.social.gouv.fr/egapro/egapro/simulateur:sha-c4934d8459daf82ab93b3e661f2cd4b8a3353672
248     -             scheme: HTTP
251     -           successThreshold: 1
257     -           protocol: TCP
263     -             scheme: HTTP
280     -             scheme: HTTP
282     -           successThreshold: 1
283     -           timeoutSeconds: 1
284     -         terminationMessagePath: /dev/termination-log
285     -         terminationMessagePolicy: File
286     -       dnsPolicy: ClusterFirst
287     -       restartPolicy: Always
288     -       schedulerName: default-scheduler
289     -       securityContext: {}
290     -       terminationGracePeriodSeconds: 30
---
10:57:04AM: update deployment/simulateur (apps/v1) namespace: egapro-feat-add-index-subrouting-for-declatation-djn8zr
[2022-10-04 10:57:04] WARN: kapp: Error: Applying update deployment/simulateur (apps/v1) namespace: egapro-feat-add-index-subrouting-for-declatation-djn8zr:
  Failed to update due to resource conflict (approved diff no longer matches):
    Updating resource deployment/simulateur (apps/v1) namespace: egapro-feat-add-index-subrouting-for-declatation-djn8zr:
      API server says:
        Operation cannot be fulfilled on deployments.apps "simulateur": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict):
          Recalculated diff:
 11, 11 -     kapp.k14s.io/nonce: "1664877613047615413"
 12, 11 +     kapp.k14s.io/nonce: "1664880936933517787"
199,199 -   progressDeadlineSeconds: 600
201,200 -   revisionHistoryLimit: 10
206,204 -   strategy:
207,204 -     rollingUpdate:
208,204 -       maxSurge: 25%
209,204 -       maxUnavailable: 25%
210,204 -     type: RollingUpdate
213,206 -       creationTimestamp: null
240,232 -       - image: harbor.fabrique.social.gouv.fr/egapro/egapro/simulateur:sha-dd68d2376c6a3bc3896578fba4fdf652046a17ad
241,232 -         imagePullPolicy: IfNotPresent
242,232 +       - image: harbor.fabrique.social.gouv.fr/egapro/egapro/simulateur:sha-c4934d8459daf82ab93b3e661f2cd4b8a3353672
247,238 -             scheme: HTTP
250,240 -           successThreshold: 1
256,245 -           protocol: TCP
262,250 -             scheme: HTTP
279,266 -             scheme: HTTP
281,267 -           successThreshold: 1
282,267 -           timeoutSeconds: 1
283,267 -         terminationMessagePath: /dev/termination-log
284,267 -         terminationMessagePolicy: File
285,267 -       dnsPolicy: ClusterFirst
286,267 -       restartPolicy: Always
287,267 -       schedulerName: default-scheduler
288,267 -       securityContext: {}
289,267 -       terminationGracePeriodSeconds: 30

Heyo! Sorry for the delay, I was verifying a few options.

For the time being you could add the following kapp Config to your manifests:

apiVersion: kapp.k14s.io/v1alpha1
kind: Config

diffAgainstLastAppliedFieldExclusionRules:
- path: [metadata, annotations, "kapp.k14s.io/identity"]
  resourceMatchers:
  - apiVersionKindMatcher: {apiVersion: apps/v1, kind: Deployment}

This would exclude the problematic field from diffing altogether.

If you already have a kapp Config you can just amend it with:

diffAgainstLastAppliedFieldExclusionRules:
- path: [metadata, annotations, "kapp.k14s.io/identity"]
  resourceMatchers:
  - apiVersionKindMatcher: {apiVersion: apps/v1, kind: Deployment}
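
For what it's worth, the Config document is just another file passed to kapp alongside the application manifests; a hypothetical invocation (app name and paths are placeholders) would be:

# The kapp Config is included with the rest of the manifests; names are placeholders.
kapp deploy -a my-app -f kapp-config.yml -f manifests/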

We also now use another tool alongside kapp, which we forked recently, that is able to detect common deployment errors, allowing us to fail fast with better debugging messages; maybe in the future these features could be integrated into kapp.

Thank you so much for sharing. We will definitely take a look at it and let you know our next steps 😃

the only workaround is to not use reloader and to implement kapp versioned resources to ensure that the latest version of the unsealed secret is used by the deployment.

This is what I was about to suggest when you mentioned you are using reloader! This would ensure that every part of the update is handled by kapp. It might reduce some overhead as well!

Sorry for the unbearable suspense (it took me tens of hours)…

No worries! Happy to hack through this with you

I don't know if this is an issue that can (and should) be treated at the kapp level. But for now we can resolve this on our side.

Trying to process all the information, but two thoughts come to mind.

  1. Are the change rules working as expected?
  2. Are you using versioned resources to update the deployment now?

And big up for developing this great tool that is kapp; we use it every day!

We are glad it helps!

@revolunet We will drop a ping on this issue when we have a release which resolves this.

Thanks for the prompt replies!

Gonna take a closer look at this; this is definitely not expected. However, I cannot reproduce the exact issue y'all have been running into 😦

The closest I could get was over here in the similar reproduction I posted, where kapp shows that the identity annotation is being removed when it is not.

Marking this as a bug for now, since it looks like the metadata on the deployment is as expected (assuming that env-xxx-5dc5hx is the namespace you are working with).

Ok, here's the top of the diff for that deployment:

Note: 1b7c24b0876fdb5c244aa3ada4d96329eb72e1a4 is the SHA of the image currently running in the namespace.

update deployment/app-strapi (apps/v1) namespace: env-xxx-5dc5hx @@
  ...
  8,  8       kapp.k14s.io/change-rule.restore: upsert after upserting kube-workflow/restore.env-xxx-5dc5hx
  9,  9       kapp.k14s.io/create-strategy: fallback-on-update
 10, 10       kapp.k14s.io/disable-original: ""
 11     -     kapp.k14s.io/identity: v1;env-xxx-5dc5hx/apps/Deployment/app-strapi;apps/v1
 12     -     kapp.k14s.io/nonce: "1660207590418011865"
     11 +     kapp.k14s.io/nonce: "1660209982534815766"
 13, 12       kapp.k14s.io/update-strategy: fallback-on-replace
 14, 13     creationTimestamp: "2022-08-11T08:49:11Z"
 15, 14     generation: 2
 16, 15     labels:
  ...
222,221     resourceVersion: "247917466"
223,222     uid: 2e7466f0-20aa-452c-9f24-b344a4723716
224,223   spec:
225     -   progressDeadlineSeconds: 600
226,224     replicas: 1
227     -   revisionHistoryLimit: 10
228,225     selector:
229,226       matchLabels:
230,227         component: app-strapi
231,228         kubeworkflow/kapp: xxx
232     -   strategy:
233     -     rollingUpdate:
234     -       maxSurge: 25%
235     -       maxUnavailable: 25%
236     -     type: RollingUpdate
237,229     template:
238,230       metadata:
239     -       creationTimestamp: null
240,231         labels:
241,232           application: xxx
242,233           component: app-strapi
243,234           kapp.k14s.io/association: v1.b90f821a0c6816e919c5ec622aa834cc
  ...
268,259               name: strapi-configmap
269,260           - secretRef:
270,261               name: pg-user-revolunet-patch-1
271     -         image: xxx/strapi:sha-1b7c24b0876fdb5c244aa3ada4d96329eb72e1a4
272     -         imagePullPolicy: IfNotPresent
    262 +         image: xxx/strapi:sha-dd16295f5e3d620ffb6874184abbf91f2b304cbf
273,263           livenessProbe:
274,264             failureThreshold: 15
275,265             httpGet:
276,266               path: /_health
277,267               port: http
278     -             scheme: HTTP
279,268             initialDelaySeconds: 30
280,269             periodSeconds: 5
281     -           successThreshold: 1
282,270             timeoutSeconds: 5
283,271           name: app
284,272           ports:
285,273           - containerPort: 1337
286,274             name: http
287     -           protocol: TCP
288,275           readinessProbe:
289,276             failureThreshold: 15
290,277             httpGet:
291,278               path: /_health
292,279               port: http
293     -             scheme: HTTP
294,280             initialDelaySeconds: 10
295,281             periodSeconds: 5
296,282             successThreshold: 1
297,283             timeoutSeconds: 1
  ...
307,293             httpGet:
308,294               path: /_health
309,295               port: http
310     -             scheme: HTTP
311,296             periodSeconds: 5
312     -           successThreshold: 1
313     -           timeoutSeconds: 1
314     -         terminationMessagePath: /dev/termination-log
315     -         terminationMessagePolicy: File
316,297           volumeMounts:
317,298           - mountPath: /app/public/uploads
318,299             name: uploads
319     -       dnsPolicy: ClusterFirst
320     -       restartPolicy: Always
321     -       schedulerName: default-scheduler
322     -       securityContext: {}
323     -       terminationGracePeriodSeconds: 30
324,300         volumes:
325,301         - emptyDir: {}
326,302           name: uploads

So I can add --diff-changes=true --diff-context=4 in the code above and get a more detailed diff?

Yeah, comparing the original diff with the recalculated diff would give us an idea of the fields that are getting updated in the background and we could then try to figure out a way to resolve it (maybe a rebase rule to not update those fields).
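
If it helps, a rebase rule for fields defaulted by the API server (the kind of fields showing up as removals in the diffs above) could look roughly like the sketch below; the exact paths to cover would depend on the recalculated diff:

apiVersion: kapp.k14s.io/v1alpha1
kind: Config

rebaseRules:
# Prefer the cluster's (defaulted) values for these fields so they no longer
# show up as removals when diffing against the live resource.
- paths:
  - [spec, progressDeadlineSeconds]
  - [spec, revisionHistoryLimit]
  - [spec, strategy]
  type: copy
  sources: [existing, new]
  resourceMatchers:
  - apiVersionKindMatcher: {apiVersion: apps/v1, kind: Deployment}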