kubectl: kubectl rollout undo should warn about undefined behaviour with kubectl apply

What would you like to be added: Make kubectl rollout undo and kubectl apply work together. kubectl rollout undo doesn’t revert the kubectl.kubernetes.io/last-applied-configuration annotation, which leads to problems when named objects in arrays are supposed to be removed.

Why is this needed: I ran into the issue that I had an old container still being present after applying a rename of a sidecar. Symptoms look like these issues: https://github.com/kubernetes/kubernetes/issues/63982 and https://github.com/kubernetes/kubernetes/issues/25390.

What happened to me is as follows:

  • deployment has containers a + b
  • I wanted to rename b to c and change c’s implementation
  • I ran kubectl apply with the changed yaml with a version of c which contains an error
  • rollout status doesn’t finish, because the pod goes into a CrashLoopBackOff
  • I see the failure occur and decide to revert to the last version: kubectl rollout undo
  • the deployment is reverted and the new replica set is scaled down
  • however the kubectl.kubernetes.io/last-applied-configuration annotation on the deployment’s metadata is not rolled back
  • the next time I deploy the a + c yaml for the deployment, b is not in the last-applied configuration, so the Kubernetes config merge algorithm doesn’t remove b
  • the pods are running with containers a + b + c

The rollout undo command doesn’t roll back kubectl.kubernetes.io/last-applied-configuration, which doesn’t make sense imho.
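
For reference, the mismatch is easy to see on the live object. A hedged sketch (the deployment name my-deployment is a placeholder for the one above):

kubectl apply view-last-applied deployment/my-deployment          # still shows the a + c containers
kubectl get deployment my-deployment \
  -o jsonpath='{.spec.template.spec.containers[*].name}{"\n"}'    # shows a b after the undo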

I reported this to Google Support, with a reproduction, their answer was this:

We do not typically recommend using “rollout” to be honest, in general we would recommend re-applying the old config.

There are a lot of cases where the old config is no longer available to the user, especially since rollout undo is typically only used when something is going wrong. At those times you want kubectl to be your friend who has your back…

About this issue

  • State: open
  • Created 4 years ago
  • Reactions: 2
  • Comments: 42 (31 by maintainers)

Most upvoted comments

kubectl rollout undo is an imperative command, similar to kubectl edit, and was not intended to be used with kubectl apply in its original design. For kubectl apply users, modifying the configs (e.g. reverting the changes) and then applying them with kubectl apply is the declarative way to do a rollback.
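
As a rough sketch of that declarative workflow, assuming the previously applied manifest is still in version control (the path and revision below are placeholders):

# Recover the manifest that was applied before the bad change and re-apply it,
# instead of running kubectl rollout undo.
git show HEAD~1:deploy/my-app.yaml > /tmp/previous.yaml
kubectl apply -f /tmp/previous.yaml
kubectl rollout status deployment/my-app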

/retitle kubectl rollout undo should warn about undefined behaviour with kubectl apply

(I think it is misleading to accept this as previously titled; alternatively, we can revert the title and the triage and create a separate issue for the warning)

We discussed this at last week’s SIG meeting and will follow up again next week after taking some time to reflect. You can listen to the conversation starting at https://youtu.be/Wh6eU1AfHBY?t=294. Notes follow.

This is happening because the implementation of kubectl rollout undo essentially copies the spec of the targeted ReplicaSet into the Deployment, without touching the last-applied annotation. That means last-applied continues to reflect the spec from the apply operation that created the newer ReplicaSet we moved off of, not the older one whose spec is now back in the Deployment. Because of this, the strategic merge calculation performed by the next kubectl apply makes an illogical and potentially invalid comparison. In the case where the set of fields under management differs among the three points in time in question, apply may draw conclusions about field ownership that are perfectly correct given the inputs, but incorrect given the real history and actual user intent.

In @johngmyers’ example, when the rollback happens, the Deployment ends up in a state where the spec has container “foo” and the last-applied annotation has container “bar”. When apply subsequently runs with container “bar” added again, it sees only “bar” in the annotation. It therefore assumes that a different actor added “foo” and that it should be left alone.
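
Using the names from the reproduction at the bottom of this issue, a server-side dry run right after the undo should preview that merge without changing anything (a sketch, not verified on every kubectl version):

# Live spec has foo, last-applied has bar; previewing the next apply of the bar manifest
# should show both containers surviving the merge.
kubectl apply -f sample-new.yaml --dry-run=server \
  -o jsonpath='{.spec.template.spec.containers[*].name}{"\n"}'    # expected: foo bar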

This is impossible to fix with code alone: we’d need to start keeping more historical data than we do today. Specifically, we’d need to preserve the last-applied information by copying it to the ReplicaSet. It is deliberately not copied over today, because doing so would cause apply to think it manages the ReplicaSet directly (it doesn’t). However, we could copy the data into a different annotation, and then perform the reverse translation when copying it back during a rollback. Something similar would need to be done for the newer metadata.managedFields used by server-side apply as well. Overall, it would be a significant increase in the size of all ReplicaSet objects created from Deployments managed by apply, for all users. We need to carefully consider whether this bug fix is worth that non-trivial cost.
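
Purely as a sketch of that idea (the annotation key and ReplicaSet name below are invented for illustration; nothing like this exists today):

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: test-app-5b7c9d4f6              # hypothetical ReplicaSet owned by the Deployment
  annotations:
    # Hypothetical key: the Deployment's last-applied config stashed under a name apply
    # does not recognize, so apply never thinks it manages the ReplicaSet directly.
    # rollout undo would translate this back into
    # kubectl.kubernetes.io/last-applied-configuration on the Deployment.
    deployment.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"name":"test-app"}, ...}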

SIG-CLI often faces challenges related to the interaction between the imperative and declarative commands kubectl provides. For example, making changes with kubectl edit between apply operations can cause a very similar problem. My personal opinion is that rollout undo falls in a grey area between the two types of command. The user’s intent in this case isn’t to make a manual change; it’s to roll back the entire Deployment object to its previous state. Where the apply data is stored is an implementation detail they may have no awareness of, and nothing signals to them that the rollback will be incomplete in an important way. Also, unlike the CRUD imperative commands, rollout does not strike me as something we can call intended for experimental rather than production use. On the contrary, it sounds like something you should be able to reach for in an emergency.

Points to reflect on and come back to at next week’s meeting:

  • What is the purpose of this command? If we feel it should never be used in production, why are we keeping it?
  • Is the price we’d pay in storage to fix this worth the benefit?
  • If not, are there any other solutions we haven’t considered?
  • If not, how can we make this safer to use? (e.g. emit warnings when relevant? provide repair instructions in docs, like the sketch below?)
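
On the repair-instructions point, one fix that appears to work for client-side apply is to reset the annotation to the manifest that is live again after the rollback (file names taken from the reproduction below):

# After kubectl rollout undo, point last-applied back at the config that is actually
# running, so the next apply diffs against the right baseline.
kubectl apply set-last-applied -f sample.yaml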

cc @seans3 @eddiezane @ant31

Reproduction procedure:

cat >sample.yaml <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-app
  labels:
    app: test-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-app
  template:
    metadata:
      labels:
        app: test-app
    spec:
      containers:
        - name: foo
          image: gcr.io/google_containers/pause-amd64:3.0
EOF
cat >sample-new.yaml <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-app
  labels:
    app: test-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-app
  template:
    metadata:
      labels:
        app: test-app
    spec:
      containers:
        - name: bar
          image: gcr.io/google_containers/pause-amd64:3.0
EOF
kubectl apply -f sample.yaml
kubectl apply -f sample-new.yaml
kubectl rollout undo deployment/test-app
kubectl apply -f sample-new.yaml

Expected result: Deployment containing one container named bar. Actual result: Deployment containing two containers: foo and bar.
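
One way to confirm which result you got after running the steps above:

kubectl get deployment test-app \
  -o jsonpath='{.spec.template.spec.containers[*].name}{"\n"}'
# prints "foo bar" on an affected cluster instead of just "bar"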