kubernetes: Replace does not work for services

What happened:

This issue on kubectl was closed as not a problem with kubectl, but with the server

What you expected to happen:

‘kubectl replace’ with no change on a service should work.

How to reproduce it (as minimally and precisely as possible):

reproduction included in this issue https://github.com/kubernetes/kubectl/issues/798#issue-547658418

Anything else we need to know?:

Seems to me like this comment sums up the options https://github.com/kubernetes/kubectl/issues/798#issuecomment-577321234

I got here from this issue https://github.com/helm/helm/issues/6378#issuecomment-572716384

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 17
  • Comments: 37 (17 by maintainers)

Most upvoted comments

Seriously, this has been going on for years. #65272 is over 2 years old and there have been dozens of bug reports/feature requests (in Kubernetes, Kubectl, Helm, and elsewhere) showing the problems the current behavior is causing. (List available on request.) https://github.com/kubernetes/kubernetes/issues/25241#issuecomment-278367696 is from over 3 years ago. #11237 is from 2015. The old justifications for keeping the current behavior (“works as designed”) have proven to be insufficient. It is time for a redesign.

The API should accept clusterIP: "" as meaning “assign a clusterIP” on creation and “keep the current clusterIP” on update. Since, when you set clusterIP: "" on creation, the resource does not end up with clusterIP: "", this means a resource cannot validly have clusterIP: "", and since the field is immutable, clusterIP: "" could never be a correct value for modification under the current API.

While this could be considered a breaking change, I doubt there is anyone who would complain about it. Rejecting clusterIP: "" as a modification is not helping anyone. It is not rejecting an accidental change or avoiding confusion because the meaning is unclear, and it is hard for me to imagine a use case where someone is relying on the current behavior and would be upset about the proposed new behavior (with the possible exception of people writing compliance tests). It is just causing problems for the sake of some kind of theoretical purity that is in the eye of the beholder. Treating clusterIP: "" as meaning “system should replace this with a valid setting” is equally pure.

Agree, and while we might be able to quietly patch this SPECIFIC case, we can’t do that in general.

@thockin I am not advocating that you turn PUT into PATCH in general; I am arguing that you should honor the principle that PUT is idempotent. The definition of the clusterIP field in this SPECIFIC case (where the result of PUTting a value does not result in that value being set in the resource) needs be handled specially in order to fulfill the contract of PUT. (This is based on the presumption that you can create a Service via PUT, which is recommend though not required by REST conventions, and is stated as true in this comment. I would argue that even if you required POST to create a Service, it remains the spirit of the REST convention that you can PUT what you used in POST with the exception of PUT possibly requiring the addition of a resource identifier created by POST.)

@lygasch I reject the idea that this is a bug in Helm, even though Helm is working hard (and failing) to handle the situation. I agree with you when you “Agree that adhering to expected PUT/PATCH api behavior would be best”. My point is that the expected “PUT” behavior is that if a “PUT” request succeeds, the exact same “PUT” request given a second time should succeed as well. That is what is broken and I am asking to have fixed.

@sjmiller609 @Nuru @lygasch @jansenmccloud - My inclination is to close this. I know it’s not what you want, but I do think you and most users will be better served by patch or apply in almost all cases (even knowing that Service has an ambiguity around changing ports that overlap each other which is not easily fixed because it is breaking).

Is there a reason why you can’t use patch or apply?

Is there a better reason why I should be prohibited from using replace on a Service, when I can use it for every other resource? Why should Service require special-case handling by the user?

I ask you to think of this in the larger context of using Infrastructure as Code (IaC) to manage resources, and the automation around that. In the IaC definition of a Service, the clusterIP has to be set to "" in order to get a system-assigned address.

Semantically, PUT says “make the resource exactly like this”. It can be, and is, used to ensure there is not “drift” in the configuration, i.e. that the configuration stored in the IaC code base is live. It is the correct thing to use when you want to say “here is the configuration that should be active”, and it is what automation uses to reset the configuration after it has drifted. Users should not have to treat Services specially, they should be able to replace any resources with the same spec that created them.

Stating this another way, the general contract of PUT is that it is idempotent. (A user should be able to PUT the same thing repeatedly, successfully, and without causing changes after the first accepted PUT.) By not accepting "" for clusterIP in PUT, you are breaking the contract of making PUT idempotent.

(As a matter of practicality, the Helm maintainers are trying to make up for the fact that PUT is not idempotent, but that is not an excuse for Kubernetes to leave this API in its current state.)

As I stated above, I am not even advocating for this case to be handled as a PATCH. As I see it, the value of clusterIP is handled specially but not idempotently. The value of "" currently means one thing if the resource does not exist and something else if it does exist. I am asking instead that it consistently, idempotently be taken as meaning “any valid address: the existing one if one is set, otherwise one randomly assigned by the controller”.

@khenidak nailed it on the head with his last comment. I do agree that we should not be re-defining the behaviour of a PUT request. But the inconsistency of how a Service updates the spec.clusterIP field compared to other resources results in inconsistencies in the PUT behaviour.

Once a resource defines fields in the spec object, normally Kubernetes leaves it unchanged. My understanding is that Kubernetes uses this spec field to use as a reference point to be the “ideal state of the world”, using the status field to record the “current state of the resource”.

The only offending case I’ve seen in the wild is a Service, as it updates both the status field AND the spec.ClusterIP field. The end result is that a kubectl replace with a Service defining spec.ClusterIP = "" results in differing behaviour than the rest of the resource APIs.

From Helm’s point of view, a chart author’s only course of action is to

a) omit spec.clusterIP from the manifest and all subsequent updates, or b) if you happened to define spec.clusterIP = "" in your first install, find the value Kubernetes provided and inject it into your template so that the value Kubernetes provided is preseved.

Failing to do so results in a failed PUT (helm upgrade --force) request. I wrote this behaviour down in-depth in this Helm comment: https://github.com/helm/helm/issues/6378#issuecomment-557746499

I’d propose an alternative solution than @khenidak’s solutions: instead of updating spec.clusterIP after an IP has been assigned to the Service, use the status API, i.e. status.clusterIP. It would be a breaking change to the API, but it should bring Services back in line with the rest of the resource APIs, leaving spec.clusterIP as a user-defined field.

At this point I am willing to call it a bug, albeit an unpleasant one.

The change was released in the 1.22.0-beta.2, so yes it should be available in 1.22. See https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.22.md#api-change-2

I do not believe it was backported to 1.21.

Thanks for confirming and providing the details @bacongobbler

I am probably missing some subtlety here, but as in https://github.com/helm/helm/issues/7956#issuecomment-640912141 I suspect that Helm is behaving properly and the problem is in way Kubernetes handles Services: the spec of a resource should never be rewritten with variable environmental information; that is what status is for. In other words .spec.clusterIP could be set in the unusual case that you wished to pick an IP yourself but would normally be null; .status.clusterIP would reflect the service’s current value, expected to be retained across unrelated .spec patches.

It is a bug. Not because it does not match rest api semantics. But because it is not consistent with the rest of the apis. In other words, If user can do replace (aka idempotent PUT) on other resources they should be able to do it on Services as well. It is hard (and weird) to ask users to do one thing for all resources, but then do a totally different thing for Services.

As discussed over sig-network the problem will be user intent. Some ideas for a solution:

  • If old.Spec.ClusterIP=<IP> && new.Spec.ClusterIP=<empty-string> irrespective of changes in other new.Spec.OtherFields we can assume that the user is replacing and does not care about ClusterIP value, hence copying old.Spec.ClusterIP to new new.Spec.ClusterIP should be fine. It is basically like the user saying I am ok with whatever clusterIP you give me, or change it so copying old=>new should be Ok.
  • If old.Spec.ClusterIP=<IP> && new.Spec.ClusterIP=<other-IP> that is a failure case, because it breaks ClusterIP immutability rules (@thockin we should really, really remove ClusterIP immutability but i digress).
  • We will still require the validation to work as is. e.g. replacing with service.Spec.Type with a type that does not require ClusterIP and not setting it to <empty-string> should fail.

I can look into building a POC for this.

/cc (i know i am going to regret this)

@thockin wrote

As for PUT being idempotent, I don’t think that’s ever really been a hard objective.

I am shocked that you would think that, although I don’t really know what you mean by “hard objective”. The Kubernetes API Concepts doc certainly implies PUT should be idempotent, via such statements as

The Kubernetes API is a resource-based (RESTful) programmatic interface provided via HTTP. It supports retrieving, creating, updating, and deleting primary resources via the standard HTTP verbs (POST, PUT, PATCH, DELETE, GET)…

and

All objects will have a unique name to allow idempotent creation and retrieval

Moving on to how to do make it so, the idea of PUT is that at the end of the operation, the object is in the state described by the action. If you PUT ports P and Q and the existing object has ports R and S, you delete R and S and add P and Q. If adding or deleting a port is not allowed, then you reject the PUT.

I think to some extent you are overthinking/overdesigning a solution here. In Kubernetes, every PUT is to some extent a patch because there are some fields that are not specifiable via PUT, like all the managed fields, or creationTimestamp, and fields like selfLink and resourceVersion are special in a whole different way. But that stuff should already be handled by the more generalized resource handling architecture. The general rule about PUT is that PUT on an existing resource should adjust that resource to be in the same working configuration as if you used that PUT to create a new resource. Calling PUT a second time with the same input should be acceptable and should not change the resource.

I believe that Kubernetes generally has a canonical ordering of all fields in a resource (and if there is no clear way to order a list, the canonical order is simply the order in which the list was originally specified on resource creation), so what you do is enumerate the fields of the existing resource in canonical order, filter out the fields like creationTimestamp that PUT cannot specify, and then compare it to the PUT resource, after normalizing it into canonical order, and add/delete/replace field by field.

Clearly something like this is already happening in handling PUT and PATCH. This should not be a lot of new work.

The ongoing community complaint is that when PUT says clusterIP: "" and it is matched against the exiting clusterIP: 172.20.4.5 the API rejects it, saying you can’t change the clusterIP. All we are asking for is that this case be interpreted differently and allowed. clusterIP: "" is documented as an allowable value in the input, but it is never an allowable value in the realized resource, so it cannot possibly mean to be interpreted as “force this value on the resource”. Its meaning is that the system should assign a valid value and the user does not care what it is. So on PUT, when clusterIP: "" is matched against the exiting clusterIP: 172.20.4.5, he API should just confirm that clusterIP: 172.20.4.5 is a valid value and, if so, accept the input.

keepalive

I can still see the error but I’m on GKE 1.18, will this only apply to Kubernetes 1.21/22?

It will be part of 1.22 or the next 1.21.X (I don’t remember now what is the latest stable)

That document does allow for breaking changes under the apiVersion. Instead of making a breaking change to v1 services, you can introduce Service v2. Existing v1 services migrate their spec.clusterIP’s existing value (auto-generated or not) to the new version. spec.clusterIP and status.clusterIP would now match their corresponding relationships, and Kubernetes retains backwards/forwards compatibility.

/remove-sig cli /sig network

/assign @thockin this is sort of the inverse of the issue fixed in https://github.com/kubernetes/kubernetes/pull/95196

@Nuru

Using kubectl replace with a static manifest on an object that has spec fields populated by the API server (like clusterIP in service) or controller (like replicas in deployment) is an anti-pattern. kubectl apply is a better match for this usage. Alternatively, a controller that did a ‘read -> modify -> replace’ could be ok, since it would be starting with the current API server state.