kubernetes: Replace does not work for services
What happened:
This issue on kubectl was closed as not a problem with kubectl, but with the server
What you expected to happen:
‘kubectl replace’ with no change on a service should work.
How to reproduce it (as minimally and precisely as possible):
reproduction included in this issue https://github.com/kubernetes/kubectl/issues/798#issue-547658418
Anything else we need to know?:
Seems to me like this comment sums up the options https://github.com/kubernetes/kubectl/issues/798#issuecomment-577321234
I got here from this issue https://github.com/helm/helm/issues/6378#issuecomment-572716384
Environment:
- Kubernetes version (use
kubectl version
): - Cloud provider or hardware configuration:
- OS (e.g:
cat /etc/os-release
): - Kernel (e.g.
uname -a
): - Install tools:
- Network plugin and version (if this is a network-related bug):
- Others:
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 17
- Comments: 37 (17 by maintainers)
Seriously, this has been going on for years. #65272 is over 2 years old and there have been dozens of bug reports/feature requests (in Kubernetes, Kubectl, Helm, and elsewhere) showing the problems the current behavior is causing. (List available on request.) https://github.com/kubernetes/kubernetes/issues/25241#issuecomment-278367696 is from over 3 years ago. #11237 is from 2015. The old justifications for keeping the current behavior (“works as designed”) have proven to be insufficient. It is time for a redesign.
The API should accept
clusterIP: ""
as meaning “assign a clusterIP” on creation and “keep the current clusterIP” on update. Since, when you setclusterIP: ""
on creation, the resource does not end up withclusterIP: ""
, this means a resource cannot validly haveclusterIP: ""
, and since the field is immutable,clusterIP: ""
could never be a correct value for modification under the current API.While this could be considered a breaking change, I doubt there is anyone who would complain about it. Rejecting
clusterIP: ""
as a modification is not helping anyone. It is not rejecting an accidental change or avoiding confusion because the meaning is unclear, and it is hard for me to imagine a use case where someone is relying on the current behavior and would be upset about the proposed new behavior (with the possible exception of people writing compliance tests). It is just causing problems for the sake of some kind of theoretical purity that is in the eye of the beholder. TreatingclusterIP: ""
as meaning “system should replace this with a valid setting” is equally pure.@thockin I am not advocating that you turn PUT into PATCH in general; I am arguing that you should honor the principle that
PUT
is idempotent. The definition of theclusterIP
field in this SPECIFIC case (where the result of PUTting a value does not result in that value being set in the resource) needs be handled specially in order to fulfill the contract ofPUT
. (This is based on the presumption that you can create a Service viaPUT
, which is recommend though not required by REST conventions, and is stated as true in this comment. I would argue that even if you requiredPOST
to create a Service, it remains the spirit of the REST convention that you canPUT
what you used inPOST
with the exception ofPUT
possibly requiring the addition of a resource identifier created byPOST
.)@lygasch I reject the idea that this is a bug in Helm, even though Helm is working hard (and failing) to handle the situation. I agree with you when you “Agree that adhering to expected PUT/PATCH api behavior would be best”. My point is that the expected “PUT” behavior is that if a “PUT” request succeeds, the exact same “PUT” request given a second time should succeed as well. That is what is broken and I am asking to have fixed.
Is there a better reason why I should be prohibited from using replace on a Service, when I can use it for every other resource? Why should Service require special-case handling by the user?
I ask you to think of this in the larger context of using Infrastructure as Code (IaC) to manage resources, and the automation around that. In the IaC definition of a Service, the
clusterIP
has to be set to""
in order to get a system-assigned address.Semantically,
PUT
says “make the resource exactly like this”. It can be, and is, used to ensure there is not “drift” in the configuration, i.e. that the configuration stored in the IaC code base is live. It is the correct thing to use when you want to say “here is the configuration that should be active”, and it is what automation uses to reset the configuration after it has drifted. Users should not have to treat Services specially, they should be able toreplace
any resources with the same spec that created them.Stating this another way, the general contract of
PUT
is that it is idempotent. (A user should be able toPUT
the same thing repeatedly, successfully, and without causing changes after the first acceptedPUT
.) By not accepting""
forclusterIP
inPUT
, you are breaking the contract of makingPUT
idempotent.(As a matter of practicality, the Helm maintainers are trying to make up for the fact that
PUT
is not idempotent, but that is not an excuse for Kubernetes to leave this API in its current state.)As I stated above, I am not even advocating for this case to be handled as a
PATCH
. As I see it, the value ofclusterIP
is handled specially but not idempotently. The value of""
currently means one thing if the resource does not exist and something else if it does exist. I am asking instead that it consistently, idempotently be taken as meaning “any valid address: the existing one if one is set, otherwise one randomly assigned by the controller”.@khenidak nailed it on the head with his last comment. I do agree that we should not be re-defining the behaviour of a PUT request. But the inconsistency of how a Service updates the
spec.clusterIP
field compared to other resources results in inconsistencies in the PUT behaviour.Once a resource defines fields in the
spec
object, normally Kubernetes leaves it unchanged. My understanding is that Kubernetes uses thisspec
field to use as a reference point to be the “ideal state of the world”, using thestatus
field to record the “current state of the resource”.The only offending case I’ve seen in the wild is a
Service
, as it updates both thestatus
field AND thespec.ClusterIP
field. The end result is that akubectl replace
with aService
definingspec.ClusterIP = ""
results in differing behaviour than the rest of the resource APIs.From Helm’s point of view, a chart author’s only course of action is to
a) omit
spec.clusterIP
from the manifest and all subsequent updates, or b) if you happened to definespec.clusterIP = ""
in your first install, find the value Kubernetes provided and inject it into your template so that the value Kubernetes provided is preseved.Failing to do so results in a failed PUT (
helm upgrade --force
) request. I wrote this behaviour down in-depth in this Helm comment: https://github.com/helm/helm/issues/6378#issuecomment-557746499I’d propose an alternative solution than @khenidak’s solutions: instead of updating
spec.clusterIP
after an IP has been assigned to the Service, use thestatus
API, i.e.status.clusterIP
. It would be a breaking change to the API, but it should bring Services back in line with the rest of the resource APIs, leavingspec.clusterIP
as a user-defined field.At this point I am willing to call it a bug, albeit an unpleasant one.
Thanks for confirming and providing the details @bacongobbler
I am probably missing some subtlety here, but as in https://github.com/helm/helm/issues/7956#issuecomment-640912141 I suspect that Helm is behaving properly and the problem is in way Kubernetes handles
Service
s: thespec
of a resource should never be rewritten with variable environmental information; that is whatstatus
is for. In other words.spec.clusterIP
could be set in the unusual case that you wished to pick an IP yourself but would normally be null;.status.clusterIP
would reflect the service’s current value, expected to be retained across unrelated.spec
patches.It is a bug. Not because it does not match rest api semantics. But because it is not consistent with the rest of the apis. In other words, If user can do
replace (aka idempotent PUT)
on other resources they should be able to do it onServices
as well. It is hard (and weird) to ask users to do one thing for all resources, but then do a totally different thing for Services.As discussed over sig-network the problem will be user intent. Some ideas for a solution:
old.Spec.ClusterIP=<IP> && new.Spec.ClusterIP=<empty-string>
irrespective of changes in othernew.Spec.OtherFields
we can assume that the user is replacing and does not care about ClusterIP value, hence copyingold.Spec.ClusterIP
to newnew.Spec.ClusterIP
should be fine. It is basically like the user sayingI am ok with whatever clusterIP you give me, or change it
so copying old=>new should be Ok.old.Spec.ClusterIP=<IP> && new.Spec.ClusterIP=<other-IP>
that is a failure case, because it breaks ClusterIP immutability rules (@thockin we should really, really remove ClusterIP immutability but i digress).service.Spec.Type
with a type that does not require ClusterIP and not setting it to<empty-string>
should fail.I can look into building a POC for this.
/cc (i know i am going to regret this)
@thockin wrote
I am shocked that you would think that, although I don’t really know what you mean by “hard objective”. The Kubernetes API Concepts doc certainly implies
PUT
should be idempotent, via such statements asand
Moving on to how to do make it so, the idea of
PUT
is that at the end of the operation, the object is in the state described by the action. If you PUT ports P and Q and the existing object has ports R and S, you delete R and S and add P and Q. If adding or deleting a port is not allowed, then you reject the PUT.I think to some extent you are overthinking/overdesigning a solution here. In Kubernetes, every PUT is to some extent a patch because there are some fields that are not specifiable via PUT, like all the managed fields, or
creationTimestamp
, and fields likeselfLink
andresourceVersion
are special in a whole different way. But that stuff should already be handled by the more generalized resource handling architecture. The general rule about PUT is that PUT on an existing resource should adjust that resource to be in the same working configuration as if you used that PUT to create a new resource. Calling PUT a second time with the same input should be acceptable and should not change the resource.I believe that Kubernetes generally has a canonical ordering of all fields in a resource (and if there is no clear way to order a list, the canonical order is simply the order in which the list was originally specified on resource creation), so what you do is enumerate the fields of the existing resource in canonical order, filter out the fields like
creationTimestamp
that PUT cannot specify, and then compare it to the PUT resource, after normalizing it into canonical order, and add/delete/replace field by field.Clearly something like this is already happening in handling PUT and PATCH. This should not be a lot of new work.
The ongoing community complaint is that when PUT says
clusterIP: ""
and it is matched against the exitingclusterIP: 172.20.4.5
the API rejects it, saying you can’t change theclusterIP
. All we are asking for is that this case be interpreted differently and allowed.clusterIP: ""
is documented as an allowable value in the input, but it is never an allowable value in the realized resource, so it cannot possibly mean to be interpreted as “force this value on the resource”. Its meaning is that the system should assign a valid value and the user does not care what it is. So on PUT, whenclusterIP: ""
is matched against the exitingclusterIP: 172.20.4.5
, he API should just confirm thatclusterIP: 172.20.4.5
is a valid value and, if so, accept the input.keepalive
It will be part of 1.22 or the next 1.21.X (I don’t remember now what is the latest stable)
That document does allow for breaking changes under the apiVersion. Instead of making a breaking change to v1 services, you can introduce Service v2. Existing v1 services migrate their spec.clusterIP’s existing value (auto-generated or not) to the new version. spec.clusterIP and status.clusterIP would now match their corresponding relationships, and Kubernetes retains backwards/forwards compatibility.
/remove-sig cli /sig network
/assign @thockin this is sort of the inverse of the issue fixed in https://github.com/kubernetes/kubernetes/pull/95196
@Nuru
Using
kubectl replace
with a static manifest on an object that has spec fields populated by the API server (like clusterIP in service) or controller (like replicas in deployment) is an anti-pattern.kubectl apply
is a better match for this usage. Alternatively, a controller that did a ‘read -> modify -> replace’ could be ok, since it would be starting with the current API server state.