kubernetes: Admission Controller Webhook configuration rule cannot exclude specific resources
What would you like to be added:
The webhook configuration (for both ValidatingWebhookConfiguration and MutatingWebhookConfiguration, reference https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/) supports rules to select which types of resources should be sent to the webhook endpoint. Each rule definition can include a set of resources that specifies which resource types should be selected.
What is missing is the ability to specify what resources to exclude. There does not appear to be a way to give any negative selection criteria.
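For illustration, here is a minimal sketch of a rule in a ValidatingWebhookConfiguration (the webhook name and service details are placeholders, not from this issue); note that rules can only express what to include, and there is no corresponding field for exclusions:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-policy           # placeholder name
webhooks:
  - name: policy.example.com     # placeholder webhook name
    clientConfig:
      service:
        namespace: policy-system # placeholder service location
        name: policy-webhook
        path: /validate
    rules:
      # Rules state only what IS sent to the webhook...
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments", "statefulsets"]
        scope: "*"
      # ...there is no way to say "everything EXCEPT these resources".
    admissionReviewVersions: ["v1"]
    sideEffects: None
```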
Why is this needed: In practice, in systems where an external policy engine such as OPA is used as the webhook endpoint, it is common to want broad coverage from the admission policy. This is typically done by specifying wildcards in the webhook configuration and allowing the policies in OPA to self-select which types of objects they apply to.
In these same systems, with wildcard webhook configurations and a failurePolicy of Fail, it is extremely difficult to write rule definitions that allow a cluster to recover from network failure modes.
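A sketch of the wildcard pattern just described (again with placeholder names); with failurePolicy: Fail, every matching request is rejected while the webhook endpoint is unreachable:

```yaml
webhooks:
  - name: opa.example.com          # placeholder for an OPA-backed webhook
    failurePolicy: Fail            # requests are rejected if the webhook cannot be reached
    rules:
      - apiGroups: ["*"]           # every API group...
        apiVersions: ["*"]         # ...every version...
        operations: ["*"]          # ...every operation...
        resources: ["*"]           # ...every resource is sent to the webhook
    clientConfig:
      service:
        namespace: opa             # placeholder service location
        name: opa
        path: /v1/admit
    admissionReviewVersions: ["v1"]
    sideEffects: None
```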
For example, if, for the sake of argument, the network is based on Calico and requires objects in the cluster to be modified (e.g. ippools, felixconfigurations, clusterinformations, bgpconfigurations, bgppeers, etc.) when the network is down, the API server may not be able to successfully call the webhook, which can prevent the network controllers from converging back to a working state. The only way to correct the issue is to remove the webhook, allow the network to repair itself, and re-apply it. Throughout this time, however, the cluster is potentially unprotected.
Ideally a cluster admin could configure the webhook to exclude certain resources. The only option right now appears to be to specify every type of resource expected to be in the cluster and omit the ones that should be excluded. In practice, the list of all resources allowed on the cluster is often unknown at the time the webhook is configured, making this approach infeasible.
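A sketch of what that workaround looks like in practice (the enumerated groups below are illustrative, not a recommendation); anything not listed, including CRDs installed later, silently escapes admission:

```yaml
rules:
  # Enumerate everything you want covered, leaving out the resources to "exclude".
  - apiGroups: ["", "apps", "batch", "networking.k8s.io"]   # illustrative subset only
    apiVersions: ["*"]
    operations: ["*"]
    resources: ["*"]
  # crd.projectcalico.org is deliberately omitted here, but so is every other
  # group the admin did not know about when this list was written.
```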
Note: The namespace selector will not help in this scenario, as there can be cluster-scoped resources involved.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 18
- Comments: 51 (40 by maintainers)
Commits related to this issue
- Fix Namespace selector & ensure correct wh version The name space selectors must be in a specific format https://github.com/kubernetes/kubernetes/issues/92157#issuecomment-825160958 Once those are tu... — committed to jsturtevant/osm by jsturtevant 2 years ago
- Fix Namespace selector & ensure correct wh version (#5191) The namespace selectors must be in a specific format https://github.com/kubernetes/kubernetes/issues/92157#issuecomment-825160958 Once tho... — committed to openservicemesh/osm by jsturtevant 2 years ago
- --EKS-PATCH-- admission webhook exclusion from file Workaround for Kubernetes issue: https://github.com/kubernetes/kubernetes/issues/92157 This change allows for the bypassing of admission controlle... — committed to kmala/kubernetes by ivelichkovich 2 years ago
As of 1.21, namespaces have a default kubernetes.io/metadata.name label on them matching their name (see https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/2161-apiserver-default-labels/README.md). If you want to exclude specifically named namespaces in addition to namespaces with your excluded-namespace label, in 1.21+ you can do this:
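(A sketch of the kind of namespaceSelector described; the excluded-namespace label key comes from the comment above, while the specific namespace names are placeholders.)

```yaml
namespaceSelector:
  matchExpressions:
    # skip namespaces that opt out via the exclusion label
    - key: excluded-namespace
      operator: DoesNotExist
    # skip specifically named namespaces using the 1.21+ default name label
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values: ["kube-system", "calico-system"]   # placeholder names
```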
/assign
I’d like to help drive this forward for v1.27. I’ll get started on a KEP.
/milestone v1.27
To be clear, @liggitt’s answer addresses @yiqigao217’s concern; mainly I wanted to keep focus on the core issue.
I had a chat with @maxsmythe and @tallclair about this and I had another idea.
The theory is that a webhook author really does want their policy enforced. But for some operations, it’s more important that the cluster not burst into flames. And the webhook author might not know which operations those are for the cluster.
So, the idea is, instead of making every webhook separately exclude some operations, what if we made a global critical-operation constraint that listed specific types and operations, and listed the maximum timeout?
So, now, when calling a webhook, we match the {type, operation} with both the webhook configurations and the critical-operation constraint. If it matches the latter, when we call the webhook we use the Ignore failure policy and the listed max timeout. Otherwise, we use the webhook’s desired values. If this causes us to ignore a webhook, we can e.g. write an audit log about it.
This separation would permit people who know what’s performance sensitive in the cluster to write down a sensible critical-operation constraint, and free webhook authors from having to know this (which is good, because it could be different in different clusters).
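To make the proposal concrete, here is a purely hypothetical sketch of what such a critical-operation constraint could look like (no such API exists in Kubernetes; the kind, field names, and cap value are invented for illustration, with the resources borrowed from the Calico scenario earlier in this issue):

```yaml
# Hypothetical API: not implemented anywhere; shape invented to illustrate the idea above.
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: CriticalOperationConstraint        # invented kind
metadata:
  name: network-recovery
spec:
  # {type, operation} pairs that must never be blocked by an unreachable webhook
  rules:
    - apiGroups: ["crd.projectcalico.org"]
      resources: ["ippools", "felixconfigurations", "bgpconfigurations", "bgppeers"]
      operations: ["CREATE", "UPDATE"]
  # Matching requests call webhooks with failurePolicy Ignore and a timeout
  # capped at this value, regardless of what the webhook configuration asks for.
  maxTimeoutSeconds: 2
```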
Of course, webhook authors will have to deal with getting “Ignore” when they asked for “Fail” sometimes. I think this is acceptable, because they should already have a post-facto reconciliation mechanism (Gatekeeper does). If a webhook can’t deal with this, then the cluster that wants to run the webhook shouldn’t use the critical-operation constraint feature, and that’s fine; some people probably do want their cluster to burst into flames rather than admit a policy-violating object even for a millisecond.
Excellent points all, I agree a meeting would be good as there is a lot of data bandwidth in this discussion.
A few counter-points:
I think “installed correctly” is a matter of perspective here. In cases where there is a root system admin placed above the policy admin, they would need the ability to scope webhooks such that they don’t interfere with cluster ops. NPC is better for validating scope limitations.
“safer” is also a matter of perspective. A policy author probably does want as wide an enforcement scope as possible, however a sysadmin may know that calling webhooks on certain objects is too dangerous for the availability of the cluster and detection + manual remediation are the safest course.
This is not necessarily true. It is certainly possible to write a policy that applies to all resources; any restriction on metadata, for example.
This is a good idea generally speaking. I don’t think it completely solves the issue, as a well-functioning policy could still be subject to transient errors such as webhook downtime, or changes to the system after policy installation, such as installing Istio.
This is important to keep in mind, but the inverse of this logic is that if we delay adding this feature, its GA would be that much further out. I believe this would be a generally useful feature. Even if kludgy workarounds are necessary in the short term, knowing there is a path forward is important.
The problem here is that a second webhook would need to be added, so a webhook would still be called against critical resources.
Temporarily setting aside delegation and whether policies should be universally enforced, there is still the problem that any calling of a webhook whatsoever can be dangerous to cluster availability for certain resources that core cluster functionality relies on.
Bringing back the delegation consideration… you are assuming that authors of policies are experts in K8s clusters and know which resources may be problematic to enforce policy on via webhook. In a specialized work environment where policies are set by compliance teams, that may not be the case. Defining a safe set of resources where compliance teams do not need to worry about breaking the cluster allows them to be more responsive to new risks and compliance requirements as they do not need to coordinate across departments.
In those cases where policy authors still want compliance checks on resources where active enforcement would be too dangerous… this is where a reconciliation plan is necessary, usually consisting of regular audits.
END REPLIES
One thing I haven’t heard an adequate alternative for is making sure the escape hatch of deleting the webhook config object is preserved. There was one suggestion of manually triggering reconciles, but this breaks the declarative model as it makes the user responsible for actuation.
I would also like to signal boost @tallclair’s point about statically configured webhooks.
One area where this breaks down is if the policy admin and the cluster admin are two different people. The biggest reason to avoid webhook calls is to prevent downtime in the webhook from causing resources to become unmodifiable. It is certainly reasonable for the policy admin to say something like “I want everyone to set an owner label on their resources” and for the root admin to specifically carve out important resources from that policy because the impact on the cluster would be too great. In larger organizations, the person setting policy and the person managing the cluster may not have overlapping knowledge sets, and allowing the cluster admin to prevent sub-admins from breaking their cluster seems a compelling use case.
Setting the policy on the webhook also prevents privilege-escalation-type DoS attacks, where a malicious webhook admin (who otherwise has no access to VWH configs) adds resources that they know would bork the cluster, causing the ValidatingWebhookConfiguration to get modified to include those resources.
Probably the biggest reason to avoid doing this is that it potentially closes the “escape hatch” built into validating webhooks. By default, VWH configurations are not subject to webhook validation, because doing so could cause a circular dependency whereby the cluster is hard-downed with no remedy. Since VWH configurations can always be removed, a bad webhook can always be disabled. If you have a controller enforcing the existence of this webhook configuration, then that escape hatch is at risk: a user could attempt to modify/delete the VWH config only to see it recreated, leaving them unable to remedy their webhook issue.
That’s an interesting point. I guess the downside is that there would be a window of opportunity, between when a new CRD is added and when the proposed controller detects it and updates the webhook configuration, during which policies that would otherwise have been applied are ignored. I would hazard a guess that the folks who want fail-closed admission webhooks would not be keen on having that sort of problem.