kserve: Uber Issue: KFServing admission hook causing widespread issues because it's a global admission hook

/kind bug

We are getting lots of reports about problems caused when the KFServing admission webhook is unavailable, which prevents pods from being created. The error message looks like the following:

4m58s       Warning   FailedCreate                   replicaset/activator-5484756f7b          Error creating: Internal error occurred: failed calling webhook "inferenceservice.kfserving-webhook-server.pod-mutator": Post https://kfserving-webhook-server-service.kubeflow.svc:443/mutate-pods?timeout=30s: service "kfserving-webhook-server-service" not found

Here’s my understanding:

  • Currently admission webhooks cannot be scoped by object (pod) label, so a pod admission hook is applied to all pods

  • The KFServing admission hook is applied to every pod; the hook itself then checks whether the pod belongs to a KFServing resource and, if it does, mutates it

  • However, if the KFServing webhook deployment is unavailable, pod creation can be blocked

  • For a variety of reasons we can end up in a deadlock state where (a sketch for breaking this deadlock follows right after this list):

    • The webhook is registered, but the deployment and service backing it are missing, so calls to the admission hook fail
    • Pod creation now fails because those webhook calls fail, which also blocks re-creating the webhook's own pods
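
If you are stuck in this state, one way out is to temporarily remove the orphaned webhook configuration so pod creation can proceed again; per the notes further down, the controller (re)creates the configuration on startup, so nothing is lost once KFServing is reinstalled or its controller comes back. A minimal sketch, using the configuration name from the patch command later in this thread:

    # Confirm the configuration is registered even though nothing is serving it
    kubectl get mutatingwebhookconfigurations

    # Deleting it unblocks pod creation; the KFServing controller re-registers it when it runs again
    kubectl delete mutatingwebhookconfiguration inferenceservice.serving.kubeflow.org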

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 3
  • Comments: 43 (18 by maintainers)

Most upvoted comments

@maganaluis We need to use an objectSelector on the mutating webhook configuration so that only KFServing-labelled pods go through the KFServing pod mutator. The problem is that objectSelector is only supported on Kubernetes 1.15+, while Kubeflow's minimum requirement is still Kubernetes 1.14. If you are on Kubernetes 1.15+ you can use the following command to solve the issue:

kubectl patch mutatingwebhookconfiguration inferenceservice.serving.kubeflow.org --patch '{"webhooks":[{"name": "inferenceservice.kfserving-webhook-server.pod-mutator","objectSelector":{"matchExpressions":[{"key":"serving.kubeflow.org/inferenceservice", "operator": "Exists"}]}}]}'
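
For reference, the patched webhook entry should end up looking roughly like this excerpt (a sketch of just the relevant fields, not the full configuration):

    webhooks:
    - name: inferenceservice.kfserving-webhook-server.pod-mutator
      objectSelector:
        matchExpressions:
        - key: serving.kubeflow.org/inferenceservice
          operator: Exists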

Possible fixes

  1. Add the label control-plane to the kubeflow namespace (see the namespaceSelector sketch below for why this presumably works)

    kubectl label namespace kubeflow control-plane=true
    

  2. Change the namespaceSelector to be opt-in; match namespaces with specific labels

    • This won't work if edited by hand, because the changes will be overwritten when the controller restarts (the controller creates the webhook configuration)

Ref: https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#matching-requests-namespaceselector
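
For context, the reason labelling the kubeflow namespace with control-plane (fix 1) helps is presumably that the webhook's namespaceSelector excludes namespaces carrying that label, roughly like the following sketch (an assumption inferred from the fix, not copied from the shipped manifest):

    # Assumed shape of the default selector: skip any namespace labelled control-plane
    namespaceSelector:
      matchExpressions:
      - key: control-plane
        operator: DoesNotExist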

Possible Workarounds

  • Add the label control-plane to the kubeflow namespace
  • Update the inferenceservice webhook to change the namespaceSelector to be opt-in.

A possible recipe

  1. Get the inferenceservice webhook configuration

    kubectl -n kubeflow get mutatingwebhookconfiguration inferenceservice.serving.kubeflow.org -o yaml > /tmp/inferenceservice.yaml
    
  2. Change the namespaceSelector

    namespaceSelector:
      matchLabels:
        serving.kubeflow.org: "true"
    
  3. Apply it

    kubectl apply -f /tmp/inferenceservice.yaml
    
  4. Label any namespace in which you want to use KFServing (see the verification sketch after this recipe):

    kubectl label namespace ${NAMESPACE} serving.kubeflow.org=true
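
To check that the edit took effect (and, given the caveat above, that it hasn't been overwritten by a controller restart), you can inspect the live object; a hypothetical verification:

    kubectl get mutatingwebhookconfiguration inferenceservice.serving.kubeflow.org \
      -o jsonpath='{.webhooks[0].namespaceSelector}'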
    

@animeshsingh @jlewi We encountered this issue when I was testing KFP multi-user support; I actually just re-verified this with the latest changes on the manifests repo. The problem first appears when the control-plane label is set on the kubeflow namespace: it prevents Istio sidecar injection from working in that namespace, so in order to have multi-user support for KFP we removed the label. I didn't investigate any further why this happens, but I'd love to get some documentation on what KFServing is doing in those webhooks.

The second issue is the deadlock outlined above. It happened when I deleted the Kubeflow resources and then attempted to reinstall Kubeflow (with Istio already installed); this caused the widespread issue that prevented any pods from being created.

To avoid the deadlock and to have sidecar injection in the kubeflow namespace, I had to re-apply the profile with istioctl (we are using 1.6), create the kubeflow namespace without the control-plane label, and then proceed to install Kubeflow, KFServing, Knative, etc.
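
In case it helps others, a rough sketch of that recovery sequence (the exact profile and install commands are illustrative and depend on your setup):

    # Re-apply the Istio profile (Istio 1.6)
    istioctl install --set profile=default

    # Recreate the kubeflow namespace without the control-plane label so sidecar injection works
    kubectl create namespace kubeflow

    # ...then reinstall Kubeflow, KFServing, Knative, etc. with your usual tooling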

It’s quite strange, but I suspect you or others will run into these issues, so I wanted to post this information here.

So this is related to https://github.com/kubeflow/kfserving/issues/480. As long as the KFServing controller is available, things work, even though it is looking at every Pod. The way that all Pod submissions can fail is if the controller itself isn't available. When Kubernetes tries to bring it back, the hook fires, but the controller isn't there to serve the hook, so you get a catch-22.
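
A quick way to confirm you are in that state (a sketch, using the names from the error message at the top of this issue):

    # Blocked pod creations show up as FailedCreate events mentioning the pod-mutator webhook
    kubectl get events -A --field-selector reason=FailedCreate

    # The webhook configuration is registered, but the service it points at is gone
    kubectl get mutatingwebhookconfiguration inferenceservice.serving.kubeflow.org
    kubectl -n kubeflow get service kfserving-webhook-server-service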