kubernetes: Namespace stuck in Terminating when deleted if ApiService doesn't implement Aggregated Discovery

What happened?

We implement a manual APIService API Extension in Agones.

Definition: https://github.com/googleforgames/agones/blob/main/install/helm/agones/templates/service/allocation.yaml

Code for handling web requests: https://github.com/googleforgames/agones/blob/main/pkg/util/apiserver/apiserver.go

The code is extremely lightweight, as we have no need for storage and only accept CREATE requests at this time.

This issue was first reported by one of our users in May, but we were only able to reproduce it ourselves once we were on 1.27.x: https://github.com/googleforgames/agones/issues/3172

Testing on Kubernetes 1.27.x, we noticed that Namespaces we attempted to delete would get stuck in Terminating with the following description:

❯ kubectl describe ns 1690585578
Name:         1690585578
Labels:       kubernetes.io/metadata.name=1690585578
              owner=e2e-test
Annotations:  <none>
Status:       Terminating
Conditions:
  Type                                         Status  LastTransitionTime               Reason                  Message
  ----                                         ------  ------------------               ------                  -------
  NamespaceDeletionDiscoveryFailure            True    Fri, 28 Jul 2023 16:07:44 -0700  DiscoveryFailed         Discovery failed for some groups, 1 failing: unable to retrieve the complete list of server APIs: allocation.agones.dev/v1: stale GroupVersion discovery: allocation.agones.dev/v1
  NamespaceDeletionGroupVersionParsingFailure  False   Fri, 28 Jul 2023 16:07:46 -0700  ParsedGroupVersions     All legacy kube types successfully parsed
  NamespaceDeletionContentFailure              False   Fri, 28 Jul 2023 16:08:56 -0700  ContentDeleted          All content successfully deleted, may be waiting on finalization
  NamespaceContentRemaining                    False   Fri, 28 Jul 2023 16:08:56 -0700  ContentRemoved          All content successfully removed
  NamespaceFinalizersRemaining                 False   Fri, 28 Jul 2023 16:08:56 -0700  ContentHasNoFinalizers  All content-preserving finalizers finished

So far, I’ve not found a way to delete the Namespace.

Looking through the log for the webserver, I can see requests for /apis and for the new Aggregated Discovery feature (Accept: "application/json;g=apidiscovery.k8s.io;v=v2beta1;as=APIGroupDiscoveryList"), but we return a 404 to those requests, since it’s not implemented.
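
To make the request pattern above concrete, here is a minimal Go sketch of how an extension server could recognize an Aggregated Discovery request from its Accept header. The function name wantsAggregatedDiscovery is hypothetical and this is not the Agones implementation. It only checks the media type parameters quoted in the log line above, and it assumes the header contains no quoted commas.

```go
package main

import (
	"mime"
	"strings"
)

// wantsAggregatedDiscovery reports whether an Accept header value asks for the
// aggregated discovery document (g=apidiscovery.k8s.io, as=APIGroupDiscoveryList).
// Hypothetical helper for illustration only.
func wantsAggregatedDiscovery(accept string) bool {
	// An Accept header may list several media types separated by commas.
	for _, part := range strings.Split(accept, ",") {
		mediaType, params, err := mime.ParseMediaType(strings.TrimSpace(part))
		if err != nil {
			continue // skip entries that do not parse as a media type
		}
		if mediaType == "application/json" &&
			params["g"] == "apidiscovery.k8s.io" &&
			params["as"] == "APIGroupDiscoveryList" {
			return true
		}
	}
	return false
}
```

A handler could call this with r.Header.Get("Accept") to log or branch on discovery requests separately from ordinary API calls.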

Looking through the code I can find (https://github.com/kubernetes/kubernetes/blob/v1.27.3/staging/src/k8s.io/kube-aggregator/pkg/apiserver/handler_discovery.go), it seems that any response other than http.StatusOK would result in this issue. That breaks backward compatibility for APIService, because it assumes this API surface is implemented in some way.
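
The compatibility concern can be illustrated with a small Go sketch that contrasts two ways a caller might treat the status code of an Aggregated Discovery request. This is not the kube-aggregator code. strictPolicy models the behavior as read in this report, and tolerantPolicy is a hypothetical alternative, assumed here purely for illustration, in which an extension server that does not implement the endpoint is handled through legacy discovery instead. All names below are invented for the example.

```go
package main

import "net/http"

// discoveryOutcome names what the caller does next. Hypothetical type.
type discoveryOutcome string

const (
	useAggregated  discoveryOutcome = "use-aggregated"
	fallbackLegacy discoveryOutcome = "fallback-to-legacy"
	markStale      discoveryOutcome = "mark-stale"
)

// strictPolicy models the reading in this report: only 200 OK is accepted,
// and every other status leaves the GroupVersion marked as stale.
func strictPolicy(status int) discoveryOutcome {
	if status == http.StatusOK {
		return useAggregated
	}
	return markStale
}

// tolerantPolicy is an assumed alternative: a server that answers 404 or 406
// is treated as not implementing aggregated discovery, so the caller falls
// back to the legacy discovery endpoints instead of failing.
func tolerantPolicy(status int) discoveryOutcome {
	switch status {
	case http.StatusOK:
		return useAggregated
	case http.StatusNotFound, http.StatusNotAcceptable:
		return fallbackLegacy
	default:
		return markStale
	}
}
```

Under the strict policy, the 404 returned by our webserver maps to mark-stale, which matches the "stale GroupVersion discovery" message in the Namespace condition above.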

What did you expect to happen?

Namespaces would terminate as per normal.

How can we reproduce it (as minimally and precisely as possible)?

  1. Install Agones on a cluster (https://agones.dev/site/docs/installation/install-agones/)
  2. kubectl create ns foo
  3. kubectl delete ns foo
  4. Watch as the ns gets stuck in Terminating.

Anything else we need to know?

If I had a magic wand, I’d love a reference of all the APIs that get called against an APIService and what their expected results should be.

So far it’s a combo of using kubectl proxy and looking at the k8s API responses, crawling through code, and reading apiserver logs to try and reverse engineer how this works.

Kubernetes version

❯ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:53:42Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.2-gke.1200", GitCommit:"5319597f0ffe6e93e83a51e280d81fb2028bf4a0", GitTreeState:"clean", BuildDate:"2023-06-01T19:54:16Z", GoVersion:"go1.20.4 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

Google Kubernetes Engine

OS version

# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux trixie/sid"
NAME="Debian GNU/Linux"
VERSION_CODENAME=trixie
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
$ uname -a
Linux markmandel 6.3.0-1-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.3.7-1 (2023-06-12) x86_64 GNU/Linux

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, …) and versions (if applicable)

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 3
  • Comments: 16 (14 by maintainers)

Most upvoted comments

Oh yes, have an implementation and was working on the tests. Will send it out today

Yeah I agree, we made a couple assumptions based on the sample apiserver that could be improved for stronger compatibility guarantees. Will look into sending a PR for this once code freeze is over.

If possible, let’s get that open for review earlier and craft it to be as minimal a change as possible, since I assume we’ll backport it to 1.28 and 1.27.