kubernetes: Namespace stuck in Terminating when deleted if ApiService doesn't implement Aggregated Discovery
What happened?
We implement a manual APIService API extension in Agones.
Definition: https://github.com/googleforgames/agones/blob/main/install/helm/agones/templates/service/allocation.yaml
Code for handling web requests: https://github.com/googleforgames/agones/blob/main/pkg/util/apiserver/apiserver.go
The code is extremely lightweight, as we have no need for storage and only accept CREATE requests at this time.
This issue was first reported by one of our users in May, but we were only able to reproduce it ourselves once we were on 1.27.x: https://github.com/googleforgames/agones/issues/3172
Testing on Kubernetes 1.27.x, we noticed that when we attempted to delete Namespaces, they would get stuck in Terminating with the following description:
❯ kubectl describe ns 1690585578
Name: 1690585578
Labels: kubernetes.io/metadata.name=1690585578
owner=e2e-test
Annotations: <none>
Status: Terminating
Conditions:
Type Status LastTransitionTime Reason Message
---- ------ ------------------ ------ -------
NamespaceDeletionDiscoveryFailure True Fri, 28 Jul 2023 16:07:44 -0700 DiscoveryFailed Discovery failed for some groups, 1 failing: unable to retrieve the complete list of server APIs: allocation.agones.dev/v1: stale GroupVersion discovery: allocation.agones.dev/v1
NamespaceDeletionGroupVersionParsingFailure False Fri, 28 Jul 2023 16:07:46 -0700 ParsedGroupVersions All legacy kube types successfully parsed
NamespaceDeletionContentFailure False Fri, 28 Jul 2023 16:08:56 -0700 ContentDeleted All content successfully deleted, may be waiting on finalization
NamespaceContentRemaining False Fri, 28 Jul 2023 16:08:56 -0700 ContentRemoved All content successfully removed
NamespaceFinalizersRemaining False Fri, 28 Jul 2023 16:08:56 -0700 ContentHasNoFinalizers All content-preserving finalizers finished
So far, I’ve not found a way to delete the Namespace.
Looking through the log for the webserver, I can see requests for /apis, including ones for the new Aggregated Discovery feature (Accept: "application/json;g=apidiscovery.k8s.io;v=v2beta1;as=APIGroupDiscoveryList"), but we return a 404 for those requests, since the feature is not implemented.
Looking through the code I could find (https://github.com/kubernetes/kubernetes/blob/v1.27.3/staging/src/k8s.io/kube-aggregator/pkg/apiserver/handler_discovery.go), it seems that any response other than http.StatusOK results in this issue, which breaks backward compatibility for APIService by assuming that this API surface is implemented in some way.
What did you expect to happen?
Namespaces would terminate as normal.
How can we reproduce it (as minimally and precisely as possible)?
- Install Agones on a cluster (https://agones.dev/site/docs/installation/install-agones/)
- Create and then delete a namespace:
  kubectl create ns foo
  kubectl delete ns foo
- Watch as the namespace gets stuck in Terminating.
Anything else we need to know?
If I had a magic wand, I’d love a reference of all the APIs that get called against an APIService and what their expected results should be.
So far it’s been a combination of using kubectl proxy and looking at the k8s API responses, crawling through code, and reading apiserver logs to try to reverse engineer this work.
Kubernetes version
❯ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:53:42Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.2-gke.1200", GitCommit:"5319597f0ffe6e93e83a51e280d81fb2028bf4a0", GitTreeState:"clean", BuildDate:"2023-06-01T19:54:16Z", GoVersion:"go1.20.4 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider
OS version
# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux trixie/sid"
NAME="Debian GNU/Linux"
VERSION_CODENAME=trixie
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
$ uname -a
Linux markmandel 6.3.0-1-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.3.7-1 (2023-06-12) x86_64 GNU/Linux
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, …) and versions (if applicable)
About this issue
- State: closed
- Created a year ago
- Comments: 16 (14 by maintainers)
Commits related to this issue
- APIService: Updates to handlers for 1.27.x This includes a bug fix for the issue outlined in https://github.com/kubernetes/kubernetes/issues/119662, specifically returning HTTP 406 for root /apis for... — committed to markmandel/agones by markmandel a year ago
- APIService: Updates to handlers for 1.27.x (#3297) * APIService: Updates to handlers for 1.27.x This includes a bug fix for the issue outlined in https://github.com/kubernetes/kubernetes/issues/1... — committed to googleforgames/agones by markmandel a year ago
Oh yes, have an implementation and was working on the tests. Will send it out today
If possible, let’s get that open for review earlier, and craft it to be as minimal a change as possible, since I assume we’ll backport it to 1.28 and 1.27.