kubernetes: apiserver crashes on concurrent map iteration and write

What happened?

During periods of webhook failures, the apiserver can crash while generating the audit log entry for the affected request. Specifically, when a large number of pod deletions are intercepted by a mutating webhook that is timing out, the following fatal error can be encountered:

E0905 18:47:26.898105      11 dispatcher.go:184] failed calling webhook "webhook.com": failed to call webhook: Post "https://webhook:443/mutate?timeout=12s": context deadline exceeded
fatal error: concurrent map iteration and map write

goroutine 295461 [running]:
reflect.mapiterinit(0x451512?, 0x200?, 0x48da4c0?)
    /usr/local/go/src/runtime/map.go:1375 +0x19
reflect.(*MapIter).Next(0x4528ac0?)
    /usr/local/go/src/reflect/value.go:1919 +0x65
encoding/json.mapEncoder.encode({0xc05cea2cf8?}, 0xc06363ae00, {0x4528ac0?, 0xc0b2bdf170?, 0xc0b2bdf0b0?}, {0xe?, 0x0?})
    /usr/local/go/src/encoding/json/encode.go:797 +0x33e
encoding/json.structEncoder.encode({{{0xc0208ea000?, 0x3dfdbc4?, 0x5?}, 0xc041e42b40?}}, 0xc06363ae00, {0x4e6fa60?, 0xc0b2bdf040?, 0x3dfdbc4?}, {0x0, 0x1})
    /usr/local/go/src/encoding/json/encode.go:759 +0x1f4
encoding/json.ptrEncoder.encode({0x0?}, 0xc06363ae00, {0x4e16cc0?, 0xc0b2bdf040?, 0x4e16cc0?}, {0xb6?, 0x68?})
    /usr/local/go/src/encoding/json/encode.go:943 +0x21c
encoding/json.(*encodeState).reflectValue(0xc05cea2fb0?, {0x4e16cc0?, 0xc0b2bdf040?, 0xc0ab3181a0?}, {0x60?, 0x2c?})
    /usr/local/go/src/encoding/json/encode.go:358 +0x78
encoding/json.(*encodeState).marshal(0x58d8a20?, {0x4e16cc0?, 0xc0b2bdf040?}, {0xb9?, 0xe?})
    /usr/local/go/src/encoding/json/encode.go:330 +0xfa
encoding/json.(*Encoder).Encode(0xc05cea30d0, {0x4e16cc0, 0xc0b2bdf040})
    /usr/local/go/src/encoding/json/stream.go:209 +0xf3
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/runtime/serializer/json.(*Serializer).doEncode(0x9cc5cf?, {0x58e09e0?, 0xc0b2bdf040?}, {0x58c99e0, 0xc0b2c2fc20})
    vendor/k8s.io/apimachinery/pkg/runtime/serializer/json/json.go:246 +0x19a
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/runtime/serializer/json.(*Serializer).Encode(0xc000344c30, {0x58e09e0, 0xc0b2bdf040}, {0x58c99e0, 0xc0b2c2fc20})
    vendor/k8s.io/apimachinery/pkg/runtime/serializer/json/json.go:220 +0xfc
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/runtime/serializer/versioning.(*codec).doEncode(0xc00093dc20, {0x58e0940, 0xc04f459a40}, {0x58c99e0, 0xc0b2c2fc20}, {0x0?, 0x0?})
    vendor/k8s.io/apimachinery/pkg/runtime/serializer/versioning/versioning.go:284 +0xa12
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/runtime/serializer/versioning.(*codec).encode(0xc00093dc20, {0x58e0940, 0xc04f459a40}, {0x58c99e0, 0xc0b2c2fc20}, {0x0?, 0x0?})
    vendor/k8s.io/apimachinery/pkg/runtime/serializer/versioning/versioning.go:214 +0x167
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/runtime/serializer/versioning.(*codec).Encode(0xc05cb67d10?, {0x58e0940?, 0xc04f459a40?}, {0x58c99e0?, 0xc0b2c2fc20?})
    vendor/k8s.io/apimachinery/pkg/runtime/serializer/versioning/versioning.go:207 +0x33
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/runtime.Encode({0x7f407ed689c8, 0xc00093dc20}, {0x58e0940, 0xc04f459a40})
    vendor/k8s.io/apimachinery/pkg/runtime/codec.go:50 +0x64
k8s.io/kubernetes/vendor/k8s.io/apiserver/plugin/pkg/audit/log.(*backend).logEvent(0xc000676750, 0xc04f459a40)
    vendor/k8s.io/apiserver/plugin/pkg/audit/log/backend.go:76 +0x73
k8s.io/kubernetes/vendor/k8s.io/apiserver/plugin/pkg/audit/log.(*backend).ProcessEvents(0x40e167?, {0xc09db7fcf8, 0x1, 0xc0b2bc6d01?})
    vendor/k8s.io/apiserver/plugin/pkg/audit/log/backend.go:65 +0x45
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/options.(*ignoreErrorsBackend).ProcessEvents(0x59013c0?, {0xc09db7fcf8?, 0x40e167?, 0x98?})
    vendor/k8s.io/apiserver/pkg/server/options/audit.go:395 +0x26
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.processAuditEvent({0x59013c0, 0xc05cb67d10}, {0x7f407ed6fd98, 0xc0012a1f40}, 0xc04f459a40, {0xc0012a1ed0?, 0x1, 0x8337e00?})
    vendor/k8s.io/apiserver/pkg/endpoints/filters/audit.go:202 +0x24f
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithAudit.func1.1()
    vendor/k8s.io/apiserver/pkg/endpoints/filters/audit.go:115 +0x41d
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithAudit.func1({0x5900650?, 0xc0ab318160}, 0xc098e8e800)
    vendor/k8s.io/apiserver/pkg/endpoints/filters/audit.go:118 +0x5c5
net/http.HandlerFunc.ServeHTTP(0xa?, {0x5900650?, 0xc0ab318160?}, 0xc0612f1a80?)
    /usr/local/go/src/net/http/server.go:2122 +0x2f

This cluster is running v1.24.16. The panic is occurring on this line while generating the audit log entry: https://github.com/kubernetes/kubernetes/blob/v1.24.16/staging/src/k8s.io/apimachinery/pkg/runtime/serializer/json/json.go#L246
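
This is the Go runtime's unrecoverable map-race check rather than an ordinary panic, so it cannot be caught with recover() in the audit handler. As an illustration only (this is not the apiserver code path; the type and field names are made up), the following minimal Go program triggers the same fatal error by JSON-encoding a struct whose map field is being written from another goroutine:

package main

import (
    "encoding/json"
    "strconv"
)

// event stands in for an audit event whose payload reaches a map
// (for example object labels or annotations); purely illustrative.
type event struct {
    Labels map[string]string
}

func main() {
    e := &event{Labels: map[string]string{"app": "demo"}}

    // Writer goroutine: keeps mutating the map.
    go func() {
        for i := 0; ; i++ {
            e.Labels["k"+strconv.Itoa(i%8)] = "v"
        }
    }()

    // Reader loop: encoding/json iterates the map via reflection
    // (reflect.MapIter), just as in the stack trace above. The runtime
    // soon detects the overlapping iteration and write and aborts with
    // "fatal error: concurrent map iteration and map write".
    for {
        _, _ = json.Marshal(e)
    }
}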

What did you expect to happen?

A panic should not be possible from the audit handler.

How can we reproduce it (as minimally and precisely as possible)?

We do not have a way to reproduce this deterministically. It has occurred during periods when 1000+ pods were deleted and those deletions were intercepted by a mutating webhook that was timing out.

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.16-eks-2d98532", GitCommit:"af930c12e26ef9d1e8fac7e3532ff4bcc1b2b509", GitTreeState:"clean", BuildDate:"2023-07-28T16:56:16Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.16-eks-2d98532", GitCommit:"af930c12e26ef9d1e8fac7e3532ff4bcc1b2b509", GitTreeState:"clean", BuildDate:"2023-07-28T16:52:47Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

AWS

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, …) and versions (if applicable)

About this issue

  • State: open
  • Created 10 months ago
  • Comments: 19 (13 by maintainers)

Most upvoted comments

You should provide the version of the failing kube-apiserver (better still, locate the exact commit).

I reproduced the fault through the following steps:

  • Compiled the v1.24.16 source
  • Started Kubernetes locally with the compiled binaries (configure the audit Policy for kube-apiserver shown below; setting kube-api-burst=2000 and kube-api-qps=1000 on kube-controller-manager makes it reproduce faster)
  • Registered a MutatingWebhookConfiguration (shown below)
  • Ran a webhook server that intentionally sleeps (time.Hour * 4) in its http.HandlerFunc (see the sketch after this list)
  • Created a Deployment with 1,000 replicas (the created Pods can remain Pending)
  • Configured a crontab entry that deletes the Pods by label once per minute
  • Waited for the failure to occur
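
For reference, a minimal sketch of the slow webhook server described in the list above (the /mutate path, listen port, and certificate file names are assumptions, not the reproducer's exact code):

package main

import (
    "net/http"
    "time"
)

func main() {
    mux := http.NewServeMux()

    // Never answer the AdmissionReview: sleep far longer than the
    // webhook's timeoutSeconds so every call from the apiserver ends
    // in "context deadline exceeded".
    mux.HandleFunc("/mutate", func(w http.ResponseWriter, r *http.Request) {
        time.Sleep(4 * time.Hour)
    })

    // Admission webhooks must be served over HTTPS; the certificate and
    // key file names here are placeholders.
    if err := http.ListenAndServeTLS(":8443", "tls.crt", "tls.key", mux); err != nil {
        panic(err)
    }
}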

audit-policy.yaml:

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: RequestResponse
  resources:
    - group: ""
      resources: ["pods"]

webhook.yaml:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: hook
webhooks:
- name: admission.example.io
  sideEffects: None
  admissionReviewVersions: ["v1beta1"]
  failurePolicy: Ignore
  ...
  rules:
  - operations:
    - DELETE
    apiGroups:
    - ""
    apiVersions:
    - "v1"
    resources:
    - "pods"
---

So far I have successfully reproduced this on v1.24.16, but not yet on master.

What is your audit configuration (all the --audit-... flags passed to kube-apiserver)? Are you logging to both an audit log file and an audit webhook?

It has occurred during periods when 1000+ pods were deleted and those deletions were intercepted by a mutating webhook that was timing out.

I’m not sure the mutation part is relevant… mutating webhooks can’t modify objects

I wonder if this is related to https://github.com/kubernetes/kubernetes/issues/82497, and whether the issue is multiple encoding of the same audit event, or multiple encoding of the same object pointed at by the audit event.