kubernetes: apiserver crashes on concurrent map iteration and write
What happened?
During periods of webhook failures, it is possible for the apiserver to crash while generating the audit log entry for a request. Specifically, when a large number of pods are deleted and are intercepted by a mutating webhook that is timing out, the following fatal error can be encountered:
E0905 18:47:26.898105      11 dispatcher.go:184] failed calling webhook "webhook.com": failed to call webhook: Post "https://webhook:443/mutate?timeout=12s": context deadline exceeded
fatal error: concurrent map iteration and map write
goroutine 295461 [running]:
reflect.mapiterinit(0x451512?, 0x200?, 0x48da4c0?)
    /usr/local/go/src/runtime/map.go:1375 +0x19
reflect.(*MapIter).Next(0x4528ac0?)
    /usr/local/go/src/reflect/value.go:1919 +0x65
encoding/json.mapEncoder.encode({0xc05cea2cf8?}, 0xc06363ae00, {0x4528ac0?, 0xc0b2bdf170?, 0xc0b2bdf0b0?}, {0xe?, 0x0?})
    /usr/local/go/src/encoding/json/encode.go:797 +0x33e
encoding/json.structEncoder.encode({{{0xc0208ea000?, 0x3dfdbc4?, 0x5?}, 0xc041e42b40?}}, 0xc06363ae00, {0x4e6fa60?, 0xc0b2bdf040?, 0x3dfdbc4?}, {0x0, 0x1})
    /usr/local/go/src/encoding/json/encode.go:759 +0x1f4
encoding/json.ptrEncoder.encode({0x0?}, 0xc06363ae00, {0x4e16cc0?, 0xc0b2bdf040?, 0x4e16cc0?}, {0xb6?, 0x68?})
    /usr/local/go/src/encoding/json/encode.go:943 +0x21c
encoding/json.(*encodeState).reflectValue(0xc05cea2fb0?, {0x4e16cc0?, 0xc0b2bdf040?, 0xc0ab3181a0?}, {0x60?, 0x2c?})
    /usr/local/go/src/encoding/json/encode.go:358 +0x78
encoding/json.(*encodeState).marshal(0x58d8a20?, {0x4e16cc0?, 0xc0b2bdf040?}, {0xb9?, 0xe?})
    /usr/local/go/src/encoding/json/encode.go:330 +0xfa
encoding/json.(*Encoder).Encode(0xc05cea30d0, {0x4e16cc0, 0xc0b2bdf040})
    /usr/local/go/src/encoding/json/stream.go:209 +0xf3
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/runtime/serializer/json.(*Serializer).doEncode(0x9cc5cf?, {0x58e09e0?, 0xc0b2bdf040?}, {0x58c99e0, 0xc0b2c2fc20})
    vendor/k8s.io/apimachinery/pkg/runtime/serializer/json/json.go:246 +0x19a
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/runtime/serializer/json.(*Serializer).Encode(0xc000344c30, {0x58e09e0, 0xc0b2bdf040}, {0x58c99e0, 0xc0b2c2fc20})
    vendor/k8s.io/apimachinery/pkg/runtime/serializer/json/json.go:220 +0xfc
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/runtime/serializer/versioning.(*codec).doEncode(0xc00093dc20, {0x58e0940, 0xc04f459a40}, {0x58c99e0, 0xc0b2c2fc20}, {0x0?, 0x0?})
    vendor/k8s.io/apimachinery/pkg/runtime/serializer/versioning/versioning.go:284 +0xa12
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/runtime/serializer/versioning.(*codec).encode(0xc00093dc20, {0x58e0940, 0xc04f459a40}, {0x58c99e0, 0xc0b2c2fc20}, {0x0?, 0x0?})
    vendor/k8s.io/apimachinery/pkg/runtime/serializer/versioning/versioning.go:214 +0x167
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/runtime/serializer/versioning.(*codec).Encode(0xc05cb67d10?, {0x58e0940?, 0xc04f459a40?}, {0x58c99e0?, 0xc0b2c2fc20?})
    vendor/k8s.io/apimachinery/pkg/runtime/serializer/versioning/versioning.go:207 +0x33
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/runtime.Encode({0x7f407ed689c8, 0xc00093dc20}, {0x58e0940, 0xc04f459a40})
    vendor/k8s.io/apimachinery/pkg/runtime/codec.go:50 +0x64
k8s.io/kubernetes/vendor/k8s.io/apiserver/plugin/pkg/audit/log.(*backend).logEvent(0xc000676750, 0xc04f459a40)
    vendor/k8s.io/apiserver/plugin/pkg/audit/log/backend.go:76 +0x73
k8s.io/kubernetes/vendor/k8s.io/apiserver/plugin/pkg/audit/log.(*backend).ProcessEvents(0x40e167?, {0xc09db7fcf8, 0x1, 0xc0b2bc6d01?})
    vendor/k8s.io/apiserver/plugin/pkg/audit/log/backend.go:65 +0x45
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/options.(*ignoreErrorsBackend).ProcessEvents(0x59013c0?, {0xc09db7fcf8?, 0x40e167?, 0x98?})
    vendor/k8s.io/apiserver/pkg/server/options/audit.go:395 +0x26
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.processAuditEvent({0x59013c0, 0xc05cb67d10}, {0x7f407ed6fd98, 0xc0012a1f40}, 0xc04f459a40, {0xc0012a1ed0?, 0x1, 0x8337e00?})
    vendor/k8s.io/apiserver/pkg/endpoints/filters/audit.go:202 +0x24f
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithAudit.func1.1()
    vendor/k8s.io/apiserver/pkg/endpoints/filters/audit.go:115 +0x41d
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithAudit.func1({0x5900650?, 0xc0ab318160}, 0xc098e8e800)
    vendor/k8s.io/apiserver/pkg/endpoints/filters/audit.go:118 +0x5c5
net/http.HandlerFunc.ServeHTTP(0xa?, {0x5900650?, 0xc0ab318160?}, 0xc0612f1a80?)
    /usr/local/go/src/net/http/server.go:2122 +0x2f
This cluster is running v1.24.16. The panic occurs on this line while generating the audit log entry: https://github.com/kubernetes/kubernetes/blob/v1.24.16/staging/src/k8s.io/apimachinery/pkg/runtime/serializer/json/json.go#L246
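The fatal error is the Go runtime's concurrent-map-access detector: encoding/json iterates a map via reflection (see the mapEncoder.encode frame above) while another goroutine writes to the same map, and the runtime aborts the whole process rather than returning an error. As a minimal, hedged sketch of that failure mode only (this is not the apiserver code; the event type and field names are stand-ins), the pattern looks like this:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// event is a simplified stand-in for a struct that carries a map which one
// goroutine serializes while another goroutine keeps writing to it.
type event struct {
	Annotations map[string]string `json:"annotations"`
}

func main() {
	e := &event{Annotations: map[string]string{"seed": "value"}}

	// Writer goroutine: keeps mutating the map, the way a concurrent
	// annotation writer would.
	go func() {
		for i := 0; ; i++ {
			e.Annotations[fmt.Sprintf("key-%d", i)] = "value"
		}
	}()

	// Encoder loop: json.Marshal iterates the map via reflection. The
	// runtime eventually detects the overlapping iteration and write and
	// aborts with "fatal error: concurrent map iteration and map write",
	// which cannot be caught with recover().
	for {
		_, _ = json.Marshal(e)
	}
}
```

Because this is a runtime fatal error rather than a recoverable panic, no recover() in the handler chain can intercept it, which is why the whole apiserver process dies.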
What did you expect to happen?
A panic should not be possible from the audit handler.
How can we reproduce it (as minimally and precisely as possible)?
We do not have a way to reproduce this deterministically. It has occurred during periods when 1000+ pods were deleted and intercepted by a mutating webhook that was timing out.
Anything else we need to know?
No response
Kubernetes version
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.16-eks-2d98532", GitCommit:"af930c12e26ef9d1e8fac7e3532ff4bcc1b2b509", GitTreeState:"clean", BuildDate:"2023-07-28T16:56:16Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.16-eks-2d98532", GitCommit:"af930c12e26ef9d1e8fac7e3532ff4bcc1b2b509", GitTreeState:"clean", BuildDate:"2023-07-28T16:52:47Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider
OS version
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here
# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, …) and versions (if applicable)
About this issue
 - State: open
 - Created 10 months ago
 - Comments: 19 (13 by maintainers)
 
Commits related to this issue
- bugfix(audit): audit backend concurrent map r/w - #120507 reproduced — committed to sxllwx/kubernetes by sxllwx 10 months ago
 
I successfully reproduced the error in a unit test:
https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/120692/pull-kubernetes-unit/1702599330747125760
I reproduced the fault through the following steps:
audit-policy.yaml:
webhook.yaml
So far I have reproduced this on v1.24.16, but not yet on master.
What is your audit configuration (all the --audit-... flags passed to kube-apiserver)? Are you logging to both an audit log file and an audit webhook?
I’m not sure the mutation part is relevant… mutating webhooks can’t modify objects being deleted.
I wonder if this is related to https://github.com/kubernetes/kubernetes/issues/82497, and whether the issue is multiple encoding of the same audit event, or multiple encoding of the same object pointed at by the audit event.
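If that hypothesis holds, the race is between whatever still writes to the event (for example its Annotations map) and a backend that serializes the same event pointer. One way to rule out this class of race, sketched below purely as an illustration and not as the actual apiserver change, is to hand each asynchronous backend its own copy of the event, so encoding never iterates a map another goroutine can still write to. The Event type and copyEvent helper here are simplified, hypothetical stand-ins.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// Event is a simplified stand-in for an audit event: the only field that
// matters for this sketch is the Annotations map that backends serialize.
type Event struct {
	AuditID     string            `json:"auditID"`
	Annotations map[string]string `json:"annotations,omitempty"`
}

// copyEvent returns a copy of the event with its own Annotations map, so a
// backend encoding the copy never iterates a map the request handler is
// still writing to. (Hypothetical helper for this sketch.)
func copyEvent(e *Event) *Event {
	out := *e
	out.Annotations = make(map[string]string, len(e.Annotations))
	for k, v := range e.Annotations {
		out.Annotations[k] = v
	}
	return &out
}

func main() {
	ev := &Event{AuditID: "abc-123", Annotations: map[string]string{}}

	var mu sync.Mutex // guards ev.Annotations in this sketch

	// Simulated request handler that keeps annotating the event.
	go func() {
		for i := 0; ; i++ {
			mu.Lock()
			ev.Annotations[fmt.Sprintf("key-%d", i)] = "value"
			mu.Unlock()
		}
	}()

	// Simulated asynchronous backend: it encodes a private snapshot taken
	// under the lock instead of the shared pointer, so JSON's map
	// iteration never overlaps with the writer above.
	for i := 0; i < 1000; i++ {
		mu.Lock()
		snapshot := copyEvent(ev)
		mu.Unlock()
		if _, err := json.Marshal(snapshot); err != nil {
			fmt.Println("encode error:", err)
		}
	}
}
```

The same effect could be achieved by deep-copying only the mutable fields; the important property is that nothing encoded asynchronously shares a map with a live writer.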