kubernetes: Hitting 34s timeouts with server-side apply on large custom resource objects
What happened:
Performing server-side apply on a CR of at least 700KB in size results in a 34-second timeout on Kubernetes 1.19.10+ and Kubernetes 1.20+.
What you expected to happen:
I expected the update event to complete quickly, as creation and deletion do. Admittedly, the CR in question has a large status field (which we are trimming down to a reasonable size anyway). However, it is interesting that we never hit this timeout on Kubernetes v1.19.9 and below.
How to reproduce it (as minimally and precisely as possible):
- Use a Kubernetes 1.19.10 or 1.20.6 cluster
- Create a 700KB+ sized custom resource object, client-side (e.g. `kubectl create`) or server-side
- Update the object, client-side (e.g. `kubectl edit`) or server-side
- Server-side apply will time out after 34s (a sketch of such an object follows)
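For concreteness, here is a minimal sketch of an object that should trigger this, assuming the Catalog CRD shown later in this issue; the field under status is made up, since that CRD preserves unknown fields:

```yaml
# big-catalog.yaml: illustrative repro object (the status field name is arbitrary;
# the catalogs.management.cattle.io CRD below preserves unknown fields).
#
#   kubectl create -f big-catalog.yaml                 # fast
#   kubectl apply --server-side -f big-catalog.yaml    # times out after ~34s
apiVersion: management.cattle.io/v3
kind: Catalog
metadata:
  name: big-catalog
status:
  blob: "<roughly 700KB of data>"
```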
Anything else we need to know?:
I’m currently testing server-side apply behavior on CRs with and without their openAPIV3Schema populated. It is possible that this issue only affects CRDs that preserve unknown fields for unpopulated openAPIV3Schema fields, and/or only CRDs with large status fields:
```yaml
schema:
  openAPIV3Schema:
    properties:
      spec:
        x-kubernetes-preserve-unknown-fields: true
      status:
        x-kubernetes-preserve-unknown-fields: true
```
Here is a comment with details of our investigation for rancher/rancher. It may provide further context, but I did not want to inundate maintainers by pasting it here in full: https://github.com/rancher/rancher/issues/32419#issuecomment-849000867
I’ve also narrowed down some potential suspects, but have not yet been able to test them:
- Structured Merge Diff v4.0.2: Possible (https://github.com/kubernetes-sigs/structured-merge-diff/compare/v4.0.1..v4.0.2)
- Lease Manager Changes in apiserver: Doubtful (https://github.com/kubernetes/kubernetes/commit/3820a5d819582d4f6b4652757a3cc6c1a04692b4, https://github.com/kubernetes/kubernetes/commit/537b8d3c06e869a6c00e9b645607ab03a09c9adb, https://github.com/kubernetes/kubernetes/commit/42a3d75bf87a59c402ac72dd369ffe898cf8eb7b)
- Protobuf v1.3.2: Doubtful (https://github.com/kubernetes/kubernetes/commit/15596cedd26c3afacf719a92c79c28e51051a959)
CRD in question:
- Source: https://github.com/rancher/rancher/blob/v2.5.8/pkg/apis/management.cattle.io/v3/catalog_types.go
- In cluster:
```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: catalogs.management.cattle.io
spec:
  conversion:
    strategy: None
  group: management.cattle.io
  names:
    kind: Catalog
    listKind: CatalogList
    plural: catalogs
    singular: catalog
  preserveUnknownFields: true
  scope: Cluster
  versions:
  - name: v3
    served: true
    storage: true
```
Environment:
- Kubernetes version (use `kubectl version`): 1.19.10, 1.20.6 (server)
- Cloud provider or hardware configuration: Digital Ocean Droplet (2 CPU / 8GB RAM)
- OS (e.g. `cat /etc/os-release`): Ubuntu 20.04 LTS
- Kernel (e.g. `uname -a`): 5.4
- Install tools: RKE v1.2.8
- Network plugin and version (if this is a network-related bug): n/a
- Others: n/a
Thank you in advance for any and all help! I’d be happy to provide more detail, and I hope to pinpoint the code change that caused this as well. This is an ongoing investigation, but I thought I’d file the issue now since I believe we have enough reproduction scenarios to warrant doing so.
EDIT: I believe wg-api-expression was the best owner to assign based on this: https://github.com/kubernetes-sigs/structured-merge-diff#community-discussion-contribution-and-support
About this issue
- State: closed
- Created 3 years ago
- Comments: 16 (9 by maintainers)
Commits related to this issue
- Bump SMD to v4.1.2 to pick up #102749 fix — committed to jpbetz/kubernetes by jpbetz 3 years ago
- Merge pull request #103318 from jpbetz/fix-102749 Bump SMD to v4.1.2 to pick up #102749 fix — committed to kubernetes/kubernetes by k8s-ci-robot 3 years ago
- Merge pull request #103321 from jpbetz/fix-102749-1.19 Manual cherry pick of #103318: Bump SMD to v4.1.2 to pick up #102749 fix — committed to kubernetes/kubernetes by k8s-ci-robot 3 years ago
- Merge pull request #103320 from jpbetz/fix-102749-1.20 Manual cherry pick of #103318: Bump SMD to v4.1.2 to pick up #102749 fix — committed to kubernetes/kubernetes by k8s-ci-robot 3 years ago
- Merge pull request #103319 from jpbetz/fix-102749-1.21 Manual cherry pick of #103318: Bump SMD to v4.1.2 to pick up #102749 fix — committed to kubernetes/kubernetes by k8s-ci-robot 3 years ago
I’ve opened PRs to fix and backport this:
- main branch: https://github.com/kubernetes/kubernetes/pull/103318
- 1.21: https://github.com/kubernetes/kubernetes/pull/103319
- 1.20: https://github.com/kubernetes/kubernetes/pull/103320
- 1.19: https://github.com/kubernetes/kubernetes/pull/103321
@nickgerace there is a mitigation for this issue that works on v1.20+ (but not 1.19): Use x-kubernetes-map-type: atomic, e.g.:
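The inline example from the original comment was not captured in this copy; below is a minimal sketch of what the mitigation might look like for the large status field, assuming a structural schema (the exact layout is illustrative):

```yaml
schema:
  openAPIV3Schema:
    type: object
    properties:
      status:
        type: object
        x-kubernetes-preserve-unknown-fields: true
        # Treat the whole status object as a single unit for server-side apply,
        # so structured-merge-diff no longer tracks its fields individually.
        x-kubernetes-map-type: atomic
```

Marking the field atomic means a single field manager owns status as a whole, which sidesteps the per-field tracking and reconciliation that appears to be the bottleneck here.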
I’m in favor of backporting this as far back as we possibly can. Once the fix is merged and the PR to bump the structured-merge-diff version is open against github.com/kubernetes/kubernetes, I’ll open the cherry-pick requests.
I suspect this is due to ReconcileFieldSetWithSchema being run on all updates. For types with no schema (or that make heavy use of x-kubernetes-preserve-unknown-fields: true), ReconcileFieldSetWithSchema needs to be skipped. It already tries to bail out early for deduced schemas (https://github.com/kubernetes-sigs/structured-merge-diff/blob/ea1021dbc0f242313159d5dd4801ff29304712fe/typed/reconcile_schema.go#L130), but I’m not convinced that’s working right, and I don’t think it covers this case.
/assign