kubernetes: Hitting 34s timeouts with server-side apply on large custom resource objects

What happened:

Performing a server-side apply on a custom resource (CR) of at least 700KB in size results in a 34-second timeout on Kubernetes 1.19.10+ and 1.20+.

What you expected to happen:

As with creation and deletion, I expected the update to complete in a minimal amount of time. Admittedly, the CR in question has a large status field (which we are trimming down to a reasonable size anyway). However, it is notable that we never hit this timeout on Kubernetes v1.19.9 and below.

How to reproduce it (as minimally and precisely as possible):

  1. Use a Kubernetes 1.19.10 or 1.20.6 cluster
  2. Create a 700KB+ custom resource object, client-side (e.g. kubectl create) or server-side
  3. Update the object, client-side (e.g. kubectl edit) or server-side
  4. Server-side apply will time out after 34 seconds (see the sketch below)
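
Here is a minimal sketch of the reproduction (the group/kind come from the CRD below; the filler field and the way the data is generated are just illustrative ways to get past the ~700KB threshold):

# Generate a Catalog CR whose spec carries ~700KB of filler data.
make_cr() {
  cat > big-catalog.yaml <<EOF
apiVersion: management.cattle.io/v3
kind: Catalog
metadata:
  name: big-catalog
spec:
  filler: "$(head -c 700000 /dev/zero | tr '\0' "$1")"
EOF
}

make_cr x
time kubectl apply --server-side -f big-catalog.yaml   # create: returns quickly

make_cr y
time kubectl apply --server-side -f big-catalog.yaml   # update: times out after ~34s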

Anything else we need to know?:

I’m currently testing server-side apply behavior on CRs with and without their openAPIV3Schema populated. It may be that this issue only affects CRDs that preserve unknown fields (i.e. with x-kubernetes-preserve-unknown-fields set for unpopulated openAPIV3Schema fields), and/or that it only affects CRDs with large status fields:

schema:
  openAPIV3Schema:
    properties:
      spec:
        x-kubernetes-preserve-unknown-fields: true
      status:
        x-kubernetes-preserve-unknown-fields: true
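
For reference, here is how I’ve been checking which case a given CRD falls into (the CRD name comes from the manifest below):

# Does the CRD preserve unknown fields at the top level (v1beta1-style)?
kubectl get crd catalogs.management.cattle.io -o jsonpath='{.spec.preserveUnknownFields}'

# Is an openAPIV3Schema populated for the served version?
kubectl get crd catalogs.management.cattle.io -o jsonpath='{.spec.versions[0].schema}'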

Here is a comment with details of our investigation for rancher/rancher. It may be relevant for further context, but I did not want to inundate maintainers by pasting the whole thing here: https://github.com/rancher/rancher/issues/32419#issuecomment-849000867

I’ve also narrowed down some potential suspects, but have not yet been able to test them.

CRD in question (note that it has no openAPIV3Schema at all and sets preserveUnknownFields: true, so server-side apply presumably falls back to a deduced schema):

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: catalogs.management.cattle.io
spec:
  conversion:
    strategy: None
  group: management.cattle.io
  names:
    kind: Catalog
    listKind: CatalogList
    plural: catalogs
    singular: catalog
  preserveUnknownFields: true
  scope: Cluster
  versions:
  - name: v3
    served: true
    storage: true

Environment:

  • Kubernetes version (use kubectl version): 1.19.10, 1.20.6 (server)
  • Cloud provider or hardware configuration: Digital Ocean Droplet (2 CPU / 8GB RAM)
  • OS (e.g: cat /etc/os-release): Ubuntu 20.04 LTS
  • Kernel (e.g. uname -a): 5.4
  • Install tools: RKE v1.2.8
  • Network plugin and version (if this is a network-related bug): n/a
  • Others: n/a

Thank you in advance for any and all help! I’d be happy to provide more detail, and I hope to track down the code change that caused this as well. This is an ongoing investigation, but I thought I’d file the issue now since I believe we have enough reproduction scenarios to warrant one.

EDIT: I believe wg-api-expression is the best group to assign, based on this: https://github.com/kubernetes-sigs/structured-merge-diff#community-discussion-contribution-and-support

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 16 (9 by maintainers)

Most upvoted comments

@nickgerace there is a mitigation for this issue that works on v1.20+ (but not 1.19): Use x-kubernetes-map-type: atomic, e.g.:

schema:
  openAPIV3Schema:
    properties:
      spec:
        x-kubernetes-preserve-unknown-fields: true
        x-kubernetes-map-type: atomic
      status:
        x-kubernetes-preserve-unknown-fields: true
        x-kubernetes-map-type: atomic
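If I understand the mitigation correctly, atomic makes server-side apply treat each map as a single unit, so managedFields no longer tracks every leaf of the large object (trading away per-field conflict detection within spec and status). A quick sanity check, reusing the illustrative object from the reproduction sketch above:

# With the mitigated CRD in place, the same large update should return quickly:
time kubectl apply --server-side -f big-catalog.yaml

# The tracked field set should also collapse to roughly one entry per atomic map
# (on kubectl v1.21+ this may need --show-managed-fields):
kubectl get catalog big-catalog -o jsonpath='{.metadata.managedFields}' | wc -c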

I’m in favor of backporting this as far as we possibly can. Once the fix is merged and the PR to bump the structured-merge-diff version is open against github.com/kubernetes/kubernetes, I’ll open the cherry-pick requests.

I suspect this is due to ReconcileFieldSetWithSchema being run on all updates. For types with no schema (or that make heavy use of x-kubernetes-preserve-unknown-fields: true), ReconcileFieldSetWithSchema needs to be skipped. It already tries to bail out early for deduced schemas (https://github.com/kubernetes-sigs/structured-merge-diff/blob/ea1021dbc0f242313159d5dd4801ff29304712fe/typed/reconcile_schema.go#L130), but I’m not convinced that’s working right, and I don’t think it covers this case.
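
One way to probe this hypothesis from the outside (a sketch; big-catalog-typed.yaml is a hypothetical variant of the same large object registered under a CRD with a fully populated, structural openAPIV3Schema):

# If ReconcileFieldSetWithSchema is the bottleneck, only the schemaless
# (deduced / preserve-unknown-fields) CRD should hit the 34s timeout:
time kubectl apply --server-side -f big-catalog.yaml          # deduced schema
time kubectl apply --server-side -f big-catalog-typed.yaml    # full schema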

/assign