kubernetes: Hitting 34s timeouts with server-side apply on large custom resource objects

What happened:

Performing a server-side apply on a custom resource (CR) of at least 700KB in size results in a 34-second timeout on Kubernetes 1.19.10+ and 1.20+.

What you expected to happen:

As with creation and deletion, I expected the update to complete in a minimal amount of time. Admittedly, the CR in question has a large status field (which we are trimming down to a reasonable size anyway). However, it is notable that we never hit this timeout on Kubernetes v1.19.9 and below.

How to reproduce it (as minimally and precisely as possible):

  1. Use a Kubernetes 1.19.10 or 1.20.6 cluster
  2. Create a 700KB+ custom resource object, client-side (e.g. kubectl create) or server-side
  3. Update the object, client-side (e.g. kubectl edit) or server-side
  4. Server-side apply will time out after 34 seconds (see the sketch below)
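
Here is a minimal sketch of the reproduction (the group/kind come from the CRD below; the filler field and the way the data is generated are just illustrative ways to get past the ~700KB threshold):

# Generate a Catalog CR whose spec carries ~700KB of filler data.
make_cr() {
  cat > big-catalog.yaml <<EOF
apiVersion: management.cattle.io/v3
kind: Catalog
metadata:
  name: big-catalog
spec:
  filler: "$(head -c 700000 /dev/zero | tr '\0' "$1")"
EOF
}

make_cr x
time kubectl apply --server-side -f big-catalog.yaml   # create: returns quickly

make_cr y
time kubectl apply --server-side -f big-catalog.yaml   # update: times out after ~34s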

Anything else we need to know?:

I’m currently testing server-side apply behavior on CRs with and without their openAPIV3Schema populated. It may be that this issue only affects CRDs that preserve unknown fields (i.e. with x-kubernetes-preserve-unknown-fields set for unpopulated openAPIV3Schema fields), and/or that it only affects CRDs with large status fields:

schema:
  openAPIV3Schema:
    properties:
      spec:
        x-kubernetes-preserve-unknown-fields: true
      status:
        x-kubernetes-preserve-unknown-fields: true
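
For reference, here is how I’ve been checking which case a given CRD falls into (the CRD name comes from the manifest below):

# Does the CRD preserve unknown fields at the top level (v1beta1-style)?
kubectl get crd catalogs.management.cattle.io -o jsonpath='{.spec.preserveUnknownFields}'

# Is an openAPIV3Schema populated for the served version?
kubectl get crd catalogs.management.cattle.io -o jsonpath='{.spec.versions[0].schema}'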

Here is a comment with details of our investigation for rancher/rancher. It may be relevant for further context, but I did not want to inundate maintainers by pasting the whole thing here: https://github.com/rancher/rancher/issues/32419#issuecomment-849000867

I’ve also narrowed down some potential suspects, but have not yet been able to test them.

CRD in question (note that it has no openAPIV3Schema at all and sets preserveUnknownFields: true, so server-side apply presumably falls back to a deduced schema):

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: catalogs.management.cattle.io
spec:
  conversion:
    strategy: None
  group: management.cattle.io
  names:
    kind: Catalog
    listKind: CatalogList
    plural: catalogs
    singular: catalog
  preserveUnknownFields: true
  scope: Cluster
  versions:
  - name: v3
    served: true
    storage: true

Environment:

  • Kubernetes version (use kubectl version): 1.19.10, 1.20.6 (server)
  • Cloud provider or hardware configuration: Digital Ocean Droplet (2 CPU / 8GB RAM)
  • OS (e.g: cat /etc/os-release): Ubuntu 20.04 LTS
  • Kernel (e.g. uname -a): 5.4
  • Install tools: RKE v1.2.8
  • Network plugin and version (if this is a network-related bug): n/a
  • Others: n/a

Thank you in advance for any and all help! I’d be happy to provide more detail, and I hope to track down the code change that caused this as well. This is an ongoing investigation, but I thought I’d file the issue now since I believe we have enough reproduction scenarios to warrant one.

EDIT: I believe wg-api-expression is the best group to assign, based on this: https://github.com/kubernetes-sigs/structured-merge-diff#community-discussion-contribution-and-support

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 16 (9 by maintainers)

Most upvoted comments

@nickgerace there is a mitigation for this issue that works on v1.20+ (but not 1.19): Use x-kubernetes-map-type: atomic, e.g.:

schema:
  openAPIV3Schema:
    properties:
      spec:
        x-kubernetes-preserve-unknown-fields: true
        x-kubernetes-map-type: atomic
      status:
        x-kubernetes-preserve-unknown-fields: true
        x-kubernetes-map-type: atomic
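If I understand the mitigation correctly, atomic makes server-side apply treat each map as a single unit, so managedFields no longer tracks every leaf of the large object (trading away per-field conflict detection within spec and status). A quick sanity check, reusing the illustrative object from the reproduction sketch above:

# With the mitigated CRD in place, the same large update should return quickly:
time kubectl apply --server-side -f big-catalog.yaml

# The tracked field set should also collapse to roughly one entry per atomic map
# (on kubectl v1.21+ this may need --show-managed-fields):
kubectl get catalog big-catalog -o jsonpath='{.metadata.managedFields}' | wc -c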

I’m in favor of backporting this as far as we possibly can. Once the fix is merged and the PR to bump the structured-merge-diff version is open against github.com/kubernetes/kubernetes, I’ll open the cherry-pick requests.

I suspect this is due to ReconcileFieldSetWithSchema being run on all updates. For types with no schema (or that make heavy use of x-kubernetes-preserve-unknown-fields: true), ReconcileFieldSetWithSchema needs to be skipped. It already tries to bail out early for deduced schemas (https://github.com/kubernetes-sigs/structured-merge-diff/blob/ea1021dbc0f242313159d5dd4801ff29304712fe/typed/reconcile_schema.go#L130), but I’m not convinced that’s working right, and I don’t think it covers this case.
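
One way to probe this hypothesis from the outside (a sketch; big-catalog-typed.yaml is a hypothetical variant of the same large object registered under a CRD with a fully populated, structural openAPIV3Schema):

# If ReconcileFieldSetWithSchema is the bottleneck, only the schemaless
# (deduced / preserve-unknown-fields) CRD should hit the 34s timeout:
time kubectl apply --server-side -f big-catalog.yaml          # deduced schema
time kubectl apply --server-side -f big-catalog-typed.yaml    # full schema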

/assign