kubernetes: A single resource cache sync failure shouldn't break the entire ResourceQuota controller logic

What happened:

I had a failing conversion webhook for one of the CRDs.
The ResourceQuota controller started logging “timed out waiting for quota monitor sync” errors, and the logic that subtracts released resources from quota usage stopped working: https://github.com/kubernetes/kubernetes/blob/f2ed1b55803177c9b02b0acf134a011a5fb20544/pkg/controller/resourcequota/resource_quota_controller.go#L448

What you expected to happen:

I think a single cache sync failure for one countable resource shouldn’t break the entire ResourceQuota controller logic. Alternatively, there should be a configuration option to exclude complementary resources, such as CRDs, from the countable resource quota calculations, since in many cases people care most about compute/storage quota enforcement.

I also accept that I might be missing something, and I look forward to understanding the reasons behind the current design.
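
To make the expectation concrete, here is a minimal sketch of the behavior I have in mind. It is not the controller's actual code: the `monitor` type and the `partitionBySync` and `canRecalculate` functions are hypothetical names invented for illustration, and `widgets.example.com` is a made-up CRD. The idea is that a quota tracking only pod resources would still be recalculated even though the cache for an unrelated CRD never syncs.

```go
package main

import (
	"fmt"
	"sort"
)

// monitor is a hypothetical stand-in for one per-resource informer cache.
type monitor struct {
	resource  string
	hasSynced func() bool
}

// partitionBySync splits monitors into resources whose caches have synced
// and resources whose caches have not (for example, a CRD whose conversion
// webhook is failing).
func partitionBySync(monitors []monitor) (synced, unsynced []string) {
	for _, m := range monitors {
		if m.hasSynced() {
			synced = append(synced, m.resource)
		} else {
			unsynced = append(unsynced, m.resource)
		}
	}
	sort.Strings(synced)
	sort.Strings(unsynced)
	return synced, unsynced
}

// canRecalculate reports whether a quota can be recalculated: only the
// resources the quota actually tracks need a synced cache.
func canRecalculate(tracked, synced []string) bool {
	ok := make(map[string]bool, len(synced))
	for _, r := range synced {
		ok[r] = true
	}
	for _, r := range tracked {
		if !ok[r] {
			return false
		}
	}
	return true
}

func main() {
	monitors := []monitor{
		{resource: "pods", hasSynced: func() bool { return true }},
		// Hypothetical CRD whose cache never syncs.
		{resource: "widgets.example.com", hasSynced: func() bool { return false }},
	}
	synced, unsynced := partitionBySync(monitors)
	fmt.Println("synced:", synced, "unsynced:", unsynced)
	fmt.Println("pod quota can be recalculated:", canRecalculate([]string{"pods"}, synced))
}
```

With this kind of per-resource check, only quotas that count the broken CRD would be blocked, rather than every quota in the cluster.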

How to reproduce it (as minimally and precisely as possible):

1. Create a ResourceQuota with cpu requests/limits. Create a pod with cpu requests/limits. Watch the ResourceQuota used field being updated.
2. Add a conversion webhook for any CRD, and make this webhook fail. Make sure that a CRD object requiring conversion is present in etcd, so that the conversion is triggered.
3. Delete the pod created in step 1. Watch that the ResourceQuota used field doesn’t get updated. You’ll start seeing “timed out waiting for quota monitor sync” in the controller-manager logs.

Anything else we need to know?:

The same scenario breaks the GC controller as well.

Environment:

Kubernetes version (use kubectl version): v1.18 or higher
Cloud provider or hardware configuration:
OS (e.g: cat /etc/os-release):
Kernel (e.g. uname -a):
Install tools:
Network plugin and version (if this is a network-related bug):
Others:

About this issue

State: closed
Created 4 years ago
Reactions: 8
Comments: 24 (21 by maintainers)

Most upvoted comments

No, that has always been the GC behavior

liggitt on Nov 4, 2020