kubernetes: A single resource cache sync failure shouldn't break the entire ResourceQuota controller logic
What happened:
- A conversion webhook for one of the CRDs in the cluster was failing.
- The ResourceQuota controller started logging "timed out waiting for quota monitor sync" errors, and the quota usage subtraction logic stopped working: https://github.com/kubernetes/kubernetes/blob/f2ed1b55803177c9b02b0acf134a011a5fb20544/pkg/controller/resourcequota/resource_quota_controller.go#L448
What you expected to happen:
I think a single cache sync failure for one countable resource shouldn't break the entire ResourceQuota controller logic. Alternatively, there should be a way to exclude complementary resources such as CRDs from the countable resource quota calculations, since in many cases people care most about compute/storage quota enforcement.
I also accept that I might be missing something, and I'm looking forward to understanding the reasons behind the current design.
How to reproduce it (as minimally and precisely as possible):
1. Create a ResourceQuota with CPU requests/limits and a Pod with CPU requests/limits (example manifests follow this list). Watch the ResourceQuota's used field being updated.
2. Add a conversion webhook for any CRD and make this webhook fail (a sketch of such a CRD is shown below). Make sure a custom resource object requiring conversion is present in etcd so that the conversion is actually triggered.
3. Delete the pod created in step 1. Watch that the ResourceQuota's used field no longer gets updated. You'll start seeing "timed out waiting for quota monitor sync" in the controller-manager logs.
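A minimal sketch of step 1, assuming the default namespace; the object names, the image, and the CPU values are illustrative:

```yaml
# Step 1 (illustrative): a ResourceQuota limiting CPU requests/limits and a
# Pod that consumes part of that quota. Apply both, then run
# `kubectl describe resourcequota cpu-quota` and watch the "Used" column change.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: cpu-quota
  namespace: default
spec:
  hard:
    requests.cpu: "2"
    limits.cpu: "4"
---
apiVersion: v1
kind: Pod
metadata:
  name: quota-test-pod
  namespace: default
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:
      requests:
        cpu: 500m
      limits:
        cpu: "1"
```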
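For step 2, a sketch of a CRD whose conversion webhook always fails because the referenced Service does not exist; the group, kind, and service names are made up for illustration. One way to satisfy the "object requiring conversion" precondition is to create a Widget while v1 is still the only (storage) version, and only then add v2 and the broken webhook:

```yaml
# Step 2 (illustrative): a CRD with two versions whose conversion webhook
# points at a Service that does not exist, so every conversion request fails.
# At least one custom resource stored in the non-storage version must already
# exist in etcd so that listing it forces a (failing) conversion.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: widgets
    singular: widget
    kind: Widget
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions: ["v1"]
      clientConfig:
        service:
          namespace: default
          name: broken-conversion-webhook  # intentionally nonexistent
          path: /convert
  versions:
  - name: v1
    served: true
    storage: false
    schema:
      openAPIV3Schema:
        type: object
        x-kubernetes-preserve-unknown-fields: true
  - name: v2
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        x-kubernetes-preserve-unknown-fields: true
```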
Anything else we need to know?:
The same scenario breaks the garbage collection (GC) controller as well.
Environment:
- Kubernetes version (use `kubectl version`): v1.18 or higher
- Cloud provider or hardware configuration:
- OS (e.g: `cat /etc/os-release`):
- Kernel (e.g. `uname -a`):
- Install tools:
- Network plugin and version (if this is a network-related bug):
- Others:
About this issue
- State: closed
- Created 4 years ago
- Reactions: 8
- Comments: 24 (21 by maintainers)
No, that has always been the GC behavior