kyverno: [Bug] admission reports piled up, causing etcd to turn into read-only mode

Kyverno Version

1.10.3

Description

Follow-up issue report from a Slack discussion.


EKS cluster running Kubernetes 1.24

# HELP apiserver_storage_objects [STABLE] Number of stored objects at the time of last check split by kind.
# TYPE apiserver_storage_objects gauge
apiserver_storage_objects{resource="admissionreports.kyverno.io"} 1.601408e+06
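For reference, the gauge value above is in scientific notation; a small sketch converting it to a plain integer count, using the exact metric line from this issue (no cluster required):

```shell
# Metric line as scraped from the apiserver (copied from this issue).
line='apiserver_storage_objects{resource="admissionreports.kyverno.io"} 1.601408e+06'

# The gauge value is the second whitespace-separated field, in
# scientific notation; printf's %.0f converts it to a plain integer.
value=$(echo "$line" | awk '{print $2}')
printf '%.0f\n' "$value"
```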

Millions of Kyverno admission reports have piled up since June 2023, and they occupied most of the space in the etcd db. The db breached the upstream recommended maximum size quota (8 GB), which turned etcd into read-only mode.
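For context on the read-only transition: once the backend size passes the quota, etcd raises a NOSPACE alarm and rejects writes until space is reclaimed and the alarm is disarmed. A rough sketch of the quota check using the sizes from this report (both numbers are approximations):

```shell
# ~9.5 GB of data, dominated by admission reports (per the key-group
# breakdown from this issue).
db_size=9500000000
# Upstream-recommended maximum backend quota of 8 GB.
quota=8000000000

if [ "$db_size" -gt "$quota" ]; then
  # In this state etcd sets the NOSPACE alarm and serves reads only.
  echo "over quota: etcd is read-only until space is reclaimed"
fi
```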

Entries by 'Kind' (total 9.5 GB):
+--------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+--------+
|                                                                       KEY GROUP                                                                        |              KIND               |  SIZE  |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+--------+
| /registry/kyverno.io/admissionreports/monitoring,/registry/kyverno.io/admissionreports/monitoring,/registry/kyverno.io/admissionreports/monitoring,/re | AdmissionReport                 | 9.4 GB |

kyverno-app-controller-pod-spec.yaml is the pod spec from when the db filled up, though I am not sure whether the user has ever upgraded the controller since June 2023. The Kyverno version 1.10.3 is taken from the image ghcr.io/kyverno/kyverno:v1.10.3 in this spec.

kyverno-admission-report-sample.json is one of the sample admission report custom resources.

Please let me know if the Kyverno community wants more information, such as apiserver audit logs or additional admission report samples.

Slack discussion

https://kubernetes.slack.com/archives/CLGR9BJU9/p1700252421515759

Troubleshooting

  • I have read and followed the documentation AND the troubleshooting guide.
  • I have searched other issues in this repository and mine is not recorded.

About this issue

  • Original URL
  • State: closed
  • Created 7 months ago
  • Comments: 17 (10 by maintainers)

Most upvoted comments

All of the cleanup-reports pods were either OOMKilled or ended in Error.

Great, now we know why the admission reports piled up.

@KhaledEmaraDev - can we perform load testing against Kyverno 1.10.x and capture the cronjob's resource usage under various loads?
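As a side note for anyone reproducing the diagnosis above: the OOMKilled reason shows up in the pod's last terminated container state. A minimal sketch parsing a sample status (the field names follow the Kubernetes pod status schema; the JSON itself is fabricated):

```shell
# Abridged containerStatuses entry, as `kubectl get pod -o json` returns it.
status='{"lastState":{"terminated":{"exitCode":137,"reason":"OOMKilled"}}}'

# Pull the termination reason out of the JSON (jq-free for portability).
reason=$(echo "$status" | grep -o '"reason":"[A-Za-z]*"' | cut -d'"' -f4)
echo "$reason"
```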

The deletion script in the cronjob spec is also not as efficient as the approach described in the EKS blog post on managing etcd db size, under the "How to reclaim etcd database space?" section.
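For reference, the reclaim procedure from that blog post (and the etcd maintenance docs) is: compact old revisions, defragment, then disarm the NOSPACE alarm. The sketch below only parses a sample endpoint-status response and echoes the commands as a dry run, so it runs without a live etcd; against a real cluster you would execute them directly:

```shell
# Sample (fabricated) `etcdctl endpoint status -w json` output.
status='[{"Endpoint":"127.0.0.1:2379","Status":{"header":{"revision":123456}}}]'

# Extract the current revision to compact up to.
rev=$(echo "$status" | grep -o '"revision":[0-9][0-9]*' | grep -o '[0-9][0-9]*' | head -1)

# The actual reclaim steps (echoed here as a dry run):
echo "etcdctl compact $rev"     # compact away superseded revisions
echo "etcdctl defrag"           # release free space back to the OS
echo "etcdctl alarm disarm"     # clear NOSPACE so writes resume
```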

Thanks for the pointer. In Kyverno 1.10.x there are "aggregate" and "non-aggregate" admission reports. Stale non-aggregate admission reports are cleaned up using the label, as you can see here. In 1.11.x, admission reports were changed to short-lived resources that are garbage collected right after their aggregation.

We are continuously working on optimizing the reporting system. As Jim mentioned above, we are working towards leveraging API aggregation to support alternate storage backends for reports in the Kyverno 1.12 release, see https://github.com/kyverno/KDP/pull/51.

Thanks @realshuting for the pointer. Would you mind giving an example cronjob name so I can try a matching cronjob key in etcd?

@chaochn47 - you can search for the cronjob "kyverno-cleanup-admission-reports" in the namespace where Kyverno was deployed.
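If it helps with the etcd lookup: keys for namespaced built-in resources follow the pattern `/registry/<resource>/<namespace>/<name>`, so (assuming Kyverno is installed in the `kyverno` namespace, which is an assumption on my part) the key would be built like this:

```shell
# Build the etcd key for the cleanup cronjob; the "kyverno" namespace
# is an assumption -- substitute the namespace Kyverno is deployed in.
ns=kyverno
name=kyverno-cleanup-admission-reports
echo "/registry/cronjobs/${ns}/${name}"
```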