kubernetes: [Flaky test] Advanced Audit tests are flaky in 1.14-blocking

Which jobs are flaking: https://testgrid.k8s.io/sig-release-1.14-blocking#gce-cos-1.14-default

Which test(s) are flaking:

  • [sig-auth] Advanced Audit [DisabledForLargeClusters] should audit API calls to create, get, update, patch, delete, list, watch secrets.
  • [sig-auth] Advanced Audit [DisabledForLargeClusters] should audit API calls to create, get, update, patch, delete, list, watch configmaps.

Reasons for flaking: Mostly timeouts; see, for example, runs 9538, 9543, and 9544.

Since when has it been flaking: Since 2/20.

Testgrid link: https://testgrid.k8s.io/sig-release-1.14-blocking#gce-cos-1.14-default

Anything else we need to know: The latest failure logs can be found at https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-cos-k8sbeta-default/9544.

cc @mortent @kacole2 @mariantalla @alejandrox1

/kind flaky-test
/priority important-soon
/sig auth

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 19 (18 by maintainers)

Most upvoted comments

The methodology the test uses to scan for audit events is inherently flaky in the face of server-side log rotation.

Spoke with @tallclair about this, and I think we should do the following:

  • now: mark the existing e2e tests as [flaky] while we resolve issues
  • short-term: take detailed specific-request-to-specific-audit-event tests and make sure we have integration tests covering that
  • medium-term: change the e2e tests to a general “audit for resource X is enabled” check, make them much more robust by moving away from trying to observe an audit event for a particular request, and remove the flaky marking
  • long-term: change e2e tests to use dynamic audit and point at a test-specific audit sink
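To illustrate the first step above: once a test name carries the [Flaky] tag, CI jobs can exclude it via the e2e runner's Ginkgo skip regex (the standard mechanism for filtering tests by tag in the Kubernetes e2e suite). A minimal sketch of how that filtering behaves; the test names below are illustrative, not the exact suite output:

```shell
# Simulate a list of e2e test names and filter out [Flaky]-tagged ones,
# the same way --ginkgo.skip='\[Flaky\]' would when running the suite.
printf '%s\n' \
  '[sig-auth] Advanced Audit [DisabledForLargeClusters] [Flaky] should audit secrets' \
  '[sig-auth] ServiceAccounts should mount an API token into pods' \
  | grep -vE '\[Flaky\]'
```

Only the untagged test survives the filter, so CI keeps signal from stable tests while the audit tests are being reworked.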

Once DynamicAudit goes to beta, we should be able to eliminate the reliance on the log files by using a webhook to verify the audit stream instead. Thanks for investigating @pbarker !
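For reference, dynamic audit (alpha at the time, under the auditregistration.k8s.io/v1alpha1 API) lets a test register its own audit sink instead of scraping server-side log files. A minimal AuditSink sketch; the name and webhook URL are placeholders, not values from this issue:

```yaml
# Hypothetical test-specific audit sink: the apiserver POSTs matching
# audit events to the webhook, so no log-file scanning is needed.
apiVersion: auditregistration.k8s.io/v1alpha1
kind: AuditSink
metadata:
  name: e2e-test-sink        # placeholder name
spec:
  policy:
    level: Metadata          # record metadata for all requests
    stages:
    - ResponseComplete
  webhook:
    throttle:
      qps: 10
      burst: 15
    clientConfig:
      url: "https://audit-sink.example.com/events"   # placeholder endpoint
```

With a sink like this, the e2e test can assert directly on the event stream its webhook receives, which avoids the log-rotation race entirely.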