trivy-operator: Faulty scan jobs blocking further scans from being executed

What steps did you take and what happened:

Because of the error reported in https://github.com/aquasecurity/trivy-operator/issues/206, scan jobs get stuck. Once the number of stuck jobs reaches OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT, no further scan Pods are spawned, because trivy-operator is still waiting for the stuck ones to finish. As a result, other Pods are no longer scanned.

Example (caused by the error in https://github.com/aquasecurity/trivy-operator/issues/206):

scan-vulnerabilityreport-5759f44647--1-qf7sh   0/1     Completed   0          7m49s
scan-vulnerabilityreport-7d57cffd5f--1-47vds   0/1     Completed   0          2m58s
scan-vulnerabilityreport-849fffd5c7--1-p9fdt   0/1     Completed   0          6m58s
scan-vulnerabilityreport-dc5fb6cf--1-xq5kw     0/1     Completed   0          7m28s
scan-vulnerabilityreport-f49679dcc--1-cvd8x    0/1     Completed   0          118s
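
To make the blocking condition concrete, here is a minimal Python sketch that parses the listing above and compares the number of lingering scan pods with the limit. It is only an illustrative model: the rule that every listed scan pod counts against the limit is an assumption based on the behaviour described in this report, and the limit value of 5 and the helper names (`parse_age`, `scan_pods`, `free_slots`) are hypothetical, not part of trivy-operator.

```python
import re

# Pod listing copied from the report (name, ready, status, restarts, age).
LISTING = """\
scan-vulnerabilityreport-5759f44647--1-qf7sh   0/1     Completed   0          7m49s
scan-vulnerabilityreport-7d57cffd5f--1-47vds   0/1     Completed   0          2m58s
scan-vulnerabilityreport-849fffd5c7--1-p9fdt   0/1     Completed   0          6m58s
scan-vulnerabilityreport-dc5fb6cf--1-xq5kw     0/1     Completed   0          7m28s
scan-vulnerabilityreport-f49679dcc--1-cvd8x    0/1     Completed   0          118s
"""

# Hypothetical value; the report does not say which limit was configured.
CONCURRENT_SCAN_JOBS_LIMIT = 5

SECONDS_PER_UNIT = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def parse_age(age):
    """Convert a kubectl age such as '7m49s' or '118s' into seconds."""
    parts = re.findall(r"(\d+)([dhms])", age)
    return sum(int(number) * SECONDS_PER_UNIT[unit] for number, unit in parts)


def scan_pods(listing):
    """Return (name, status, age_in_seconds) for every scan pod in the listing."""
    pods = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) == 5 and fields[0].startswith("scan-vulnerabilityreport-"):
            name, _ready, status, _restarts, age = fields
            pods.append((name, status, parse_age(age)))
    return pods


def free_slots(listing, limit):
    """Assumed model: each lingering scan pod occupies one slot of the limit."""
    return max(0, limit - len(scan_pods(listing)))


if __name__ == "__main__":
    for name, status, age in scan_pods(LISTING):
        print(f"{name}: {status}, {age}s old")
    print("free slots:", free_slots(LISTING, CONCURRENT_SCAN_JOBS_LIMIT))
```

Under this model, five lingering pods and a limit of 5 leave no free slot, which matches the symptom that no new scans are started.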

What did you expect to happen: Even if jobs get stuck due to an unforeseen error, they should be released after some time so that scanning continues with other repositories/registries. Otherwise, no further scans happen.
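
The expected behaviour could be sketched as a simple age-based release rule. The Python snippet below is a hypothetical illustration, not the operator's implementation: `ScanPod`, `release_stale`, the 300-second timeout and the limit of 5 are all assumptions, and the ages are the ones from the pod listing in the example, converted to seconds.

```python
from dataclasses import dataclass


@dataclass
class ScanPod:
    name: str
    age_seconds: int


def release_stale(pods, timeout_seconds):
    """Split pods into (kept, released): anything older than the timeout is released."""
    kept = [pod for pod in pods if pod.age_seconds <= timeout_seconds]
    released = [pod for pod in pods if pod.age_seconds > timeout_seconds]
    return kept, released


# Ages taken from the example listing (7m49s, 2m58s, 6m58s, 7m28s, 118s).
PODS = [
    ScanPod("scan-vulnerabilityreport-5759f44647--1-qf7sh", 469),
    ScanPod("scan-vulnerabilityreport-7d57cffd5f--1-47vds", 178),
    ScanPod("scan-vulnerabilityreport-849fffd5c7--1-p9fdt", 418),
    ScanPod("scan-vulnerabilityreport-dc5fb6cf--1-xq5kw", 448),
    ScanPod("scan-vulnerabilityreport-f49679dcc--1-cvd8x", 118),
]

SCAN_TIMEOUT_SECONDS = 300  # hypothetical timeout, not an operator default
CONCURRENT_SCAN_JOBS_LIMIT = 5  # hypothetical limit

if __name__ == "__main__":
    kept, released = release_stale(PODS, SCAN_TIMEOUT_SECONDS)
    for pod in released:
        print("would release:", pod.name)
    print("slots free after release:", CONCURRENT_SCAN_JOBS_LIMIT - len(kept))
```

With these assumed values, the three pods older than five minutes would be released, and the freed slots could be used to scan other workloads.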

Anything else you would like to add:

If the Job/Pod is deleted manually, trivy-operator will likely pick up one of the remaining deployments and scanning continues. But when it comes back to the deployment that triggers the error, the Pod gets stuck again. So to get all deployments scanned, you need to set OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT to a high value and frequently delete all hung Jobs/Pods, to give trivy-operator the freedom to spawn new scans.

Environment:

  • Trivy-Operator version: 0.1.0
  • Kubernetes version: 1.22

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 4
  • Comments: 18 (13 by maintainers)

Most upvoted comments

@VF-mbrauer this issue is under investigation, I will update you once we have a solid solution.

I think you mean completed ones in "lead to resource consumption, as the not completed ones will still occupy vCPU and MEM at that time"?