trivy-operator: Missing vulnerability reports

What steps did you take and what happened:

Since installing version 0.4.0 of Trivy-Operator, I have kept running into situations where vulnerability reports are not created for some resources. After some time, they started to be created again. Now that I have upgraded to the latest version, 0.7.1, I still have this problem, and it occurs for even more resources. At this point I can’t really trust Trivy to scan everything in my cluster; whether a resource gets scanned is down to luck.

Previously, on version 0.6.0, I was happy when I finally saw most of the reports generated. Then, over the weekend, more than 60% of the reports disappeared and none were recreated. No restart fixed it.

Today I upgraded to 0.7.1 and am using the trivy-server approach. It is better, but more than 30% of the reports are still missing.

I do see this error in the operator logs, though (since upgrading to 0.6.0):

{"level":"error","ts":1669370944.5371487,"msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"scan-vulnerabilityreport-7cc6d7794b","namespace":"trivy"},"namespace":"trivy","name":"scan-vulnerabilityreport-7cc6d7794b","reconcileID":"0976a9b3-47b9-4e98-9347-5ecbae54e1e5","error":"illegal base64 data at input byte 0; unexpected EOF","errorCauses":[{"error":"illegal base64 data at input byte 0"},{"error":"unexpected EOF"}],"stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:326\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:234"}

Note that I have already disabled scanJobCompressLogs (by setting the value to false).
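
For context on the "illegal base64 data at input byte 0" message in the log above: it is the text Go's standard encoding/base64 package produces when the very first character of the input is outside the base64 alphabet. The sketch below reproduces it under the assumption that raw scan output (for example plain JSON) reaches a base64 decoder; decodeLogs is a hypothetical helper for illustration, not the operator's actual code.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// decodeLogs is a hypothetical helper (an assumption for illustration):
// it treats the raw log output of a scan job as base64 text.
func decodeLogs(raw string) ([]byte, error) {
	return base64.StdEncoding.DecodeString(raw)
}

func main() {
	// Plain JSON is not base64: '{' is outside the base64 alphabet,
	// so decoding fails on the very first byte.
	_, err := decodeLogs(`{"SchemaVersion": 2}`)
	fmt.Println(err) // illegal base64 data at input byte 0

	// The same payload decodes cleanly once it has been base64-encoded.
	enc := base64.StdEncoding.EncodeToString([]byte(`{"SchemaVersion": 2}`))
	out, err := decodeLogs(enc)
	fmt.Println(string(out), err) // {"SchemaVersion": 2} <nil>
}
```

If this assumption holds, the error points at a mismatch between how a scan job wrote its logs and how the operator tried to read them, rather than at the scanned image itself.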

What did you expect to happen:

All resources to be correctly identified and scanned, and vulnerability reports to be generated. I also expect the operator to log a warning or error if it cannot start a scan for a particular resource.

Anything else you would like to add:


Environment:

  • Trivy-Operator version: 0.7.1

  • Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T18:03:20Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.8", GitCommit:"83d00b7cbf10e530d1d4b2403f22413220c37621", GitTreeState:"clean", BuildDate:"2022-11-09T19:50:11Z", GoVersion:"go1.17.11", Compiler:"gc", Platform:"linux/amd64"}

  • OS (macOS 10.15, Windows 10, Ubuntu 19.10 etc): Ubuntu 18.04

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 32 (1 by maintainers)

Most upvoted comments

It’s mostly cronjobs but one (from the above example) is a deployment:

ccccccc-prod     scan-vulnerabilityreport-6bcbc4f884-kgqwx   0/2     Completed   0          16h   app.kubernetes.io/managed-by=trivy-operator,controller-uid=1db7901e-8c90-4b09-b41e-0dd1bd22adc9,job-name=scan-vulnerabilityreport-6bcbc4f884,resource-spec-hash=558684f6cf,topic=ccccccc,trivy-operator.resource.kind=ReplicaSet,trivy-operator.resource.name=redacted,trivy-operator.resource.namespace=ccccccc-prod,vulnerabilityReport.scanner=Trivy

(trivy-operator.resource.kind=ReplicaSet)

My Kubernetes version is v1.25.3.

And yes, the trivy-operator.container-images annotation of the scan job references a container from a different namespace 😦

k get jobs -l "trivy-operator.resource.name" -o=custom-columns='name:metadata.name,resname:metadata.annotations.trivy-operator\.container-images'

The situation happens more often when there are parsing issues (illegal base64 data), but this can be mitigated with trivyOperator.scanJobCompressLogs=false

@pschulten have a look at #509 and see if your completed, undeleted jobs have the mismatched labels/annotations described there. Also, which version of k8s are you on? There is a bug in certain versions of Kubernetes 1.24 (and maybe 1.23; fixed in, I think, 1.24.4 or .5) related to cleanup of errored jobs. Combined, those seem to cause trivy-operator to reach its concurrent scan job limit and stop processing new workloads.

You mean trivy-operator releases? It’s happening with different releases. I tried 0.7.0 and 0.9.1.

I see, thanks. I assume that scanning is halted when the number of scan jobs reaches the maxConcurrent.
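
The stall discussed in this thread can be sketched as a simple counting check. This is a minimal illustration, assuming (as the comments here suggest, not as confirmed operator behaviour) that finished scan jobs which were never cleaned up still count against the concurrent scan job limit. The Job type and both functions are hypothetical; only the app.kubernetes.io/managed-by=trivy-operator label is taken from the pod listing quoted earlier.

```go
package main

import "fmt"

// Job is a simplified, hypothetical stand-in for a Kubernetes Job object.
type Job struct {
	Name   string
	Labels map[string]string
	Done   bool // true once the job has completed or failed
}

// managedByLabel is the label visible on the scan pod quoted in this thread.
const managedByLabel = "app.kubernetes.io/managed-by"

// countScanJobs counts every job labelled as managed by trivy-operator,
// including finished ones that were never cleaned up.
func countScanJobs(jobs []Job) int {
	n := 0
	for _, j := range jobs {
		if j.Labels[managedByLabel] == "trivy-operator" {
			n++
		}
	}
	return n
}

// canStartScan reports whether one more scan job fits under the limit.
// ASSUMPTION: leftover finished jobs count against the limit.
func canStartScan(jobs []Job, limit int) bool {
	return countScanJobs(jobs) < limit
}

func main() {
	scan := map[string]string{managedByLabel: "trivy-operator"}
	jobs := []Job{
		{Name: "scan-vulnerabilityreport-a", Labels: scan, Done: true},
		{Name: "scan-vulnerabilityreport-b", Labels: scan, Done: true},
		{Name: "nightly-backup", Labels: map[string]string{}, Done: false},
	}
	fmt.Println(countScanJobs(jobs))   // 2
	fmt.Println(canStartScan(jobs, 2)) // false: leftovers fill the limit
	fmt.Println(canStartScan(jobs, 3)) // true
}
```

Under that assumption, two leftover jobs are enough to block all new scans at a limit of two, which would match the symptom of reports silently not being created.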

I will check this error, "error":"getting logs for pod xyz: container xyz is not valid for pod scan-vulnerabilityreport-xyz", to see if we can work around it and avoid deletion and restart.

@chen-keinan unfortunately, when compression is disabled, the reports might be too big and other errors occur. I think it is best to close this issue, as most of the reports are now in and missing reports occur only occasionally. Also, I now know better how the operator works and how to work around different issues.

I look forward to more improvements in the future and will be closely following new features and fixes.

Also, thank you for your patience and the work you do for this project.