trivy-operator: Missing vulnerability reports
What steps did you take and what happened:
Since installing Trivy-Operator 0.4.0, I have repeatedly run into situations where vulnerability reports are not created for some resources. After some time they would get created again. Now that I have upgraded to the latest version, 0.7.1, the problem persists and affects even more resources. At this point I can no longer trust that Trivy scans everything in my cluster; whether a resource gets scanned feels like a matter of luck.
Previously, on version 0.6.0, I was happy when I finally saw most of the reports generated, but then, over the weekend, more than 60% of the reports were gone and none got recreated. No restart fixed it.
Today I upgraded to 0.7.1 and switched to the trivy-server approach. It is better, but more than 30% of the reports are still missing.
I do see this error in the operator logs, though (present since upgrading to 0.6.0):
{"level":"error","ts":1669370944.5371487,"msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"scan-vulnerabilityreport-7cc6d7794b","namespace":"trivy"},"namespace":"trivy","name":"scan-vulnerabilityreport-7cc6d7794b","reconcileID":"0976a9b3-47b9-4e98-9347-5ecbae54e1e5","error":"illegal base64 data at input byte 0; unexpected EOF","errorCauses":[{"error":"illegal base64 data at input byte 0"},{"error":"unexpected EOF"}],"stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:326\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:234"}
Note that I have already disabled scanJobCompressLogs (by setting the value to false).
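For reference, a sketch of how that was set, assuming installation via the aqua/trivy-operator Helm chart (release name and namespace here are illustrative for my setup):

```shell
# Disable compression of scan job logs; the value key
# trivyOperator.scanJobCompressLogs is the one referenced in this issue.
helm upgrade trivy-operator aqua/trivy-operator \
  --namespace trivy \
  --set trivyOperator.scanJobCompressLogs=false
```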
What did you expect to happen:
I expect all resources to be correctly identified and scanned, and vulnerability reports to be generated for them. I also expect the operator to log a warning or error if it cannot start a scan for a particular resource.
Anything else you would like to add:
Environment:
- Trivy-Operator version: 0.7.1
- Kubernetes version (use `kubectl version`):
  Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T18:03:20Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
  Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.8", GitCommit:"83d00b7cbf10e530d1d4b2403f22413220c37621", GitTreeState:"clean", BuildDate:"2022-11-09T19:50:11Z", GoVersion:"go1.17.11", Compiler:"gc", Platform:"linux/amd64"}
- OS (macOS 10.15, Windows 10, Ubuntu 19.10 etc): Ubuntu 18.04
About this issue
- State: closed
- Created 2 years ago
- Comments: 32 (1 by maintainers)
It’s mostly CronJobs, but one (from the example above) is a Deployment (`trivy-operator.resource.kind=ReplicaSet`). My Kubernetes version is v1.25.3, and yes, the `trivy-operator.container-images` annotation of the scan job references a container from a different namespace 😦

The situation happens more often when there are parsing issues (`illegal base64 data`), but this can be mitigated with `trivyOperator.scanJobCompressLogs=false`.
@pschulten have a look at #509 and see if your completed, undeleted jobs have the mismatched labels/annotations described there. Also, which version of k8s are you on? There is a bug in certain versions of Kubernetes 1.24 (and maybe 1.23; fixed, I think, in 1.24.4 or .5) related to cleanup of errored jobs. Combined, those seem to cause trivy-operator to reach its concurrent scan job limit and stop processing new workloads.
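One quick way to check for that (a sketch: the `trivy-operator.resource.kind` label key is the one seen on scan jobs earlier in this thread, and `trivy` is the namespace from the logs above; adjust both for your setup):

```shell
# List the scan Jobs the operator created but has not cleaned up yet.
# The bare label key acts as an existence selector: it matches any Job
# carrying a trivy-operator.resource.kind label, whatever its value.
kubectl get jobs -n trivy -l trivy-operator.resource.kind

# Deleting a lingering completed Job frees a slot toward the operator's
# concurrent scan job limit, e.g.:
#   kubectl delete job scan-vulnerabilityreport-7cc6d7794b -n trivy
```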
I see, thanks. I assume the scan is halted when the number of scan jobs reaches maxConcurrent.
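If the limit itself is the bottleneck, it should be tunable via the chart. A sketch, assuming the Helm value is `operator.scanJobsConcurrentLimit` (verify the exact key against your chart version; release name and namespace are illustrative):

```shell
# Raise the concurrent scan job limit so a few stuck jobs do not
# block all new scans (value key assumed; check your chart's values).
helm upgrade trivy-operator aqua/trivy-operator \
  --namespace trivy \
  --set operator.scanJobsConcurrentLimit=20
```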
I will check this error:

"error":"getting logs for pod xyz: container xyz is not valid for pod scan-vulnerabilityreport-xyz"

to see if we can work around it and avoid the deletion and restart.

@chen-keinan unfortunately, when disabling compression, the reports might be too big and other errors occur. I think it is best to close this issue, as most of the reports are now in and missing reports are only an occasional occurrence. Also, I now understand better how the operator works and how to work around the different issues.
I look forward to more improvements in the future and will be closely following new features and fixes.
Also, thank you for your patience and for the work you do on this project.