triggers: Interceptor fails to get secrets, which leads to unrecoverable EventListener failures
Since upgrading to Tekton Triggers 0.21 we have noticed that our EventListeners suddenly stop working and the error logs fill up with:
{"level":"error","ts":"2022-11-03T11:49:19.299Z","logger":"eventlistener","caller":"sink/sink.go:381","msg":"Post \"https://tekton-triggers-core-interceptors.tekton-pipelines.svc:8443/cel\": x509: certificate signed by unknown authority (possibly because of \"x509: ECDSA verification failure\" while trying to verify candidate authority certificate \"tekton-triggers-core-interceptors.tekton-pipelines.svc\")","eventlistener":"mylistener","namespace":"app-prod","/triggers-eventid":"20b3c97f-90e2-4f85-8fa9-05dffc7259a9","eventlistenerUID":"07ef4762-72d6-4e0f-8813-7bd84ca49d07","/triggers-eventid":"20b3c97f-90e2-4f85-8fa9-05dffc7259a9","/trigger":"rake-task","stacktrace":"github.com/tektoncd/triggers/pkg/sink.Sink.processTrigger\n\tgithub.com/tektoncd/triggers/pkg/sink/sink.go:381\ngithub.com/tektoncd/triggers/pkg/sink.Sink.HandleEvent.func1\n\tgithub.com/tektoncd/triggers/pkg/sink/sink.go:196"}
The errors continue until the EventListener pod is killed.
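For anyone hitting the same thing, a rough way to check whether the listener is simply holding a stale CA is to compare the CA bundle registered on the CEL ClusterInterceptor with the certificate the core-interceptors service is actually serving. This is only a diagnostic sketch: the ClusterInterceptor name `cel`, the `spec.clientConfig.caBundle` path, and the namespaces reflect a default install and may differ by Triggers version.

```
# Diagnostic sketch (assumptions: default install in tekton-pipelines, a
# ClusterInterceptor named "cel", and the spec.clientConfig.caBundle field --
# verify these against your own cluster/version).

# CA bundle the EventListener trusts for the CEL interceptor
# (if the bundle contains several certs, only the first is printed):
kubectl get clusterinterceptor cel \
  -o jsonpath='{.spec.clientConfig.caBundle}' | base64 -d \
  | openssl x509 -noout -subject -fingerprint -dates

# Certificate the core-interceptors service is serving right now:
kubectl -n tekton-pipelines port-forward svc/tekton-triggers-core-interceptors 8443:8443 &
sleep 2
openssl s_client -connect 127.0.0.1:8443 </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -fingerprint -dates
kill %1
```

If the fingerprints differ, the listener is verifying against a CA that no longer matches the serving certificate.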
Looking around, I noticed that at the same time the EventListener starts erroring, the core-interceptors pod logs the following:
{"level":"info","ts":1667570794.8050323,"caller":"server/server.go:150","msg":"Interceptor response is: &{Extensions:map[] Continue:true Status:{Code:OK Message:}}"}
{"level":"info","ts":1667570867.9983332,"caller":"server/server.go:150","msg":"Interceptor response is: &{Extensions:map[] Continue:false Status:{Code:FailedPrecondition Message:expression has(body.task) did not return true}}"}
{"level":"info","ts":1667570867.998569,"caller":"server/server.go:150","msg":"Interceptor response is: &{Extensions:map[] Continue:true Status:{Code:OK Message:}}"}
{"level":"info","ts":1667570891.6357741,"caller":"server/server.go:150","msg":"Interceptor response is: &{Extensions:map[] Continue:false Status:{Code:FailedPrecondition Message:expression has(body.task) did not return true}}"}
{"level":"info","ts":1667521790.5477786,"logger":"fallback","caller":"injection/injection.go:61","msg":"Starting informers..."}
W1104 00:30:20.548510 1 reflector.go:324] runtime/asm_amd64.s:1571: failed to list *v1.Secret: Get "https://172.20.0.1:443/api/v1/secrets?limit=500&resourceVersion=0": dial tcp 172.20.0.1:443: i/o timeout
I1104 00:30:20.548708 1 trace.go:205] Trace[1298498081]: "Reflector ListAndWatch" name:runtime/asm_amd64.s:1571 (04-Nov-2022 00:29:50.547) (total time: 30000ms):
Trace[1298498081]: ---"Objects listed" error:Get "https://172.20.0.1:443/api/v1/secrets?limit=500&resourceVersion=0": dial tcp 172.20.0.1:443: i/o timeout 30000ms (00:30:20.548)
Trace[1298498081]: [30.000783773s] [30.000783773s] END
E1104 00:30:20.548730 1 reflector.go:138] runtime/asm_amd64.s:1571: Failed to watch *v1.Secret: failed to list *v1.Secret: Get "https://172.20.0.1:443/api/v1/secrets?limit=500&resourceVersion=0": dial tcp 172.20.0.1:443: i/o timeout
{"level":"info","ts":1667521823.7331553,"caller":"interceptors/main.go:178","msg":"Listen and serve on port 8443"}
2022/11/04 00:35:27 http: TLS handshake error from 10.40.62.237:35114: remote error: tls: bad certificate
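Given that the failure starts right after the secret-list timeout and the interceptor re-announcing "Listen and serve on port 8443", one thing worth checking when this happens is whether the interceptor's serving-cert secret is newer than the EventListener pod, which would mean the listener is still trusting a CA that has since been replaced. Sketch only; the exact secret name varies between Triggers versions and the `el-<name>` deployment naming is the Triggers default:

```
# Sketch only -- adjust names/namespaces for your install.

# Find the serving-cert secret used by the core interceptors and note its age:
kubectl -n tekton-pipelines get secrets | grep -i interceptor

# Compare with the EventListener pod's age; if the secret is newer than the pod,
# the listener is likely still holding the CA it loaded at startup:
kubectl -n app-prod get pods | grep el-mylistener
```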
Expected Behavior
Ideally it should not time out, but if it does, then once it is able to list/verify things again the EventListener should reconnect correctly, OR the EventListener should become unhealthy so that it can be automatically killed and a new EventListener pod started.
Actual Behavior
The EventListener is never able to reconnect and throws the error x509: certificate signed by unknown authority (possibly because of \"x509: ECDSA verification failure\" while trying to verify candidate authority certificate over and over again (for hours) until the EventListener pod is manually killed.
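For completeness, the manual recovery is just bouncing the listener pod, along these lines (the `el-mylistener` name follows the `el-<EventListener>` convention Triggers uses for the generated deployment, so adjust for your listener):

```
# Manual workaround sketch: restart the generated EventListener deployment
# so a fresh pod picks up the current interceptor certs.
kubectl -n app-prod rollout restart deployment/el-mylistener

# Or delete the pod directly and let the deployment recreate it
# (the eventlistener=<name> label is an assumption; verify with --show-labels):
kubectl -n app-prod delete pod -l eventlistener=mylistener
```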
Steps to Reproduce the Problem
No idea how to reproduce it.
Additional Info
- Kubernetes version:
**Output of `kubectl version`:**
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.3", GitCommit:"816c97ab8cff8a1c72eccca1026f7820e93e0d25", GitTreeState:"clean", BuildDate:"2022-01-25T21:25:17Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"23+", GitVersion:"v1.23.13-eks-fb459a0", GitCommit:"55bd5d5cb7d32bc35e4e050f536181196fb8c6f7", GitTreeState:"clean", BuildDate:"2022-10-24T20:35:40Z", GoVersion:"go1.17.13", Compiler:"gc", Platform:"linux/amd64"}
- Tekton Pipeline version:
**Output of `tkn version` or `kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'`**
Client version: 0.27.0
Pipeline version: v0.40.2
Triggers version: v0.21.0
Dashboard version: v0.29.2
We are running in AWS EKS. We have been using Tekton for ~2 years and have never seen the EventListener behave in this manner before.
My comment didn't take.
After upgrading to Triggers v0.22.2 and letting it run for ~6 days, this issue has not recurred. Since we could not reliably reproduce it outside of our production environment, and it is no longer happening after the upgrade, I am closing this issue.