triggers: Event listener stops accepting events after some time
Expected Behavior
I can always deliver events to an event listener
Actual Behavior
We enabled cloud events for taskruns in our dogfooding clusters. This means that every taskrun and condition that runs generates three events (start, running, passed/failed), and all of these events are sent to a single event listener in the dogfooding cluster.
A few of these events are selected to trigger pipelines:
- when start/pass/fail events are received from a taskrun that is part of a CI job, the GitHub check status is updated
- when fail events are received from a taskrun that is part of the config map CD service, a notification is sent to Slack
Twice in the past three weeks, since this service was enabled, the event listener stopped accepting events. Looking at the k8s events on one of the taskruns shows:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Started 28m TaskRun
Normal Pending 28m TaskRun Pending
Normal Pending 28m TaskRun pod status "Initialized":"False"; message: "containers with incomplete status: [place-scripts place-tools]"
Normal Pending 28m TaskRun pod status "Initialized":"False"; message: "containers with incomplete status: [place-tools]"
Normal Pending 28m TaskRun pod status "Ready":"False"; message: "containers with unready status: [step-git-source-source-sh9dr step-kubeconfig-dlvwz step-ln-dir-targetcluster-vv5r7 step-fetch-current-config step-deploy]"
Normal Running 28m TaskRun Not all Steps in the Task have finished executing
Normal Succeeded 28m TaskRun All Steps have completed executing
Warning Cloud Event Failure 28m (x13 over 28m) TaskRun Post http://el-tekton-events.default:8080: dial tcp 10.121.3.142:8080: connect: cannot assign requested address (10x)
The pod and service associated with the event listener show no obvious issue and appear to be healthy, and the event listener itself seems fine. Deleting and recreating the event listener resolves the issue.
Unfortunately I do not have more data to share about this: I had to find a quick fix to restore CI. Perhaps the Triggers team has an idea of what may be going wrong, or of what data we could collect to catch this issue the next time it happens.
My guess is that a relatively high volume of events triggers the issue, but it is not clear exactly in which component the issue happens.
Steps to Reproduce the Problem
Additional Info
- Triggers v0.5.0
- Kube v0.16
Commits related to this issue
- Add a IdleTimeout for EL sink Previously, we were never closing Idle connections leading to issues described in #687. This commit adds a fixed 2 minute timeout for idle connections though later we c... — committed to dibyom/triggers by dibyom 4 years ago
- Add a IdleTimeout for EL sink Previously, we were never closing Idle connections leading to issues described in #687. This commit adds a fixed 2 minute timeout for idle connections though later we c... — committed to tektoncd/triggers by dibyom 4 years ago
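The commits above add an IdleTimeout to the EL sink so that idle connections are eventually closed. As a rough sketch of that idea in plain net/http terms (the handler, port, and two-minute value here are placeholders, not the actual Triggers code):

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	// Hypothetical stand-in for the EventListener sink handler.
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusAccepted)
	})

	srv := &http.Server{
		Addr:    ":8080",
		Handler: mux,
		// Close keep-alive connections that sit idle for two minutes
		// instead of holding them established indefinitely.
		IdleTimeout: 2 * time.Minute,
	}
	log.Fatal(srv.ListenAndServe())
}
```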
I’ve seen this before, when failing to close a request body (running out of sockets).
I’ll take a look and see if that could be the case.
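For context, the usual way this kind of socket leak shows up in Go HTTP clients is a response body that is never drained and closed, which prevents the transport from reusing or releasing the underlying connection. A generic illustration, not the actual Tekton code:

```go
package example

import (
	"bytes"
	"io"
	"net/http"
)

// postEvent shows the drain-and-close pattern: without the io.Copy and
// Close, the underlying TCP connection cannot be reused or released.
func postEvent(client *http.Client, url string, payload []byte) error {
	resp, err := client.Post(url, "application/json", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}
```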
Cool! I started looking into https://github.com/tektoncd/pipeline/blob/386637407f6715750dd643a5c740ecd9b2380b7e/pkg/reconciler/events/cloudevent/cloudeventclient.go#L38 - that's where we create a new HTTP client every time, which comes with its own transport that I think reuses connections by default and so keeps them open. We could try to reuse the same client across reconcile cycles, or we could change the transport configuration to close connections.
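A sketch of both options in plain net/http terms, assuming each new client really does come with its own transport (the cloudevents SDK wiring may differ):

```go
package example

import (
	"net/http"
	"time"
)

// Option 1: build the client once and reuse it across reconcile cycles,
// so every send shares a single transport and its connection pool.
var sharedClient = &http.Client{
	Transport: &http.Transport{
		// Drop keep-alive connections that sit unused for a while
		// instead of keeping them established forever.
		IdleConnTimeout:     90 * time.Second,
		MaxIdleConnsPerHost: 10,
	},
	Timeout: 10 * time.Second,
}

// Option 2: if a fresh client per send is unavoidable, configure its
// transport so connections are not kept open after the request finishes.
func newOneShotClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{DisableKeepAlives: true},
		Timeout:   10 * time.Second,
	}
}
```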
I did some more tests with the pipeline controller, and I can confirm that the HTTP client is only created once. However, the original HTTP client does not have the target set (it is set in https://github.com/tektoncd/pipeline/blob/386637407f6715750dd643a5c740ecd9b2380b7e/pkg/reconciler/events/event.go#L55), so I suspect the client will create a new transport each time Emit is invoked. Reusing the same client should help, I think?
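One way to verify whether connections are actually reused across Emit calls is net/http/httptrace, which reports for each request whether it got a fresh or a recycled connection. A small generic probe (not part of the Tekton code):

```go
package example

import (
	"io"
	"log"
	"net/http"
	"net/http/httptrace"
)

// tracedGet logs whether the request reused an existing connection.
// If every send logs reused=false, each call opens a new TCP connection.
func tracedGet(client *http.Client, url string) error {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) {
			log.Printf("conn to %s reused=%v wasIdle=%v",
				info.Conn.RemoteAddr(), info.Reused, info.WasIdle)
		},
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}
```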
Also, it seems like the high memory usage in Triggers was due to the large number of established connections. If I delete/restart the pipeline controller, the number of connections drops immediately, and so does the memory usage.
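Restarting the controller effectively tears down its idle connection pool. If the transport is reachable in code, the programmatic equivalent would be to close idle connections periodically; a rough sketch, assuming a plain *http.Transport is available:

```go
package example

import (
	"net/http"
	"time"
)

// closeIdleLoop mimics what a controller restart does to the connection
// pool (dropping idle connections) without actually restarting the pod.
func closeIdleLoop(t *http.Transport, every time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			t.CloseIdleConnections()
		case <-stop:
			return
		}
	}
}
```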
I was able to somewhat repro this: I set up a pipelines controller to send cloud events to el-tekton-events in my cluster. I used a modified version of the sink image so I could get a shell. Some findings:
- Each cloud event call seems to create a new TCP connection.
- These connections stay established forever! I ran a bunch of TaskRun calls yesterday, and all of those connections were still in the established state this morning.
- I could also see the memory footprint steadily increasing, though the EL did not stall.
- I instrumented the code and added some profiling, though I have yet to go through the flame graphs (a minimal pprof hookup is sketched after this list).
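For reference, a minimal way to expose Go's built-in profiler in a long-running component such as the sink, assuming nothing like it is already wired in:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Serve pprof on a side port so heap, goroutine, and CPU profiles can
	// be pulled with `go tool pprof` while the component keeps running.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {} // stand-in for the component's real work
}
```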
Next Steps
Alternatively, it could also be that at busy times the pod takes too long to answer the health checks and is therefore restarted, which causes lost messages. I wonder if this could also leave connections open on the source side.