triggers: Event listener stops accepting events after some time
Expected Behavior
I can always deliver events to an event listener
Actual Behavior
We enabled cloud events for taskruns in our dogfooding clusters. This means that every taskrun and condition that runs generates three events (start, running, passed/failed), and all of these events are sent to a single event listener in the dogfooding cluster.
A few of these events are selected to trigger pipelines:
- when start/pass/fail events are received from a taskrun that is part of a CI job, the GitHub check status is updated
- when fail events are received from a taskrun that is part of the config map CD service, a notification is sent to Slack
Twice in the past three weeks, since this service was enabled, the event listener stopped accepting events. Looking at the k8s events on one of the taskruns shows:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Started 28m TaskRun
Normal Pending 28m TaskRun Pending
Normal Pending 28m TaskRun pod status "Initialized":"False"; message: "containers with incomplete status: [place-scripts place-tools]"
Normal Pending 28m TaskRun pod status "Initialized":"False"; message: "containers with incomplete status: [place-tools]"
Normal Pending 28m TaskRun pod status "Ready":"False"; message: "containers with unready status: [step-git-source-source-sh9dr step-kubeconfig-dlvwz step-ln-dir-targetcluster-vv5r7 step-fetch-current-config step-deploy]"
Normal Running 28m TaskRun Not all Steps in the Task have finished executing
Normal Succeeded 28m TaskRun All Steps have completed executing
Warning Cloud Event Failure 28m (x13 over 28m) TaskRun Post http://el-tekton-events.default:8080: dial tcp 10.121.3.142:8080: connect: cannot assign requested address (10x)
The pod and service associated with the event listener show no obvious issue and appear to be healthy, and the event listener itself seems fine. Deleting and recreating the event listener resolves the issue.
Unfortunately I do not have more data to share about this: I had to find a quick fix to restore CI. Perhaps the Triggers team has an idea of what may be going wrong, or of what data we could collect to catch this issue the next time it happens.
My guess is that a relatively high volume of events triggers the issue, but it is not clear exactly in which component the issue happens.
Steps to Reproduce the Problem
Additional Info
- Triggers v0.5.0
- Kube v0.16
Commits related to this issue
- Add a IdleTimeout for EL sink Previously, we were never closing Idle connections leading to issues described in #687. This commit adds a fixed 2 minute timeout for idle connections though later we c... — committed to dibyom/triggers by dibyom 4 years ago
- Add a IdleTimeout for EL sink Previously, we were never closing Idle connections leading to issues described in #687. This commit adds a fixed 2 minute timeout for idle connections though later we c... — committed to tektoncd/triggers by dibyom 4 years ago
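The commits above add an IdleTimeout to the EL sink so that idle connections are eventually closed. As a rough sketch of that idea in plain net/http terms (the handler, port, and two-minute value here are placeholders, not the actual Triggers code):

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	// Hypothetical stand-in for the EventListener sink handler.
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusAccepted)
	})

	srv := &http.Server{
		Addr:    ":8080",
		Handler: mux,
		// Close keep-alive connections that sit idle for two minutes
		// instead of holding them established indefinitely.
		IdleTimeout: 2 * time.Minute,
	}
	log.Fatal(srv.ListenAndServe())
}
```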
I’ve seen this before, when failing to close a request body (running out of sockets).
I’ll take a look and see if that could be the case.
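For context, the usual way this kind of socket leak shows up in Go HTTP clients is a response body that is never drained and closed, which prevents the transport from reusing or releasing the underlying connection. A generic illustration, not the actual Tekton code:

```go
package example

import (
	"bytes"
	"io"
	"net/http"
)

// postEvent shows the drain-and-close pattern: without the io.Copy and
// Close, the underlying TCP connection cannot be reused or released.
func postEvent(client *http.Client, url string, payload []byte) error {
	resp, err := client.Post(url, "application/json", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}
```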
Cool! I started looking into https://github.com/tektoncd/pipeline/blob/386637407f6715750dd643a5c740ecd9b2380b7e/pkg/reconciler/events/cloudevent/cloudeventclient.go#L38 - that's where we create a new HTTP client every time, which comes with its own transport that I think reuses connections by default and so keeps them open. We could try to reuse the same client across reconcile cycles, or we could change the transport configuration to close connections.
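A sketch of both options in plain net/http terms, assuming each new client really does come with its own transport (the cloudevents SDK wiring may differ):

```go
package example

import (
	"net/http"
	"time"
)

// Option 1: build the client once and reuse it across reconcile cycles,
// so every send shares a single transport and its connection pool.
var sharedClient = &http.Client{
	Transport: &http.Transport{
		// Drop keep-alive connections that sit unused for a while
		// instead of keeping them established forever.
		IdleConnTimeout:     90 * time.Second,
		MaxIdleConnsPerHost: 10,
	},
	Timeout: 10 * time.Second,
}

// Option 2: if a fresh client per send is unavoidable, configure its
// transport so connections are not kept open after the request finishes.
func newOneShotClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{DisableKeepAlives: true},
		Timeout:   10 * time.Second,
	}
}
```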
I did some more tests with the pipeline controller, and I can confirm that the HTTP client is only created once. However, the original HTTP client does not have the target set (it is set in https://github.com/tektoncd/pipeline/blob/386637407f6715750dd643a5c740ecd9b2380b7e/pkg/reconciler/events/event.go#L55), so I suspect the client will create a new transport each time Emit is invoked. Reusing the same client should help, I think?
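One way to verify whether connections are actually reused across Emit calls is net/http/httptrace, which reports for each request whether it got a fresh or a recycled connection. A small generic probe (not part of the Tekton code):

```go
package example

import (
	"io"
	"log"
	"net/http"
	"net/http/httptrace"
)

// tracedGet logs whether the request reused an existing connection.
// If every send logs reused=false, each call opens a new TCP connection.
func tracedGet(client *http.Client, url string) error {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) {
			log.Printf("conn to %s reused=%v wasIdle=%v",
				info.Conn.RemoteAddr(), info.Reused, info.WasIdle)
		},
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}
```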
Also, it seems like the high memory usage in Triggers was due to the large number of established connections. If I delete/restart the pipeline controller, the number of connections drops immediately, and so does the memory usage.
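Restarting the controller effectively tears down its idle connection pool. If the transport is reachable in code, the programmatic equivalent would be to close idle connections periodically; a rough sketch, assuming a plain *http.Transport is available:

```go
package example

import (
	"net/http"
	"time"
)

// closeIdleLoop mimics what a controller restart does to the connection
// pool (dropping idle connections) without actually restarting the pod.
func closeIdleLoop(t *http.Transport, every time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			t.CloseIdleConnections()
		case <-stop:
			return
		}
	}
}
```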
I was able to somewhat repro this: I set up a pipelines controller to send cloud events to el-tekton-events in my cluster. I used a modified version of the sink image so I could get a shell. Some findings:
- Each cloud event call seems to create a new TCP connection.
- These connections stay established forever! I ran a bunch of TaskRun calls yesterday, and all of those connections were still in the established state this morning.
- I could also see the memory footprint steadily increasing, though the EL did not stall.
- I instrumented the code and added some profiling, though I have yet to go through the flame graphs (a minimal pprof hookup is sketched after this list).
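For reference, a minimal way to expose Go's built-in profiler in a long-running component such as the sink, assuming nothing like it is already wired in:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Serve pprof on a side port so heap, goroutine, and CPU profiles can
	// be pulled with `go tool pprof` while the component keeps running.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {} // stand-in for the component's real work
}
```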
Next Steps
Alternatively, it could also be that at busy times the pod takes too long to answer the health checks and is therefore restarted, which causes lost messages. I wonder if this could also leave connections open on the source side.