aws-for-fluent-bit: Fluent Bit sidecar is killed because of a networking error
Hey AWS team, I want to flag an issue we are currently hitting in our production system. We are using the following software versions:
- aws-for-fluent-bit: 2.6.1
- docker: 18.09.9-ce
- ecs agent: 1.36.2
Our tasks run in EC2 launch mode; I didn't check the behavior with Fargate.
The Fluent Bit sidecar container was killed with exit code 139 (SIGSEGV), and because it is marked as an essential container, our task suddenly stopped.
Fluent Bit logs during the crash
[2020/08/17 07:32:29] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
[engine] caught signal (SIGSEGV)
[2020/08/17 07:33:00] [error] [tls] SSL error: NET - Connection was reset by peer
[2020/08/17 07:33:00] [error] [src/flb_http_client.c:1077 errno=25] Inappropriate ioctl for device
[2020/08/17 07:33:00] [error] [output:datadog:datadog.1] could not flush records to http-intake.logs.datadoghq.com:443 (http_do=-1)
Docker daemon logs during the crash
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.678373199Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.910700627Z" level=error msg="Failed to log msg \"\" for logger fluentd: write unix @->/var/run/fluent.sock: write: broken pipe"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.926290270Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.931976771Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.932007976Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.932909970Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.933183296Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.933869906Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.979820912Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.979910901Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.980036496Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.983592287Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.992278230Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:01 <ip> dockerd[3364]: time="2020-08-17T07:33:01.112475840Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:01 <ip> dockerd[3364]: time="2020-08-17T07:33:01.112527938Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:01 <ip> dockerd[3364]: time="2020-08-17T07:33:01.115452827Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:01 <ip> dockerd[3364]: time="2020-08-17T07:33:01.132519235Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:01 <ip> dockerd[3364]: time="2020-08-17T07:33:01.132559388Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:01 <ip> dockerd[3364]: time="2020-08-17T07:33:01.136487318Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:01 <ip> dockerd[3364]: time="2020-08-17T07:33:01.140677989Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:01 <ip> dockerd[3364]: time="2020-08-17T07:33:01.295414547Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Fluent Bit metrics
Memory and CPU usage for the Fluent Bit logging sidecar are stable over time.

There are no errors in our Prometheus metrics for the Fluent Bit container around 09:33 CEST (07:33 UTC), but that might just be because the container crashed and the metrics were not scraped. In any case, I'm sharing the screenshot to show that there are a number of Datadog errors over the last 3 hours (the query interval is 10 minutes).

I'm not sure whether this is related to https://github.com/aws/aws-for-fluent-bit/issues/63 - the error messages are different, but the behavior seems similar.
About this issue
- State: closed
- Created 4 years ago
- Comments: 43 (20 by maintainers)
@angulito It's simple; the net.keepalive option is configured just like any other option.
For example, your logConfiguration could be:
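(The snippet below is a sketch with placeholder values rather than the exact config from this thread; the point is that net.keepalive goes into the options map like any other plugin key.)

```json
"logConfiguration": {
    "logDriver": "awsfirelens",
    "options": {
        "Name": "datadog",
        "Host": "http-intake.logs.datadoghq.com",
        "TLS": "on",
        "apikey": "<your-datadog-api-key>",
        "provider": "ecs",
        "net.keepalive": "false"
    }
}
```

With FireLens, each key in options is rendered into the generated [OUTPUT] section, so net.keepalive ends up right next to the Datadog settings.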
@florent-tails We also released aws-for-fluent-bit 2.7.0 (which includes fluent-bit 1.5.6) yesterday. If you want to play with it and report back to us, that would be great.
Hey @PettitWesley and @truthbk, I was finally able to reproduce the issue using valgrind. Here are the Fluent Bit logs: fluent-bit-restart-logs.txt
Hope it helps. Let me know if I can help with anything else!
fluent/fluent-bit 1.5.6 has been released with some more fixes: https://fluentbit.io/announcements/v1.5.6/
We have been a bit busy lately… an AWS for Fluent Bit release should come sometime next week.
We don’t have a solution yet, but the suggestion in the PR is to try turning off keepalive to prevent the bad code path from being executed:
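For anyone running a custom Fluent Bit config file rather than the FireLens-generated one, that amounts to something like the following (a sketch; the Datadog options shown are placeholders):

```
[OUTPUT]
    Name          datadog
    Match         *
    Host          http-intake.logs.datadoghq.com
    TLS           on
    apikey        <your-datadog-api-key>
    # Disable connection reuse to avoid the code path suspected in the crash
    net.keepalive off
```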
The behavior of different servers/endpoints must somehow influence this… we have two reports from Datadog users, and none from users of AWS destinations.
We have also been trying to reproduce this on Datadog's end, so far to no avail.
We have tried to simulate connection failures and were able to force TCP resets by the peer:
But these were not followed by segmentation faults or "inappropriate ioctl" errors (though those may just be a red herring; they are not uncommon in TTY operations, but they should not crash the process).
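For reference, one way to force that kind of reset against a local client (a sketch, not necessarily the method used above) is an iptables rule that answers the client's outbound packets with a TCP RST:

```sh
# Illustrative only: have the kernel answer outbound TLS traffic with a RST,
# so the logging client sees "Connection reset by peer" on an established socket.
sudo iptables -A OUTPUT -p tcp --dport 443 -j REJECT --reject-with tcp-reset
# ...reproduce, then remove the rule:
sudo iptables -D OUTPUT -p tcp --dport 443 -j REJECT --reject-with tcp-reset
```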
It seems to me like this might be triggered not just by networking events/failures, but possibly also by a race condition, in which case increasing the log rate might help with the repro. If we go that route, we have to take the side effects of valgrind into account (as amazing a tool as it is): because valgrind is essentially a VM that recompiles the binary and adds overhead that slows down execution, it often makes race conditions harder to trigger. It may be easier to enable core dumps if we can't get good results with valgrind.
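For what it's worth, enabling core dumps on an EC2 container instance is roughly the following (a sketch; the paths, image tag, and run command are illustrative):

```sh
# 1. Give core files a predictable name (host-level sysctl).
echo '/tmp/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern
# 2. Run the sidecar with an unlimited core size; ECS exposes the same knob via the
#    container definition's "ulimits" (name "core").
docker run --ulimit core=-1 amazon/aws-for-fluent-bit:2.6.1
# 3. After a crash, open the resulting core file against the matching fluent-bit
#    binary (gdb must be installed where you inspect it) and run "bt" for a backtrace.
gdb /fluent-bit/bin/fluent-bit /tmp/core.fluent-bit.<pid>
```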
I think it’s safe to assume we’re all trying to reproduce on aws-for-fluent-bit: v2.6.1 (thus fluent-bit v1.5.2).
Also, I have come across a very similar case with the Elasticsearch Fluent Bit plugin, https://github.com/fluent/fluent-bit/issues/2416, so perhaps the issue is in the Fluent Bit core and not in the plugins themselves.
@PettitWesley I am getting a lot of these error messages with v1.5.5:
Once I downgrade back to v1.5.4, I don’t see these “Resource temporarily unavailable” errors but I still see “Inappropriate ioctl for device”.
While on v1.5.5, I turned on debug mode and got some related logs:
Looking at the GCP Cloud Log API stats, I am only sending 20 logs per second.
@angulito @PettitWesley the valgrind output suggests that we are indeed looking at the same issue we are trying to solve here: https://github.com/fluent/fluent-bit/pull/2507. That PR is still working toward the root cause; we are clearly freeing memory through a bad pointer, and it seems to be a product of destroying the connection twice and of lingering events in the event queue that should have been purged. The discussion is definitely going places, thank you @PettitWesley 🙇
Let me know if I can help in any way.
@angulito Here it is:
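A sketch of that kind of image: layer valgrind on top of the release image and wrap the entrypoint (the tag, paths, and flags below assume the stock aws-for-fluent-bit layout, so adjust them to your setup):

```Dockerfile
FROM amazon/aws-for-fluent-bit:2.6.1
# Add valgrind on top of the release image
RUN yum install -y valgrind
# Run Fluent Bit under valgrind so a SIGSEGV comes with the offending stack trace.
# Append any -e <plugin>.so flags from the stock entrypoint if your config needs the AWS Go plugins.
ENTRYPOINT ["valgrind", "--leak-check=full", "/fluent-bit/bin/fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.conf"]
```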
If you can catch the crash with this, Valgrind will tell you where in the code it originated from.
I set up two FireLens tasks with Datadog outputs last night to try to repro this. One is just the latest AWS for Fluent Bit image; the other is a custom image built with valgrind, which should help diagnose the segfault if it can be reproduced.