aws-for-fluent-bit: Fluent Bit sidecar is killed because of a networking error
Hey AWS team, I want to flag an issue we are currently hitting in our production system. We are using the following software versions:
- aws-for-fluent-bit: 2.6.1
- docker: 18.09.9-ce
- ecs agent: 1.36.2
Our tasks run in EC2 launch mode; I didn't check the behavior with Fargate.
The Fluent Bit sidecar container was killed with exit code 139 (SIGSEGV), and because it is marked as an essential container, our task suddenly stopped.
Fluent Bit logs during the crash
[2020/08/17 07:32:29] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
[engine] caught signal (SIGSEGV)
[2020/08/17 07:33:00] [error] [tls] SSL error: NET - Connection was reset by peer
[2020/08/17 07:33:00] [error] [src/flb_http_client.c:1077 errno=25] Inappropriate ioctl for device
[2020/08/17 07:33:00] [error] [output:datadog:datadog.1] could not flush records to http-intake.logs.datadoghq.com:443 (http_do=-1)
Docker daemon logs during the crash
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.678373199Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.910700627Z" level=error msg="Failed to log msg \"\" for logger fluentd: write unix @->/var/run/fluent.sock: write: broken pipe"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.926290270Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.931976771Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.932007976Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.932909970Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.933183296Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.933869906Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.979820912Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.979910901Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.980036496Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.983592287Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.992278230Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:01 <ip> dockerd[3364]: time="2020-08-17T07:33:01.112475840Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:01 <ip> dockerd[3364]: time="2020-08-17T07:33:01.112527938Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:01 <ip> dockerd[3364]: time="2020-08-17T07:33:01.115452827Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:01 <ip> dockerd[3364]: time="2020-08-17T07:33:01.132519235Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:01 <ip> dockerd[3364]: time="2020-08-17T07:33:01.132559388Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:01 <ip> dockerd[3364]: time="2020-08-17T07:33:01.136487318Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:01 <ip> dockerd[3364]: time="2020-08-17T07:33:01.140677989Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:01 <ip> dockerd[3364]: time="2020-08-17T07:33:01.295414547Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Fluent Bit metrics
Memory and CPU usage for the Fluent Bit logging sidecar are stable over time.

There are no errors in our Prometheus metrics for the Fluent Bit container around 09:33 CEST (07:33 UTC), but that might just be because the container crashed and the metrics were not scraped. In any case, I'm sharing the screenshot to show that there are a number of Datadog errors over the last 3 hours (the query interval is 10 minutes).

I'm not sure whether this is related to https://github.com/aws/aws-for-fluent-bit/issues/63 - the error messages are different, but the behavior seems similar.
About this issue
- State: closed
- Created 4 years ago
- Comments: 43 (20 by maintainers)
@angulito It's simple; the net.keepalive option is configured just like any other option.
For example, your logConfiguration could be:
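(The snippet below is a sketch with placeholder values rather than the exact config from this thread; the point is that net.keepalive goes into the options map like any other plugin key.)

```json
"logConfiguration": {
    "logDriver": "awsfirelens",
    "options": {
        "Name": "datadog",
        "Host": "http-intake.logs.datadoghq.com",
        "TLS": "on",
        "apikey": "<your-datadog-api-key>",
        "provider": "ecs",
        "net.keepalive": "false"
    }
}
```

With FireLens, each key in options is rendered into the generated [OUTPUT] section, so net.keepalive ends up right next to the Datadog settings.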
@florent-tails We also released aws-for-fluent-bit 2.7.0 (which includes fluent-bit 1.5.6) yesterday. If you want to play with it and report back to us, that would be great.
Hey @PettitWesley and @truthbk, I was finally able to reproduce the issue using valgrind. Here are the Fluent Bit logs: fluent-bit-restart-logs.txt
Hope it helps. Let me know if I can help with anything else!
fluent/fluent-bit 1.5.6 has been released with some more fixes: https://fluentbit.io/announcements/v1.5.6/
We have been a bit busy lately… an AWS for Fluent Bit release should come sometime next week.
We don’t have a solution yet, but the suggestion in the PR is to try turning off keepalive to prevent the bad code path from being executed:
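For anyone running a custom Fluent Bit config file rather than the FireLens-generated one, that amounts to something like the following (a sketch; the Datadog options shown are placeholders):

```
[OUTPUT]
    Name          datadog
    Match         *
    Host          http-intake.logs.datadoghq.com
    TLS           on
    apikey        <your-datadog-api-key>
    # Disable connection reuse to avoid the code path suspected in the crash
    net.keepalive off
```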
The behavior of different servers/endpoints must somehow influence this… we have two reports from Datadog users, and none from users of AWS destinations.
We have also been trying to reproduce this on Datadog's end, so far to no avail.
We have tried to simulate connection failures and were able to force TCP resets by the peer:
But these were not followed by segmentation faults or "inappropriate ioctl" errors (though those may just be a red herring; they are not uncommon in TTY operations, but they should not crash the process).
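For reference, one way to force that kind of reset against a local client (a sketch, not necessarily the method used above) is an iptables rule that answers the client's outbound packets with a TCP RST:

```sh
# Illustrative only: have the kernel answer outbound TLS traffic with a RST,
# so the logging client sees "Connection reset by peer" on an established socket.
sudo iptables -A OUTPUT -p tcp --dport 443 -j REJECT --reject-with tcp-reset
# ...reproduce, then remove the rule:
sudo iptables -D OUTPUT -p tcp --dport 443 -j REJECT --reject-with tcp-reset
```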
It seems to me like this might be triggered not just by networking events/failures, but possibly also by a race condition, in which case increasing the log rate might help with the repro. If we go that route, we have to take the side effects of valgrind into account (as amazing a tool as it is): because valgrind is essentially a VM that recompiles the binary and adds overhead that slows down execution, it often makes race conditions harder to trigger. It may be easier to enable core dumps if we can't get good results with valgrind.
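For what it's worth, enabling core dumps on an EC2 container instance is roughly the following (a sketch; the paths, image tag, and run command are illustrative):

```sh
# 1. Give core files a predictable name (host-level sysctl).
echo '/tmp/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern
# 2. Run the sidecar with an unlimited core size; ECS exposes the same knob via the
#    container definition's "ulimits" (name "core").
docker run --ulimit core=-1 amazon/aws-for-fluent-bit:2.6.1
# 3. After a crash, open the resulting core file against the matching fluent-bit
#    binary (gdb must be installed where you inspect it) and run "bt" for a backtrace.
gdb /fluent-bit/bin/fluent-bit /tmp/core.fluent-bit.<pid>
```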
I think it’s safe to assume we’re all trying to reproduce on aws-for-fluent-bit: v2.6.1 (thus fluent-bit v1.5.2).
Also, I have come across a very similar case with the Elasticsearch Fluent Bit plugin, https://github.com/fluent/fluent-bit/issues/2416, so perhaps the issue is in the Fluent Bit core and not in the plugins themselves.
@PettitWesley I am getting a lot of these error messages with v1.5.5:
Once I downgrade back to v1.5.4, I don’t see these “Resource temporarily unavailable” errors but I still see “Inappropriate ioctl for device”.
While on v1.5.5, I turned on debug mode and got some related logs:
Looking at the GCP Cloud Log API stats, I am only sending 20 logs per second.
@angulito @PettitWesley the valgrind output suggests that we are indeed looking at the same issue we are trying to solve here: https://github.com/fluent/fluent-bit/pull/2507. That PR is still working toward the root cause; we are clearly freeing memory through a bad pointer, and it seems to be a product of destroying the connection twice and of lingering events in the event queue that should have been purged. The discussion is definitely going places, thank you @PettitWesley 🙇
Let me know if I can help in any way.
@angulito Here it is:
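A sketch of that kind of image: layer valgrind on top of the release image and wrap the entrypoint (the tag, paths, and flags below assume the stock aws-for-fluent-bit layout, so adjust them to your setup):

```Dockerfile
FROM amazon/aws-for-fluent-bit:2.6.1
# Add valgrind on top of the release image
RUN yum install -y valgrind
# Run Fluent Bit under valgrind so a SIGSEGV comes with the offending stack trace.
# Append any -e <plugin>.so flags from the stock entrypoint if your config needs the AWS Go plugins.
ENTRYPOINT ["valgrind", "--leak-check=full", "/fluent-bit/bin/fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.conf"]
```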
If you can catch the crash with this, Valgrind will tell you where in the code it originated from.
I set up two FireLens tasks with Datadog outputs last night to try to repro this. One is just the latest AWS for Fluent Bit image; the other is a custom image built with valgrind, which should help diagnose the segfault if it can be reproduced.