aws-for-fluent-bit: Crashes with "Could not find sequence token in response: response body is empty"

Hi folks,

We’ve been running the 2.12.0 release to ship our logs to CloudWatch with the new cloudwatch_logs plugin. We’d been waiting for the fix for STS token renewal, so this is our first outing with it.
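
For reference, each of our CloudWatch outputs looks roughly like this (the region, match pattern, and group/stream names below are illustrative, not our exact values):

[OUTPUT]
    # one output block per log type; values here are placeholders
    Name              cloudwatch_logs
    Match             application.*
    region            us-west-2
    log_group_name    /aws/eks/<redacted>/application
    log_stream_prefix fluent-bit-
    auto_create_group On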

After running reliably for several hours, several of our pods have crashed with:

[2021/03/09 22:21:54] [ info] [output:cloudwatch_logs:cloudwatch_logs.5] Created log stream fluent-bit-z64wb.application.<redacted>.log
[2021/03/09 22:22:55] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Could not find sequence token in response: response body is empty
[2021/03/09 22:22:55] [error] [src/flb_http_client.c:1163 errno=32] Broken pipe
[2021/03/09 22:22:55] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send log events
[2021/03/09 22:22:55] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send log events
[2021/03/09 22:22:55] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send events
[2021/03/09 22:22:56] [error] [output:cloudwatch_logs:cloudwatch_logs.4] Could not find sequence token in response: response body is empty
[lib/chunkio/src/cio_file.c:786 errno=9] Bad file descriptor
[2021/03/09 22:22:56] [error] [storage] [cio_file] error setting new file size on write
[2021/03/09 22:22:56] [error] [input chunk] error writing data from tail.5 instance
[lib/chunkio/src/cio_file.c:786 errno=9] Bad file descriptor
[2021/03/09 22:22:56] [error] [storage] [cio_file] error setting new file size on write
[2021/03/09 22:22:56] [error] [input chunk] error writing data from tail.5 instance
[2021/03/09 22:23:02] [ warn] [engine] failed to flush chunk '1-1615328565.648760323.flb', retry in 8 seconds: task_id=2, input=tail.5 > output=cloudwatch_logs.5 (out_id=5)

After that it exits with an error status and Kubernetes replaces the pod.

Curiously, several replicas of Fluent Bit failed with the same error at once, which makes me wonder if the CloudWatch API was briefly unavailable. But if so, I’d expect it to retry rather than take down the whole Fluent Bit replica.
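
For what it’s worth, we haven’t overridden the retry behaviour on these outputs; as far as I understand, that’s controlled per output with Retry_Limit, along the lines of (illustrative, not our exact config):

[OUTPUT]
    Name        cloudwatch_logs
    Match       application.*
    # generic per-output retry knob; we have left it at the default
    Retry_Limit 2

So a failed flush should just be retried, like the “retry in 8 seconds” line at the end of the log above, rather than the whole process exiting.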

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 6
  • Comments: 58 (26 by maintainers)

Most upvoted comments

@byrneo @Funkerman1992 and everyone: an update on this.

So basically we have two fixes in progress:

  1. The fix I shared in the previous comment, where I posted an image. That was a stop-gap measure I introduced to immediately auto-retry these invalid requests. I think it helps, but it doesn’t fix the root cause.
  2. For the actual root cause, we are still uncertain, but we have made some progress. We have found a number of networking issues in the core of Fluent Bit which affect all of the AWS plugins. We’ve seen a number of reports which I suspect might all be caused by the same set of core networking bugs. We are working on fixing those. Hopefully this will permanently and fully fix the issue.

All of these fixes will take some time to make their way upstream. Right now, everyone can use our branches and pre-release/test builds if they want.

Core Network Fix Only Build

Code is here: https://github.com/krispraws/fluent-bit/commits/v1_7_5_openssl_fix

Image is here: 144718711470.dkr.ecr.us-west-2.amazonaws.com/core-network-fixes:1.7.5

Pull it with:

ecs-cli pull --region us-west-2 --registry-id 144718711470 144718711470.dkr.ecr.us-west-2.amazonaws.com/core-network-fixes:1.7.5

Core Network Fix with Sequence Token Stop-Gap Build

Code is here: https://github.com/PettitWesley/fluent-bit/tree/v1_7_5_openssl_fix_sequence_token

Image is here: 144718711470.dkr.ecr.us-west-2.amazonaws.com/core-network-fixes:1.7.5-sequence-token-stop-gap

Pull it with:

ecs-cli pull --region us-west-2 --registry-id 144718711470 144718711470.dkr.ecr.us-west-2.amazonaws.com/core-network-fixes:1.7.5-sequence-token-stop-gap

Hope this helps; let me know what you see.

So I have made some progress in understanding the root cause of this issue. No promises, but we might have a fix by next week.

@byrneo Yeah, this was contributed upstream, and it’s much safer to use the newest release than my old image.

@rpalanisamy The networking fixes have been included in 2.20.0. Please try out that version.

The sequence token stop-gap wasn’t included in that release; it may land in a future one soon. I’m hoping that solving the networking issues will solve this, and that the stop-gap fix won’t be needed.

@byrneo Please try:

144718711470.dkr.ecr.us-west-2.amazonaws.com/invalid-request-possible-fix:1.8.6

Hey folks, I had another idea on this one. If anyone is willing, please try out the following image and let me know what you see in the logs:

144718711470.dkr.ecr.us-west-2.amazonaws.com/invalid-request-possible-fix:latest

This image has the repo policy here and can be pulled from any AWS account.
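
The same ecs-cli pull pattern as the builds above should work here, e.g.:

ecs-cli pull --region us-west-2 --registry-id 144718711470 144718711470.dkr.ecr.us-west-2.amazonaws.com/invalid-request-possible-fix:latest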

I think this patch might reduce the frequency with which you see these errors.

Same problem here.

Out of 10 Fluent Bit pods, 8 were stuck with this error. Restarting the pods magically fixed everything, but only for a while.
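
By “restarting” I just mean a plain rollout restart of the DaemonSet; adjust the namespace and DaemonSet name to your deployment, the ones below are only examples:

kubectl -n logging rollout restart daemonset/fluent-bit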