aws-for-fluent-bit: Crashes with "Could not find sequence token in response: response body is empty"
Hi folks,
We’ve been running the 2.12.0 release to ship our logs to CloudWatch with the new cloudwatch_logs plugin. We had been waiting for the STS token renewal fix, so this is our first outing with it.
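For context, a minimal cloudwatch_logs [OUTPUT] stanza looks something like the sketch below; all names, the region, and the config path are placeholders rather than our exact values.

```sh
# Sketch only: append a cloudwatch_logs output section to the Fluent Bit config.
# The group name, Match pattern, region, and file path are placeholders.
cat <<'EOF' >> /fluent-bit/etc/fluent-bit.conf
[OUTPUT]
    Name               cloudwatch_logs
    Match              application.*
    region             us-west-2
    log_group_name     /eks/example-cluster/application
    log_stream_prefix  fluent-bit-
    auto_create_group  On
    Retry_Limit        5
EOF
```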
After running reliably for several hours, several of our pods have crashed with:
[2021/03/09 22:21:54] [ info] [output:cloudwatch_logs:cloudwatch_logs.5] Created log stream fluent-bit-z64wb.application.<redacted>.log
[2021/03/09 22:22:55] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Could not find sequence token in response: response body is empty
[2021/03/09 22:22:55] [error] [src/flb_http_client.c:1163 errno=32] Broken pipe
[2021/03/09 22:22:55] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send log events
[2021/03/09 22:22:55] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send log events
[2021/03/09 22:22:55] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send events
[2021/03/09 22:22:56] [error] [output:cloudwatch_logs:cloudwatch_logs.4] Could not find sequence token in response: response body is empty
[lib/chunkio/src/cio_file.c:786 errno=9] Bad file descriptor
[2021/03/09 22:22:56] [error] [storage] [cio_file] error setting new file size on write
[2021/03/09 22:22:56] [error] [input chunk] error writing data from tail.5 instance
[lib/chunkio/src/cio_file.c:786 errno=9] Bad file descriptor
[2021/03/09 22:22:56] [error] [storage] [cio_file] error setting new file size on write
[2021/03/09 22:22:56] [error] [input chunk] error writing data from tail.5 instance
[2021/03/09 22:23:02] [ warn] [engine] failed to flush chunk '1-1615328565.648760323.flb', retry in 8 seconds: task_id=2, input=tail.5 > output=cloudwatch_logs.5 (out_id=5)
After that it exits with an error status and Kubernetes replaces the pod.
Curiously, several replicas of Fluent Bit failed with the same error at once, which makes me wonder whether the CloudWatch API was briefly unavailable. But if so, I’d expect it to retry the flush rather than take down the whole Fluent Bit replica.
@byrneo @Funkerman1992 and everyone: an update on this.
So basically we have two fixes in progress:
Both fixes will take some time to make their way upstream. Right now, everyone can use our branches and pre-release/test builds if they want.
Core Network Fix Only Build
Code is here: https://github.com/krispraws/fluent-bit/commits/v1_7_5_openssl_fix
Image is here:
144718711470.dkr.ecr.us-west-2.amazonaws.com/core-network-fixes:1.7.5
Pull it with:
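Roughly, assuming you have Docker and the AWS CLI v2 and your AWS credentials can reach the registry, something like:

```sh
# Sketch: log Docker in to the ECR registry, then pull the core-network-fixes test tag.
aws ecr get-login-password --region us-west-2 | \
  docker login --username AWS --password-stdin 144718711470.dkr.ecr.us-west-2.amazonaws.com
docker pull 144718711470.dkr.ecr.us-west-2.amazonaws.com/core-network-fixes:1.7.5
```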
Core Network Fix with Sequence Token Stop-Gap Build
Code is here: https://github.com/PettitWesley/fluent-bit/tree/v1_7_5_openssl_fix_sequence_token
Image is here:
144718711470.dkr.ecr.us-west-2.amazonaws.com/core-network-fixes:1.7.5-sequence-token-stop-gap
Pull it with:
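Again, roughly, with the same assumptions as above:

```sh
# Sketch: same ECR login as before, then pull the sequence-token stop-gap tag.
aws ecr get-login-password --region us-west-2 | \
  docker login --username AWS --password-stdin 144718711470.dkr.ecr.us-west-2.amazonaws.com
docker pull 144718711470.dkr.ecr.us-west-2.amazonaws.com/core-network-fixes:1.7.5-sequence-token-stop-gap
```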
Hope this helps. Let me know what you see.
So I have made some progress in understanding the root cause of this issue. No promises, but we might have a fix by next week.
@byrneo Yeah, this was contributed upstream, and it’s much safer to use the newest release than my old image.
@rpalanisamy The networking fixes have been included in 2.20.0. Please try out that version.
The sequence token stop-gap wasn’t included in that release; it may land in a future one soon. I’m hoping that fixing the networking issues will resolve this, and that the stop-gap fix won’t be needed.
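If you run Fluent Bit as a Kubernetes DaemonSet, a rough sketch of rolling to 2.20.0; the namespace, DaemonSet name, and container name below are placeholders for whatever your deployment uses:

```sh
# Sketch only: "logging" and "fluent-bit" are placeholders for your own namespace,
# DaemonSet, and container names.
kubectl -n logging set image daemonset/fluent-bit \
  fluent-bit=public.ecr.aws/aws-observability/aws-for-fluent-bit:2.20.0
kubectl -n logging rollout status daemonset/fluent-bit
```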
@byrneo Please try:
Hey folks, I had another idea on this one. If anyone is willing, please try out the following image and let me know what you see in the logs:
This image has the repo policy here and can be pulled from any AWS account.
I think this patch might reduce the frequency with which you see these errors.
Same problem here.
Out of 10 Fluent Bit pods, 8 were stuck with this error. Restarting the pods magically fixed everything, but only for a while.