fluent-bit: Storage backlog chunk validation failure on restart in 2.1.X (data loss)

Bug Report

When Fluent Bit 2.1.X restarts, storage backlog chunks fail validation and are discarded, so the buffered data is lost.

To Reproduce

Here is a Docker Compose project that shows the expected behavior in version 2.0.11 and the unexpected behavior in version 2.1.2:

https://github.com/amorey/flb-backlog-bug

  • Example log messages:
[2023/04/27 06:11:37] [ info] [input:storage_backlog:storage_backlog.1] register tcp.0/1-1682575884.551412366.flb
[2023/04/27 06:11:37] [error] [input:storage_backlog:storage_backlog.1] chunk validation failed, data might be corrupted. No valid records found, the chunk will be discarded.
[2023/04/27 06:11:37] [error] [input:storage_backlog:storage_backlog.1] removing chunk tcp.0:1-1682575884.551412366.flb from the queue
  • Steps to reproduce the problem:
  1. Send a message to Fluent Bit that results in a flush failure and creates a new pending task
  2. Restart Fluent Bit
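
For step 1, a minimal sketch of sending one record to a Fluent Bit tcp input. The chunk names in the logs (tcp.0) suggest a tcp input is in use. The host, the port 5170, and the newline-terminated JSON payload are assumptions here, not values taken from the linked project, and the helper names build_record and send_record are hypothetical. The flush failure itself has to come from the output configuration (for example an unreachable destination); this script only delivers the record.

```python
import json
import socket


def build_record(message: str) -> bytes:
    """Encode one newline-terminated JSON record (assumed input format)."""
    return (json.dumps({"message": message}) + "\n").encode("utf-8")


def send_record(payload: bytes, host: str = "127.0.0.1", port: int = 5170) -> None:
    """Open a TCP connection, write the payload, and close the connection.

    host and port are assumptions; adjust them to the tcp input's settings.
    """
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(payload)
```

Calling send_record(build_record("hello backlog")) while the output cannot flush should leave a pending task on disk, after which Fluent Bit can be restarted for step 2.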

Expected behavior

On restart, previously pending tasks should be added to the storage backlog queue:

[2023/04/27 06:10:54] [ info] [input:storage_backlog:storage_backlog.1] register tcp.0/1-1682575828.625496799.flb
[2023/04/27 06:10:54] [ info] [input:storage_backlog:storage_backlog.1] queueing tcp.0:1-1682575828.625496799.flb
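
To tell the two outcomes apart in a captured log, a small sketch like the following could help. It relies only on the message wording quoted in this report ("register", "queueing", "removing chunk ... from the queue"); the function name classify_chunks and the status labels are hypothetical, and other Fluent Bit versions may word these messages differently.

```python
import re

# Patterns based on the storage_backlog log lines quoted in this report.
# The input name and chunk file are separated by "/" in "register" lines
# and by ":" in the other lines, so both separators are accepted.
EVENTS = [
    ("registered", re.compile(r"register (\S+?)[/:](\S+\.flb)")),
    ("queued", re.compile(r"queueing (\S+?)[/:](\S+\.flb)")),
    ("discarded", re.compile(r"removing chunk (\S+?)[/:](\S+\.flb) from the queue")),
]


def classify_chunks(log_text: str) -> dict:
    """Return the last observed status for each chunk seen in the log."""
    status = {}
    for line in log_text.splitlines():
        for label, pattern in EVENTS:
            match = pattern.search(line)
            if match:
                # Normalize the key to "<input>:<chunk file>".
                status[f"{match.group(1)}:{match.group(2)}"] = label
                break
    return status
```

A chunk that ends as "queued" matches the expected behavior; one that ends as "discarded" matches the failure described above.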

Screenshots

See https://github.com/amorey/flb-backlog-bug for log snippets.

Your Environment

Additional context

This bug causes data loss on system restarts whenever tasks are still pending.

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 2
  • Comments: 18

Most upvoted comments

Thank you very much for taking the time to share your results. I don’t know the exact ETA for the release but I think it will be sooner rather than later, hopefully within this week.

I’ll send an update as soon as I have a proper ETA.

Hi @anosulchik, there is a PR for this issue that’s about to be merged and I think there will be a release early this week.