fluent-bit: scheduler corruption on high number of retries

Originally reported in #1950 by @rmacian.

[2020/02/14 15:07:30] [debug] [storage] [cio file] alloc_size from 36864 to 102400
[2020/02/14 15:07:30] [debug] [storage] [cio file] synced at: tail.4/1-1581692850.200312113.flb
[2020/02/14 15:07:30] [debug] [in_tail] file=/var/log/containers/purchase-event-manager-5-vmmb9_gvp_purchase-event-manager-02060e929d3e460da6ac351689cd2f419d6ba192cced026a458110b095d93812.log read=32689 lines=71
[2020/02/14 15:07:30] [debug] [task] destroy task=0x7f14075faee0 (task_id=33)
[2020/02/14 15:07:30] [debug] [storage] tail.0:1-1581692755.636380503.flb mapped OK
[2020/02/14 15:07:30] [debug] [task] created task=0x7f14075faee0 id=33 OK
[2020/02/14 15:07:30] [debug] [out_forward] request 143790 bytes to flush
[2020/02/14 15:07:30] [debug] [out_fw] 154 entries tag='kube.var.log.containers.message-manager-north-5-f4cvk_tid_message-manager-north-4f3d19b958f53e2dc6e59555ba887569d0153f63a399ef1ba8d80b61f873e316.log' tag_len=148
[2020/02/14 15:07:30] [debug] [task] created task=0x7f140743ef40 id=126 OK
[2020/02/14 15:07:30] [debug] [out_forward] request 1239 bytes to flush
[2020/02/14 15:07:30] [debug] [out_fw] 1 entries tag='kube.var.log.containers.epg-agent-wait-2-46fc8_gvp_epg-agent-wait-e41122f15d432269e0156784949f781883d92669b0c5592300be5123a6060544.log' tag_len=134
[2020/02/14 15:07:30] [ warn] [task] retry for task 7 could not be re-scheduled
[2020/02/14 15:07:30] [debug] [retry] task retry=0x7f140740bae0, invalidated from the scheduler
[2020/02/14 15:07:30] [debug] [task] destroy task=0x7f14074400c0 (task_id=7)
[engine] caught signal (SIGSEGV)
#0  0x5653af67bebb      in  __mk_list_del() at lib/monkey/include/monkey/mk_core/mk_list.h:87
#1  0x5653af67bef2      in  mk_list_del() at lib/monkey/include/monkey/mk_core/mk_list.h:93
#2  0x5653af67c779      in  flb_sched_request_destroy() at src/flb_scheduler.c:314
#3  0x5653af67c8c7      in  flb_sched_event_handler() at src/flb_scheduler.c:375
#4  0x5653af67a17a      in  flb_engine_start() at src/flb_engine.c:548
#5  0x5653af5e8cc6      in  main() at src/fluent-bit.c:854
#6  0x7f14088b72e0      in  ???() at ???:0
#7  0x5653af5e71a9      in  ???() at ???:0
#8  0xffffffffffffffff  in  ???() at ???:0

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

There are two problems:

  1. Your Fluent Bit instance is hitting network issues when writing to the output destination; you can see that in the error message “[out_fw] error writing content body”. Because those writes fail, your filesystem queue fills up quickly, which is expected. You have to troubleshoot why the remote endpoint is dropping the connection or, if you have a load balancer in between, check what is going on there.

  2. The second problem is clearly a bug in Fluent Bit that only surfaces under the conditions described in 1. I am investigating the code to see how it can be fixed.
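
For readers trying to follow the trace: the crash happens inside a list removal called from flb_sched_request_destroy(), right after the log reports that the retry was "invalidated from the scheduler". One plausible reading, which is an assumption and not something this thread confirms, is that a request already detached from its list gets unlinked a second time. The sketch below illustrates that general failure mode and a defensive "unlink at most once" pattern. It uses a hypothetical kernel-style intrusive list; `struct node`, `struct request`, and every function in it are made up for illustration and are not the Monkey mk_list or Fluent Bit scheduler API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical intrusive list node (kernel-style); NOT the mk_list API. */
struct node {
    struct node *prev;
    struct node *next;
};

static void list_init(struct node *head)
{
    head->prev = head;
    head->next = head;
}

static void list_add_tail(struct node *entry, struct node *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

static int list_is_empty(const struct node *head)
{
    return head->next == head;
}

/*
 * Unlink the entry and mark it as detached by clearing its pointers.
 * Returns 1 if it was unlinked now, 0 if it was already detached.
 * Without the NULL check, a second call would dereference NULL here.
 */
static int list_del_once(struct node *entry)
{
    if (entry->prev == NULL || entry->next == NULL) {
        return 0;
    }
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    entry->prev = NULL;
    entry->next = NULL;
    return 1;
}

/* Hypothetical scheduler request that lives on a queue. */
struct request {
    int task_id;
    struct node link;
};

static struct request *request_create(struct node *queue, int task_id)
{
    struct request *r = calloc(1, sizeof(*r));
    if (r == NULL) {
        return NULL;
    }
    r->task_id = task_id;
    list_add_tail(&r->link, queue);
    return r;
}

/* Invalidate: detach from the queue but keep the memory alive. */
static int request_invalidate(struct request *r)
{
    return list_del_once(&r->link);
}

/* Destroy: detach only if still linked, then free. */
static int request_destroy(struct request *r)
{
    int unlinked = list_del_once(&r->link);
    free(r);
    return unlinked;
}

/* Invalidate, then destroy the same request: two teardown paths, one unlink. */
static int demo_double_teardown(void)
{
    struct node queue;
    struct request *r;
    int first;
    int second;

    list_init(&queue);
    r = request_create(&queue, 7);
    if (r == NULL) {
        return -1;
    }
    first = request_invalidate(r);  /* unlinks the entry */
    second = request_destroy(r);    /* sees it is detached, only frees */
    assert(list_is_empty(&queue));
    return first * 10 + second;
}
```

Calling `demo_double_teardown()` runs both teardown paths on the same request; only the first one touches the queue. Whether the actual fix in Fluent Bit takes this shape is not stated here; the sketch only shows why removing an entry twice from an intrusive list is dangerous.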