fluent-bit: Windows Server 2019 fluent-bit.exe, version: 1.9.10.0 crash.

Bug Report

Describe the bug fluent-bit.exe, version: 1.9.10.0 process crashes from time to time when under load.

Faulting application name: fluent-bit.exe, version: 1.9.10.0, time stamp: 0x6382da67
Faulting module name: fluent-bit.exe, version: 1.9.10.0, time stamp: 0x6382da67
Exception code: 0xc0000005
Fault offset: 0x00000000001a2de3
Faulting process id: 0x940
Faulting application start time: 0x01d9151d667819e2
Faulting application path: C:\Program Files\fluent-bit\bin\fluent-bit.exe
Faulting module path: C:\Program Files\fluent-bit\bin\fluent-bit.exe
Report Id: 90a4fd94-383a-4e20-8579-3de4edff1c07
Faulting package full name: 
Faulting package-relative application ID: 

Steps to reproduce the problem:

  1. Fluent-bit 1.9.10 installed on Server 2019 (vanilla setup). Configuration #FLUENT-BIT 1.9.10 ##myconf.conf xxx [SERVICE] Flush 5 Log_Level Trace Daemon false Log_File C:\Program Files\fluent-bit\log\fluent-bit.log Storage.path C:\Program Files\fluent-bit\buffer Storage.sync normal Storage.checksum false Storage.backlog.mem_limit 5M @INCLUDE C:\Program Files\fluent-bit\conf\include_dummy.conf @INCLUDE C:\Program Files\fluent-bit\conf\include_os_metrics.conf xxx

##include_dummy.conf xxx [INPUT] Name tail Tag windows.dummy.${HOSTNAME} #Insert the log name below Path C:\Program Files\fluent-bit\log\access_log_20221221-111926.log Read_from_Head false Path_Key filename Storage.type filesystem DB winlog.sqlite Mem_Buf_Limit 50MB [OUTPUT] Name forward Match windows.dummy.${HOSTNAME} Upstream C:\Program Files\fluent-bit\conf\upstream.conf storage.total_limit_size 50M alias windows_agents_dummy xxx

##include_os_metrics.conf xxx [INPUT] Name windows_exporter_metrics Tag windows.osmetrics.${HOSTNAME} scrape_interval 2 alias windows_osmetrics [OUTPUT] Name forward Match windows.osmetrics.${HOSTNAME} Upstream C:\Program Files\fluent-bit\conf\upstream.conf storage.total_limit_size 50M alias windows_osmetrics xxx

##upstream.conf xxx [UPSTREAM] name myupstream [NODE] name myupstreamnode1 host 192.168.1.2 port 24224 xxx

  1. Another Server 2019 machine configuration (ip = 192.168.1.2, vanilla setup) #FLUEND 4.4.2 ======
<source> @type forward </source> <match windows.**> @type stdout </match> ======
  1. C:\Program Files\fluent-bit\log\access_log_20221221-111926.log file records generated with fake generator and the command is -> apache-fake-log-gen.py -n 0 -o LOG (more data from here https://github.com/mihkels/fake-log-generator)

Expected behavior Should not crash

Screenshots No screenshots

Your Environment

**Last messages from fluent-bit.log ** [2022/12/22 13:57:05] [debug] [input chunk] update output instances with new chunk size diff=69632 [2022/12/22 13:57:05] [debug] [input chunk] chunk 2368-1671710225.978788500.flb update plugin forward.0 fs_chunks_size by 69632 bytes, the current fs_chunks_size is 905091 bytes [2022/12/22 13:57:05] [debug] [input:tail:tail.0] 0 new files found on path ‘C:\Program Files\fluent-bit\log\access_log_20221221-111926.log’ [2022/12/22 13:57:05] [debug] [task] destroy task=000001F220F93DD0 (task_id=1) [2022/12/22 13:57:05] [debug] [input chunk] remove chunk 2368-1671710217.599708800.flb with 7914 bytes from plugin forward.1, the updated fs_chunks_size is 3957 bytes [2022/12/22 13:57:05] [debug] [input chunk] chunk 2368-1671710225.978788500.flb required 308 bytes and 49094909 bytes left in plugin forward.0 [2022/12/22 13:57:15] [debug] [input chunk] chunk 2368-1671710225.978788500.flb update plugin forward.0 fs_chunks_size by -23646 bytes, the current fs_chunks_size is 881445 bytes [2022/12/22 13:57:15] [debug] [task] created task=000001F220F93ED0 id=1 OK [2022/12/22 13:57:15] [debug] [output:forward:windows_agents_tanium] request 45923 bytes to flush [2022/12/22 13:57:15] [debug] [output:forward:windows_osmetrics] request 3957 bytes to flush [2022/12/22 13:57:15] [debug] [output:forward:windows_agents_tanium] task_id=1 assigned to thread #0 [2022/12/22 13:57:15] [debug] [upstream] KA connection #1752 to 192.168.50.149:24224 has been assigned (recycled) [2022/12/22 13:57:15] [debug] [upstream] KA connection #1480 to 192.168.50.195:24224 has been assigned (recycled) [2022/12/22 13:57:15] [debug] [task] created task=000001F21F3AAE40 id=2 OK [2022/12/22 13:57:15] [debug] [upstream] KA connection #1752 to 192.168.50.149:24224 is now available [2022/12/22 13:57:15] [debug] [out flush] cb_destroy coro_id=9609 [2022/12/22 13:57:15] [debug] [upstream] KA connection #1480 to 192.168.50.195:24224 is now available [2022/12/22 13:57:15] [debug] [output:forward:windows_osmetrics] task_id=2 assigned to thread #1 [2022/12/22 13:57:15] [debug] [out flush] cb_destroy coro_id=9603 [2022/12/22 13:57:15] [debug] [input chunk] chunk 2368-1671710235.800341500.flb required 3957 bytes and 49996043 bytes left in plugin forward.1 [2022/12/22 13:57:15] [debug] [input chunk] update output instances with new chunk size diff=3957 [2022/12/22 13:57:15] [debug] [input chunk] chunk 2368-1671710235.800341500.flb update plugin forward.1 fs_chunks_size by 3957 bytes, the current fs_chunks_size is 7914 bytes [2022/12/22 13:57:15] [debug] [input chunk] chunk 2368-1671710235.801076400.flb required 45558 bytes and 49118555 bytes left in plugin forward.0 [2022/12/22 13:57:15] [debug] [input chunk] update output instances with new chunk size diff=69632 [2022/12/22 13:57:15] [debug] [input chunk] chunk 2368-1671710235.801076400.flb update plugin forward.0 fs_chunks_size by 69632 bytes, the current fs_chunks_size is 951077 bytes [2022/12/22 13:57:15] [debug] [task] destroy task=000001F21F3AAE40 (task_id=2) [2022/12/22 13:57:15] [debug] [input chunk] remove chunk 2368-1671710225.977901900.flb with 3957 bytes from plugin forward.1, the updated fs_chunks_size is 3957 bytes

WinDbg output Microsoft ® Windows Debugger Version 10.0.25200.1003 AMD64 Copyright © Microsoft Corporation. All rights reserved.

Loading Dump File [D:\DO_NOT_DELETE\CASES\00804083_Fluent_Bit\CRASH\fluent-bit.exe_221222_135706.dmp] User Mini Dump File with Full Memory: Only application data is available

Comment: ’ *** procdump64.exe -ma -e fluent-bit.exe *** Unhandled exception: C0000005.ACCESS_VIOLATION’

************* Path validation summary ************** Response Time (ms) Location Deferred srv* Symbol search path is: srv* Executable search path is: Windows 10 Version 17763 MP (4 procs) Free x64 Product: Server, suite: TerminalServer DataCenter SingleUserTS Edition build lab: 17763.1.amd64fre.rs5_release.180914-1434 Machine Name: Debug session time: Thu Dec 22 13:57:08.000 2022 (UTC + 2:00) System Uptime: 1 days 22:53:36.323 Process Uptime: 1 days 2:37:08.000 … Loading unloaded module list … This dump file has an exception of interest stored in it. The stored exception information can be accessed via .ecxr. (940.110): Access violation - code c0000005 (first/second chance not available) For analysis of this file, run !analyze -v fluent_bit!snprintf+0x27fe3: 00007ff6df492de3 4d0fbe10 movsx r10,byte ptr [r8] ds:000001f2215c003f=?? 0:004> !analyze -v


  •                                                                         *
    
  •                    Exception Analysis                                   *
    
  •                                                                         *
    

KEY_VALUES_STRING: 1

Key  : AV.Fault
Value: Read

Key  : Analysis.CPU.mSec
Value: 905

Key  : Analysis.DebugAnalysisManager
Value: Create

Key  : Analysis.Elapsed.mSec
Value: 15739

Key  : Analysis.IO.Other.Mb
Value: 14

Key  : Analysis.IO.Read.Mb
Value: 0

Key  : Analysis.IO.Write.Mb
Value: 19

Key  : Analysis.Init.CPU.mSec
Value: 171

Key  : Analysis.Init.Elapsed.mSec
Value: 4262

Key  : Analysis.Memory.CommitPeak.Mb
Value: 81

Key  : Timeline.OS.Boot.DeltaSec
Value: 168816

Key  : Timeline.Process.Start.DeltaSec
Value: 95828

Key  : WER.OS.Branch
Value: rs5_release

Key  : WER.OS.Timestamp
Value: 2018-09-14T14:34:00Z

Key  : WER.OS.Version
Value: 10.0.17763.1

Key  : WER.Process.Version
Value: 1.9.10.0

FILE_IN_CAB: fluent-bit.exe_221222_135706.dmp

COMMENT:
*** procdump64.exe -ma -e fluent-bit.exe *** Unhandled exception: C0000005.ACCESS_VIOLATION

NTGLOBALFLAG: 0

APPLICATION_VERIFIER_FLAGS: 0

CONTEXT: (.ecxr) rax=00000000000cbf44 rbx=000000906baff9a0 rcx=000000906baff9a0 rdx=000000906baff950 rsi=000001f220dece30 rdi=000000906baff950 rip=00007ff6df492de3 rsp=000000906baff8f0 rbp=000000906baff920 r8=000001f2215c003f r9=0000000000000030 r10=00007ff6df2f0000 r11=00007ff6df2f0000 r12=0000000000000000 r13=00007ff6df3f4cb0 r14=000001f2215c003f r15=000001f220d841c0 iopl=0 nv up ei pl nz na pe nc cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010200 fluent_bit!snprintf+0x27fe3: 00007ff6df492de3 4d0fbe10 movsx r10,byte ptr [r8] ds:000001f2215c003f=?? Resetting default scope

EXCEPTION_RECORD: (.exr -1) ExceptionAddress: 00007ff6df492de3 (fluent_bit!snprintf+0x0000000000027fe3) ExceptionCode: c0000005 (Access violation) ExceptionFlags: 00000000 NumberParameters: 2 Parameter[0]: 0000000000000000 Parameter[1]: 000001f2215c003f Attempt to read from address 000001f2215c003f

PROCESS_NAME: fluent-bit.exe

READ_ADDRESS: 000001f2215c003f

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%p referenced memory at 0x%p. The memory could not be %s.

EXCEPTION_CODE_STR: c0000005

EXCEPTION_PARAMETER1: 0000000000000000

EXCEPTION_PARAMETER2: 000001f2215c003f

STACK_TEXT:
000000906baff8f0 00007ff6df48bf9b : 000001f220d841c0 000000906baff9a0 0000000000000000 0000000000000000 : fluent_bit!snprintf+0x27fe3 000000906baff930 00007ff6df39896a : 0000000000000000 0000044e847301fc 000000906baffa30 00007ff6df3f66fb : fluent_bit!snprintf+0x2119b 000000906baff980 00007ff6df3f3b06 : 000001f220f847b0 000001f220dece30 000001f220dece30 0000000000000000 : fluent_bit!flb_time_now+0x14aaa 000000906baffa00 00007ff6df3f2e09 : 00000000ffffffff 000000906baffb91 000001f220d84d30 00007ff6df902798 : fluent_bit!sprintf+0x1a996 000000906baffb10 00007ff6df3a4f81 : 0000000000000000 0000000000000000 000001f21f3d57e0 0000000000000001 : fluent_bit!sprintf+0x19c99 000000906baffbf0 00007ff6df81036e : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : fluent_bit!sscanf+0x5c91 000000906baffca0 00007fff872a1aa1 : 00007ff6df3a4ec0 0000000000000000 0000000000000000 0000000000000000 : fluent_bit!mk_main+0x343ee 000000906baffcd0 00007fff8afe03fa : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : KERNELBASE!BaseFiberStart+0x21 000000906baffd00 0000000000000000 : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : ntdll!RtlUserFiberStart+0x1a

STACK_COMMAND: ~4s; .ecxr ; kb

SYMBOL_NAME: fluent_bit+27fe3

MODULE_NAME: fluent_bit

IMAGE_NAME: fluent-bit.exe

FAILURE_BUCKET_ID: INVALID_POINTER_READ_c0000005_fluent-bit.exe!Unknown

OS_VERSION: 10.0.17763.1

BUILDLAB_STR: rs5_release

OSPLATFORM_TYPE: x64

OSNAME: Windows 10

IMAGE_VERSION: 1.9.10.0

FAILURE_ID_HASH: {c553cc7a-b095-dfc3-fbc1-9d5091316b7c}

Followup: MachineOwner

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 29 (6 by maintainers)

Most upvoted comments

reproduced on 2.0.9

I have been investigating this issue as we have been observing it since updating from v1.9.8 to major version 2.

I have reason to believe the issue is resolved in the 2.0 branch (eventually to be v2.0.12). While we were investigating off v2.0.10, we happened to notice that the culprit of the access violation is the buffer address in the event chunk. This buffer address is read from the input chunk during the output flush process.

I think this commit fixes the issue: https://github.com/fluent/fluent-bit/commit/a792c10740cb79115547465e5ce0eea50037be63

@leonardo-albertovich helped me understand some of the history and In 1.9.9, chunkio was updated to use mapped files on Windows. With mapped files on Windows, sometimes locking the chunk causes the file to be remapped. Because the chunk lock used to happen after reading the buffer data, it lead to a sporadic TOCTOU bug as the buffer data address no longer has input chunk data in it (the mapping has changed). Locking the chunk before getting the buffer address resolves that issue. (Leonardo feel free to correct me if I got any info wrong).

If any folks are still affected by this, I would recommend trying to reproduce off the 2.0 branch if possible, or wait for 2.0.12 and try to reproduce then.

@MrTomasz we picked out that commit for investigation after looking at the dump. unfortunately, the dump we have is from calyptia and the pdb in the fluent-bit zip doesn’t work with it. Depending on how things come tomorrow, I might be able to generate my own pdb, at which point I should be able to see if what we are seeing in that commit is a red herring or not.

For me - it does crash when using filesystem storage for the input tail logs, however when using memory-based - it doesn’t crash and works well, but other things happen (see #6748 ).

From what I’ve seen, after some time, processing logic hits some sort of a msgpack’ed buffer misalignment when unpacking msgpack’ed log buffers into json (I use Loki output). The read pointer got eventually misaligned (wrong parsing?) and msgpack unpacker started misaligned parsing which ended up in crash as it hit allocated buffer boundary, resulting in ACCESS_VIOLATION crash. It basically stopped hitting MSGPACK_ARRAY objects and started parsing an MSGPACK_ARRAY object internals, trying to find the MSGPACK_ARRAY itself.

I have updated msgpack to latest v5.0.0, but it didn’t help.

@douglasawh I don’t know yet how the commit you’ve mentioned (onigmo lib update) is affecting this, but I’ll try reverting it, just because I ran out of options.