fluent-bit: chunkio: corrupt files are not removed after format check failure
Bug Report
Describe the bug
This is the same problem as #2472: fluent-bit logs a large number of “format check failed” errors.
fluent-bit ends up having a lot of corrupt files (I had more than 6,000 files in three weeks before I manually deleted them). I don’t think files should break so often, but that’s another issue.
The exact steps to reproduce this are still unknown.
To Reproduce
- Example log message if applicable:
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/153481-1635124800.968626242.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/176355-1634936209.242908772.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/12048-1635050818.818578912.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/176355-1634958778.842931521.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/358011-1635205582.68650748.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/176355-1634948938.792934702.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/176355-1634949038.342960395.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/12048-1635068888.54486252.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/153481-1635127142.218614831.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/12048-1635060898.868574974.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/176355-1634954400.142932279.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/358011-1635197997.218599676.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/153481-1635126782.593821399.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/12048-1635076412.268625949.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/358011-1635205497.218642296.flb
- Steps to reproduce the problem: not known
Expected behavior
fluent-bit deletes a chunk file once its format check has failed, instead of leaving it on disk.
Screenshots
Your Environment
- Version used: 1.8.9
- Configuration: invoked by google-cloud-ops-agent
- Environment name and version (e.g. Kubernetes? What version?): Google Compute Engine
- Server type and version:
- Operating System and version: Debian 11
- Filters and plugins: in_tail
[2021/11/03 20:00:14] [ info] [storage] version=1.1.5, initializing...
[2021/11/03 20:00:14] [ info] [storage] root path '/var/lib/google-cloud-ops-agent/fluent-bit/buffers'
[2021/11/03 20:00:14] [ info] [storage] normal synchronization mode, checksum enabled, max_chunks_up=128
[2021/11/03 20:00:14] [ info] [storage] backlog input plugin: storage_backlog.3
[2021/11/03 20:00:14] [ info] [cmetrics] version=0.2.2
[2021/11/03 20:00:14] [ info] [input:storage_backlog:storage_backlog.3] queue memory limit: 47.7M
Additional context
I think there’s a case that’s not handled by the chunk io library.
mmap_file emits the “format check failed” message and returns CIO_CORRUPTED. There’s a path from flb_engine_dispatch > flb_input_chunk_flush > cio_chunk_up > cio_file_up > _cio_file_up > mmap_file.
_cio_file_up only handles the failure when the returned value is exactly CIO_ERROR; in this case it is CIO_CORRUPTED, so the error goes unhandled. The result is propagated back to flb_engine_dispatch as NULL, which makes the flush essentially a no-op, and the same corrupt entry is processed again. (There’s another path to mmap_file through cio_chunk_get_content, but again the caller only checks whether the return value is -1.)
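The gap described above can be sketched in isolation. This is a minimal, hypothetical model and not chunkio’s actual code: the numeric values of the return codes and the helpers fake_mmap_file, up_checks_error_only, and up_handles_corrupted are assumptions made purely for illustration.

```c
#include <stdio.h>

/* Return codes named after chunkio's; the numeric values are assumptions. */
#define CIO_OK          0
#define CIO_ERROR      -1
#define CIO_CORRUPTED   1

/* Hypothetical stand-in for mmap_file(): reports a failed format check. */
static int fake_mmap_file(int corrupt)
{
    if (corrupt) {
        fprintf(stderr, "[error] [storage] format check failed\n");
        return CIO_CORRUPTED;
    }
    return CIO_OK;
}

/* The pattern described in this report: only CIO_ERROR is handled. */
static int up_checks_error_only(int corrupt, int *removed)
{
    int ret = fake_mmap_file(corrupt);

    if (ret == CIO_ERROR) {
        return CIO_ERROR;
    }
    (void) removed;   /* the corrupt file is never touched */
    return ret;       /* CIO_CORRUPTED passes through unhandled */
}

/* One possible alternative: handle CIO_CORRUPTED explicitly. */
static int up_handles_corrupted(int corrupt, int *removed)
{
    int ret = fake_mmap_file(corrupt);

    if (ret == CIO_CORRUPTED) {
        *removed = 1;   /* a real fix might unlink the chunk file here */
    }
    return ret;
}
```

With an explicit CIO_CORRUPTED branch, the caller has a clear point at which to delete or quarantine the file, so the same entry would not be picked up again on every pass.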
About this issue
- State: closed
- Created 3 years ago
- Comments: 16 (8 by maintainers)
fluent-bit exits immediately when you have corrupt files.