vector: Unknown disk buffer limitation
Problem
Hi,
We have multiple Vector instances sending their data to another, central Vector instance where I do some transformations. Basically, N small Vector instances send their requests to one big instance.
I have configured a large disk buffer to ensure no data is lost. Everything was working fine until 19:42, when everything suddenly stopped. The process itself was still up, but nothing was being processed anymore.
Vector logs were showing errors like this:
Jan 31 19:42:13 vector-s3writer-bm vector[88628]: 2024-01-31T19:42:13.252148Z ERROR source{component_kind="source" component_id=requests10 component_type=vector}:grpc-request{grpc_service="vector.Vector" grpc_method="PushEvents"}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=2798 reason="Source send cancelled." internal_log_rate_limit=true
$ ls -lah /var/lib/vector/buffer/v2/requests-vector-internal/
total 1.3G
drwxr-xr-x 2 vector vector 4.0K Jan 31 19:41 .
drwxr-xr-x 9 vector vector 4.0K Jan 30 15:19 ..
-rw-r----- 1 vector vector 127M Jan 30 15:19 buffer-data-0.dat
-rw-r----- 1 vector vector 126M Jan 30 15:19 buffer-data-1.dat
-rw-r----- 1 vector vector 127M Jan 30 15:19 buffer-data-10.dat
-rw-r----- 1 vector vector 126M Jan 30 15:19 buffer-data-12.dat
-rw-r----- 1 vector vector 128M Jan 30 15:19 buffer-data-2.dat
-rw-r----- 1 vector vector 127M Jan 30 15:19 buffer-data-20.dat
-rw-r----- 1 vector vector 125M Jan 30 15:19 buffer-data-23.dat
-rw-r----- 1 vector vector 126M Jan 30 15:19 buffer-data-3.dat
-rw-r----- 1 vector vector 128M Jan 30 15:19 buffer-data-4.dat
-rw-r----- 1 vector vector 127M Jan 31 19:41 buffer-data-65534.dat
-rw-r----- 1 vector vector   24 Jan 31 19:52 buffer.db
-rw-r--r-- 1 vector vector    0 Jan 31 19:52 buffer.lock
Vector stopped working at exactly the time it would have reached file number 65535, which sounds like it hit a configured limit.
A restart of Vector didn't solve the issue. I had to delete the buffer files and start Vector again to bring it back.
I hope that helps to find the root cause.
Regards,
Michael
Configuration
requests-vector-internal:
  type: vector
  compression: true
  healthcheck: false
  request:
    retry_attempts: 100
    timeout_secs: 60
    retry_max_duration_secs: 10
  batch:
    max_bytes: 10000000
    max_events: 30000
    timeout_secs: 10
  buffer:
    type: "disk"
    max_size: 159684354880
    when_full: "drop_newest"
  inputs:
    - requests10
  address: "127.0.0.1:9011"
Version
$ vector --version
vector 0.34.2 (x86_64-unknown-linux-gnu d685a16 2024-01-02 14:59:54.890517436)
Additional Context
I assume the reason is this:
pub fn get_data_file_path(&self, file_id: u16) -> PathBuf {
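For context, here is a minimal sketch of what I assume this function does, based on the file names seen on disk. This is not the actual Vector code; the Ledger struct and its data_dir field are made up for illustration. The point is that a u16 file ID can only ever produce 65536 distinct file names.

use std::path::PathBuf;

// Illustration only -- not Vector's implementation; field and type names are guesses.
struct Ledger {
    data_dir: PathBuf,
}

impl Ledger {
    // Presumably maps a file ID to its on-disk name, e.g. buffer-data-65534.dat.
    // Because file_id is a u16, only 65536 distinct names can ever be produced.
    pub fn get_data_file_path(&self, file_id: u16) -> PathBuf {
        self.data_dir.join(format!("buffer-data-{}.dat", file_id))
    }
}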
About this issue
- Original URL
- State: open
- Created 5 months ago
- Reactions: 7
- Comments: 16 (6 by maintainers)
I think there are multiple circumstances.
The listing above was copied (note the old dates) from a high-throughput Vector instance. On new sinks the filenames start with buffer-data-1.dat, buffer-data-2.dat, etc.
Files that are successfully processed are deleted and disappear. However, the filename counter keeps increasing, and even after restarts it continues to grow. This means that after around 8 TB of written buffer files, Vector stops working even though there might be only 2-3 active files in the folder.
From my observations and local tests I can confirm that this is always the case. As soon as it reaches file number 65534, it stops working.
The only way to recover Vector is to delete the buffer folder entirely and restart the process, which comes with data loss.
Having this behavior documented would bring transparency, but wouldn't solve the problem in my opinion.
I think we have to differentiate the root causes a little bit. If we agree on how to handle them, I'm happy to provide a PR.
https://github.com/vectordotdev/vector/blob/8d897af2f621a0402678141a3a94e1196ea56037/lib/vector-buffers/src/variants/disk_v2/ledger.rs#L122
The file ID counter is defined as u16 and therefore ends at 65535, regardless of how many files are in the folder. After 8 TB of written buffer data (128 MB file size * 65535 files), Vector stops processing.
A possible solution is to change u16 to u32 and emit warning messages when the counter gets close to the limit. It's important not to confuse this limit with having 65k active files in the folder: the issue appears even if only one buffer file exists, because the counter increments by 1 for every file processed.
An alternative solution is to use random strings or UUIDs instead of sequential numbers.
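To make the arithmetic and the failure mode concrete, here is a minimal sketch (not Vector's actual implementation; the struct, field, and constant names are made up) of a forward-only u16 file ID counter and the roughly 8 TB write budget it implies:

// Illustration only -- not Vector's code. Shows why a forward-only u16 file ID
// exhausts after ~8 TB of buffer writes, even with few files on disk at once.
const MAX_DATA_FILE_SIZE: u64 = 128 * 1024 * 1024; // 128 MB per data file

struct FileIdCounter {
    next_id: u16, // the proposed fix is to widen this to u32
}

impl FileIdCounter {
    // Hands out the next data file ID; never reuses the IDs of deleted files.
    // Returns None once the u16 space is exhausted, which is where the buffer
    // appears to stall in practice.
    fn next(&mut self) -> Option<u16> {
        let id = self.next_id;
        self.next_id = self.next_id.checked_add(1)?;
        Some(id)
    }
}

fn main() {
    // 65,535 usable IDs * 128 MB per file is just under 8 TiB of total writes.
    let total_bytes = u64::from(u16::MAX) * MAX_DATA_FILE_SIZE;
    println!(
        "~{:.1} TiB of buffer data can be written before the counter runs out",
        total_bytes as f64 / (1u64 << 40) as f64
    );

    let mut counter = FileIdCounter { next_id: 0 };
    let mut issued = 0u32;
    while counter.next().is_some() {
        issued += 1;
    }
    println!("{issued} file IDs issued before the counter stopped");
}

For comparison, widening the counter to u32 pushes the same arithmetic to roughly 512 PiB of total writes before exhaustion, which is effectively unreachable in practice.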
As described in 1), Vector stops working with the buffer completely even though not everything was processed. This might happen on InvalidProtobufPayload as well, I assume.
@hillmandj can you share some data to reproduce the issue you're facing? Maybe we can write a small reproducer which helps to find the root cause.
I’ll have to defer to @jszwedko here.
Our internal team structure around Vector has changed recently, so it’s harder to figure out where time can be budgeted.
It’s a limit on how many buffer data files can exist on disk at any moment. If you had a sink, for example, that was permanently blocked trying to send, and so the buffer just continued to grow and grow… it would be limited to 8TB (again, based on the maximum number of data files times their maximum allowed size.)
No. File IDs are incremented monotonically until they wrap at the limit. This means that even if the first file, buffer-data-0.dat, is fully processed and can be deleted, the next data file won't reuse that file ID, but will instead roll over to buffer-data-1.dat, and so on.
Well, yes. This is a bug, and we obviously don't intend for users to encounter bugs and have to do a bunch of manual remediation, especially when that remediation involves potentially losing data.
@m1cha3lf @jszwedko @zamazan4ik we are experiencing this, and other erratic behavior with disk buffers at scale. We also notice that sometimes when a buffer gets corrupted, new pods that spin up to try to pick up where it left off show this error:
The only way to fix it is to do what @m1cha3lf said: delete the buffer and restart Vector. What are your recommendations for mitigating this? We are considering switching to memory buffers to avoid this, but that would come at the expense of log durability, and we're not sure that will fit our business use case.