vector: Unknown buffer disk limitation

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Hi,

We have multiple Vector instances which send their data to another, central Vector instance where I do some transformations. Basically, N small Vector instances send their requests to one big instance.

I have configured a large disk buffer to ensure no data is lost. Everything was working fine until 19:42, when everything suddenly stopped. The process itself was still up, but nothing was being processed anymore.

The Vector logs were showing errors like this:

Jan 31 19:42:13 vector-s3writer-bm vector[88628]: 2024-01-31T19:42:13.252148Z ERROR source{component_kind="source" component_id=requests10 component_type=vector}:grpc-request{grpc_service="vector.Vector" grpc_method="PushEvents"}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=2798 reason="Source send cancelled." internal_log_rate_limit=true

$ ls -lah /var/lib/vector/buffer/v2/requests-vector-internal/
total 1.3G
drwxr-xr-x 2 vector vector 4.0K Jan 31 19:41 .
drwxr-xr-x 9 vector vector 4.0K Jan 30 15:19 ..
-rw-r----- 1 vector vector 127M Jan 30 15:19 buffer-data-0.dat
-rw-r----- 1 vector vector 126M Jan 30 15:19 buffer-data-1.dat
-rw-r----- 1 vector vector 127M Jan 30 15:19 buffer-data-10.dat
-rw-r----- 1 vector vector 126M Jan 30 15:19 buffer-data-12.dat
-rw-r----- 1 vector vector 128M Jan 30 15:19 buffer-data-2.dat
-rw-r----- 1 vector vector 127M Jan 30 15:19 buffer-data-20.dat
-rw-r----- 1 vector vector 125M Jan 30 15:19 buffer-data-23.dat
-rw-r----- 1 vector vector 126M Jan 30 15:19 buffer-data-3.dat
-rw-r----- 1 vector vector 128M Jan 30 15:19 buffer-data-4.dat
-rw-r----- 1 vector vector 127M Jan 31 19:41 buffer-data-65534.dat
-rw-r----- 1 vector vector   24 Jan 31 19:52 buffer.db
-rw-r--r-- 1 vector vector    0 Jan 31 19:52 buffer.lock

The time Vector stopped working is exactly the time it would have reached file number 65535, which sounds like a configured limit was reached.

Restarting Vector didn't solve the issue. I had to delete the buffer files and start again to bring it back.

I hope that helps to find the root cause.

Regards,

Michael

Configuration

requests-vector-internal:
    type: vector
    compression: true
    healthcheck: false
    request:
      retry_attempts: 100
      timeout_secs: 60
      retry_max_duration_secs: 10
    batch:
      max_bytes: 10000000
      max_events: 30000
      timeout_secs: 10
    buffer:
      type: "disk"
      max_size: 159684354880
      when_full: "drop_newest"
    inputs:
      - requests10
    address: "127.0.0.1:9011"

Version

$ vector --version
vector 0.34.2 (x86_64-unknown-linux-gnu d685a16 2024-01-02 14:59:54.890517436)

Debug Output

No response

Example Data

No response

Additional Context

I assume the reason is this:

pub fn get_data_file_path(&self, file_id: u16) -> PathBuf {
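For illustration, a minimal sketch of what a function with this signature presumably does, assuming the file name is formatted directly from the u16 ID (the actual Vector implementation may differ):

use std::path::{Path, PathBuf};

// Sketch only: derive the data file name from the u16 file ID.
// Because the ID is a u16, at most 65536 distinct names can ever be produced.
fn get_data_file_path(data_dir: &Path, file_id: u16) -> PathBuf {
    data_dir.join(format!("buffer-data-{}.dat", file_id))
}

fn main() {
    let dir = Path::new("/var/lib/vector/buffer/v2/requests-vector-internal");
    println!("{}", get_data_file_path(dir, 65534).display());
    // -> /var/lib/vector/buffer/v2/requests-vector-internal/buffer-data-65534.dat
}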

References

No response

About this issue

  • Original URL
  • State: open
  • Created 5 months ago
  • Reactions: 7
  • Comments: 16 (6 by maintainers)

Most upvoted comments

I think there are multiple circumstances at play.

drwxr-xr-x 2 vector vector 4.0K Feb  7 09:07 .
drwxr-xr-x 8 vector vector 4.0K Jan 29 22:26 ..
-rw-r----- 1 vector vector 124M Feb  7 09:08 buffer-data-33925.dat
-rw-r----- 1 vector vector   24 Feb  7 09:08 buffer.db
-rw-r--r-- 1 vector vector    0 Feb  7 09:08 buffer.lock

This is copied (note the old date) from a high-throughput Vector instance. On new sinks the file names start with buffer-data-1.dat, buffer-data-2.dat, etc.

The files which are successfully processed are deleted and disappear. However, the file name counter keeps increasing, and it continues to grow even across restarts. This means that after roughly 8 TB of written buffer files, Vector stops working even though there might be only 2-3 active files in the folder.

From my observations and local tests I can confirm that this is always the case. As soon as it reaches file number 65534, it stops working.

  1. Once Vector has reached its maximum file number, it stops processing new events without any warning or error. On restart there is also no message indicating an issue; it just spins at 100% CPU usage.

The only way to recover Vector is to delete the entire buffer folder and restart the process, which comes with data loss.

Having this behavior documented would bring transparency but wouldn’t solve this in my opinion.

I think we have to differentiate the root causes a little bit. If we agree on how to handle them, I'm happy to provide a PR.

  1. Vector stops working after reaching buffer file number 65535. This one is pretty isolated and, in my opinion, easy to fix.

https://github.com/vectordotdev/vector/blob/8d897af2f621a0402678141a3a94e1196ea56037/lib/vector-buffers/src/variants/disk_v2/ledger.rs#L122

The file ID counter is defined as a u16 and therefore ends at 65535, regardless of how many files are in the folder. After 8 TB of written buffer data (128 MB file size * 65535), Vector stops processing.

A possible solution is to change the u16 to a u32 and emit warning messages when the counter gets close to the limit. It's important not to confuse this limit with having 65k active files in the folder: the issue appears even if only one buffer file exists, because the counter increments by one for every file processed.

An alternative solution is to use random strings or UUIDs instead of numbers.
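To put numbers on the limit, here is a small illustration of the lifetime-write cap implied by each counter width, assuming the 128 MB maximum data file size mentioned above (a sketch, not Vector code):

fn main() {
    const MAX_FILE_SIZE: u64 = 128 * 1024 * 1024; // assumed maximum data file size (128 MiB)

    // u16 counter: 65536 possible file IDs before the counter runs out.
    let u16_ids = u16::MAX as u64 + 1;
    println!(
        "u16: {} IDs -> ~{} TiB of lifetime buffer writes",
        u16_ids,
        u16_ids * MAX_FILE_SIZE / (1 << 40)
    );

    // u32 counter: the same lifetime cap, pushed out by a factor of 65536.
    let u32_ids = u32::MAX as u64 + 1;
    println!(
        "u32: {} IDs -> ~{} PiB of lifetime buffer writes",
        u32_ids,
        u32_ids * MAX_FILE_SIZE / (1 << 50)
    );
}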

  2. Data loss on buffer files after corruption

As described in 1), Vector stops working with the buffer entirely even though not everything was processed. I assume this might also happen on InvalidProtobufPayload.

@hillmandj can you share some data to reproduce the issue you’re facing? Maybe we can write a small reproducer which helps to find the root cause.

I’ll have to defer to @jszwedko here.

Our internal team structure around Vector has changed recently, so it’s harder to figure out where time can be budgeted.

So the 8 TB limit is essentially a cap on the internal Vector buffer database / state machine, right?

It’s a limit on how many buffer data files can exist on disk at any moment. If you had a sink, for example, that was permanently blocked trying to send, and so the buffer just continued to grow and grow… it would be limited to 8TB (again, based on the maximum number of data files times their maximum allowed size.)

So if the files keep increasing, does that mean that our buffers (despite being configured way lower) are never flushing?

No. File IDs are incremented monotonically until they wrap at the limit. This means that even if the first file – buffer-data-0.dat – is fully processed and can be deleted… the next data file won’t reuse that file ID, but will instead roll over to buffer-data-1.dat, and so on.
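To make the non-reuse concrete, here is a toy model of that progression (a sketch only, not Vector's actual logic): the ID only ever moves forward, so the ID of a fully processed and deleted file is never handed out again.

// Toy model: the writer always takes the next ID, even if earlier
// data files have already been read and deleted from disk.
fn next_file_id(current: u16) -> u16 {
    current.wrapping_add(1) // only wraps back to 0 after 65535
}

fn main() {
    let mut id: u16 = 0;
    for _ in 0..3 {
        println!("writing buffer-data-{}.dat", id);
        // ...the sink finishes reading the file and it is deleted from disk...
        id = next_file_id(id); // freed IDs are not reused; the counter just advances
    }
}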

It appears that what we need here is a way to avoid creating a FATAL state in which a disk buffer has to be wiped entirely for the pod to function. Buffer corruption errors should be handled, or at least users should be given a choice of how to handle them (i.e. crash, delete the whole buffer, or, even better, delete only the file(s) that are corrupted).
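For example, such a choice could be modeled as a policy the buffer reader consults when it hits a corrupted record. This is a hypothetical sketch with made-up names, not an existing Vector API or option:

// Hypothetical sketch of a user-selectable corruption policy (not existing Vector code).
enum CorruptionPolicy {
    Crash,      // today's effective behavior: fail to build the sink's buffer
    DropBuffer, // wipe the whole buffer directory and start empty
    DropFile,   // skip only the corrupted data file(s) and keep the rest
}

fn handle_corrupt_file(policy: &CorruptionPolicy, file: &str) {
    match policy {
        CorruptionPolicy::Crash => panic!("corrupted buffer file: {file}"),
        CorruptionPolicy::DropBuffer => println!("deleting the entire buffer directory"),
        CorruptionPolicy::DropFile => println!("skipping corrupted file {file} and continuing"),
    }
}

fn main() {
    handle_corrupt_file(&CorruptionPolicy::DropFile, "buffer-data-42.dat");
}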

Well, yes. This is a bug, and we obviously don’t intend for users to encounter bugs and have to do a bunch of manual remediation, especially when that remediation involves potentially losing data.

@m1cha3lf @jszwedko @zamazan4ik we are experiencing this, and other erratic behavior with disk buffers at scale. We also notice that sometimes when a buffer gets corrupted, new pods that spin up to try to pick up where it left off show this error:

2024-02-12T05:58:37.912771Z ERROR vector::topology::builder: Configuration error. error=Sink "datadog_egress": error occurred when building buffer: failed to build individual stage 0: failed to seek to position where reader left off: failed to decoded record: InvalidProtobufPayload

The only way to fix it is to do what @m1cha3lf said: delete the buffer and restart Vector. What are your recommendations for mitigating this? We are considering switching to memory buffers to avoid it, but that would come at the expense of log durability, and we're not sure that would fit our business use case.