vector: S3 sink broken on FreeBSD when using buffer.type = "disk"
Hi, I’m trying out Vector 0.9.2 on FreeBSD 12.1. With this config:
```toml
[sources.my_file]
type = "file"
file_key = ""
include = ["/svc/vector/test.json"]
oldest_first = true
max_line_bytes = 1000000 # 1 MB
max_read_bytes = 1000000 # 1 MB

[transforms.my_json_parser]
type = "json_parser"
inputs = ["my_file"]

[sinks.my_s3]
type = "aws_s3"
inputs = ["my_json_parser"]
bucket = "my.bucket"
key_prefix = "my/"
batch.timeout_secs = 10
encoding.codec = "ndjson"
compression = "gzip"
region = "eu-west-1"
filename_append_uuid = false
filename_extension = "json.gz"
filename_time_format = "%Y/%m/%d/%H%M%S"
```
everything works fine, but when I add:

```toml
buffer.type = "disk"
buffer.max_size = 10490000
buffer.when_full = "block"
```
Vector immediately starts using 100% CPU, no events are added to any batch (I’m running with -vv), and it hangs on shutdown when sent TERM (it has to be KILLed).
As an immediate workaround, I guess I don’t need a disk buffer when ingesting from a log file (i.e. I assume Vector’s file checkpointing takes into account whether events have actually been submitted successfully)?
I can set up SSH access to a test box if that helps.
cheers!
About this issue
- State: closed
- Created 4 years ago
- Comments: 29 (19 by maintainers)
The answer here is unfortunately somewhat complicated, so apologies in advance 😄
First of all, we agree that the behavior you describe would be ideal. We plan to implement that behavior, but there are a variety of complications we are still figuring out how we’d like to handle (e.g. sinks that flush data only periodically, sinks that have no explicit flush, sources that provide no ability to ack, transforms that intentionally drop events in between sources and sinks, etc).
In the meantime, we essentially provide two different modes of operation per sink. The default mode, with small in-memory buffers, is based around backpressure. In this mode, sinks “pull” data through the pipeline at whatever speed they’re able. This flows all the way upstream to sources, so a file source (for example) would be reading and checkpointing at roughly the same throughput as the downstream sink is sending. This prevents issues where there is a large gap between data that has been ingested and data that’s been fully processed. It’s not as precise as the ideal behavior, but it provides some of the same benefits.
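To make that concrete, here’s roughly what the default mode looks like if you spell it out on your sink. This is a minimal sketch: the memory-buffer option names come from the buffer docs, and the 500-event default is an assumption on my part, not something you need to set.

```toml
# Default mode: a small in-memory buffer, so backpressure propagates
# from the S3 sink back to the file source's reads and checkpoints.
[sinks.my_s3.buffer]
type = "memory"      # the default buffer type
max_events = 500     # assumed default; kept small so backpressure kicks in quickly
when_full = "block"  # block upstream rather than drop events
```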
The second mode uses our disk buffers. The purpose of the disk buffers is to absorb and smooth out ingestion throughput variability in cases where backpressure would lead to data loss. A good example here is a UDP syslog source, where we have no way to signal the upstream system to slow down and need to simply accept the data as quickly as we can. If you’re using something like the file source, however, disk buffers are very likely redundant (unless your files are very aggressively rotated).
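For contrast, here’s a sketch of the kind of setup where a disk buffer earns its keep, assuming a UDP syslog source. The component names, bucket, and listen address are placeholders; the option names are taken from the syslog source and buffer docs.

```toml
# UDP gives us no way to push back on the sender, so a disk buffer
# absorbs bursts instead of forcing us to drop data under load.
[sources.my_syslog]
type = "syslog"
mode = "udp"
address = "0.0.0.0:514"       # placeholder listen address

[sinks.my_s3]
type = "aws_s3"
inputs = ["my_syslog"]
bucket = "my.bucket"          # placeholder
region = "eu-west-1"
encoding.codec = "ndjson"
buffer.type = "disk"
buffer.max_size = 104900000   # bytes on disk; size to your expected burst
buffer.when_full = "block"
```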
Hm, I did manage to replicate the issue (using the test.json file).