vector: S3 sink broken on FreeBSD when using buffer.type = "disk"
Hi, I’m trying out Vector 0.9.2 on FreeBSD 12.1. With this config:
```toml
[sources.my_file]
type = "file"
file_key = ""
include = ["/svc/vector/test.json"]
oldest_first = true
max_line_bytes = 1000000 # 1 MB
max_read_bytes = 1000000 # 1 MB

[transforms.my_json_parser]
type = "json_parser"
inputs = ["my_file"]

[sinks.my_s3]
type = "aws_s3"
inputs = ["my_json_parser"]
bucket = "my.bucket"
key_prefix = "my/"
batch.timeout_secs = 10
encoding.codec = "ndjson"
compression = "gzip"
region = "eu-west-1"
filename_append_uuid = false
filename_extension = "json.gz"
filename_time_format = "%Y/%m/%d/%H%M%S"
```
everything works fine, but when I add:

```toml
buffer.type = "disk"
buffer.max_size = 10490000
buffer.when_full = "block"
```
Vector immediately starts using 100% CPU, no events are added to any batch (I’m running with -vv), and it hangs on shutdown when sent TERM (it has to be KILLed).
As an immediate workaround, I guess I don’t need a disk buffer when ingesting from a log file (i.e. I assume Vector’s file checkpointing takes into account whether events have actually been submitted successfully)?
I can set up SSH access to a test box if that helps.
cheers!
About this issue
- State: closed
- Created 4 years ago
- Comments: 29 (19 by maintainers)
The answer here is unfortunately somewhat complicated, so apologies in advance 😄
First of all, we agree that the behavior you describe would be ideal. We plan to implement that behavior, but there are a variety of complications we are still figuring out how we’d like to handle (e.g. sinks that flush data only periodically, sinks that have no explicit flush, sources that provide no ability to ack, transforms that intentionally drop events in between sources and sinks, etc).
In the meantime, we essentially provide two different modes of operation per sink. The default mode, with small in-memory buffers, is based around backpressure. In this mode, sinks “pull” data through the pipeline at whatever speed they’re able. This flows all the way upstream to sources, so a file source (for example) would be reading and checkpointing at roughly the same throughput as the downstream sink is sending. This prevents issues where there is a large gap between data that has been ingested and data that’s been fully processed. It’s not as precise as the ideal behavior, but it provides some of the same benefits.
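To make that concrete, here’s roughly what the default mode looks like if you spell it out on your sink. This is a minimal sketch: the memory-buffer option names come from the buffer docs, and the 500-event default is an assumption on my part, not something you need to set.

```toml
# Default mode: a small in-memory buffer, so backpressure propagates
# from the S3 sink back to the file source's reads and checkpoints.
[sinks.my_s3.buffer]
type = "memory"      # the default buffer type
max_events = 500     # assumed default; kept small so backpressure kicks in quickly
when_full = "block"  # block upstream rather than drop events
```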
The second mode uses our disk buffers. The purpose of the disk buffers is to absorb and smooth out ingestion throughput variability in cases where backpressure would lead to data loss. A good example here is a UDP syslog source, where we have no way to signal the upstream system to slow down and need to simply accept the data as quickly as we can. If you’re using something like the file source, however, disk buffers are very likely redundant (unless your files are very aggressively rotated).
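For contrast, here’s a sketch of the kind of setup where a disk buffer earns its keep, assuming a UDP syslog source. The component names, bucket, and listen address are placeholders; the option names are taken from the syslog source and buffer docs.

```toml
# UDP gives us no way to push back on the sender, so a disk buffer
# absorbs bursts instead of forcing us to drop data under load.
[sources.my_syslog]
type = "syslog"
mode = "udp"
address = "0.0.0.0:514"       # placeholder listen address

[sinks.my_s3]
type = "aws_s3"
inputs = ["my_syslog"]
bucket = "my.bucket"          # placeholder
region = "eu-west-1"
encoding.codec = "ndjson"
buffer.type = "disk"
buffer.max_size = 104900000   # bytes on disk; size to your expected burst
buffer.when_full = "block"
```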
Hm, I did manage to replicate the issue (using the test.json file).