vector: Possible Memory Leak with File source and Kafka sink
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Vector Version
vector 0.19.0 (x86_64-unknown-linux-gnu da60b55 2021-12-28)
Vector Configuration File
[sources.http_REDACTED]
data_dir = "/var/opt/vector/dbs"
type = "file"
include = ["/var/opt/logs-http-cf/json/REDACTED/*/*/*.json"]
read_from = "beginning"
ignore_older_secs = 43200
remove_after_secs = 1
oldest_first = true
ignore_checkpoints = true
fingerprint.strategy = "device_and_inode"
[sinks.kafka_REDACTED]
type = "kafka"
inputs = [ "http_REDACTED" ]
bootstrap_servers = "kafka-brokers-list"
message_timeout_ms = 300_000
socket_timeout_ms = 60_000
topic = "REDACTED"
compression = "gzip"
encoding.codec = "text"
buffer.max_events = 1000
Debug Output
Expected Behavior
Vector should periodically release memory back to the system
Actual Behavior
Vector's memory usage increases over time
Additional Context
Hello, we are migrating some simple data pipelines from logstash to vector. These pipelines just read json files and send the events in them to a kafka topic; the files are cloudflare http requests and firewall events logs, so we have somewhere around 15 to 20 pipelines.
At the moment we have migrated two of those pipelines using the configuration shared above, with one `.toml` file for each source, and it is working as expected. The only issue we found is that the memory usage of the vector process keeps increasing over time, and if we let the service run for a couple of days it will eventually consume all the memory on the server.
The logs are collected by custom python scripts that only download the json files and put them in a folder for vector to consume; the scripts are run from crontab every minute, and vector runs as a systemd service.
Vector is the only service (besides the system services) running on this server, and the load and cpu usage are pretty low. The server runs on GCP with 8 vCPUs and 8 GB (7.63 GB) of memory. With vector stopped the memory usage is around 900 MB; when we start vector the memory starts increasing and is only released when we restart the vector service. We tried a reload with `kill -1 PID`, but it didn't have any effect.
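For what it's worth, this kind of growth is easy to track from `/proc`; the following is a small sketch (a hypothetical helper script, not part of the setup described above) that reads the resident set size of a process the same way tools like `ps` do:

```python
# Hypothetical helper: report the VmRSS of a process from /proc/<pid>/status.
from pathlib import Path

def parse_vmrss_kb(status_text: str) -> int:
    """Extract the VmRSS value (in kB) from the contents of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like: "VmRSS:    914432 kB"
            return int(line.split()[1])
    raise ValueError("VmRSS not found (swapped-out process or kernel thread)")

def rss_of(pid: int) -> int:
    """Return the current resident set size of `pid` in kB."""
    return parse_vmrss_kb(Path(f"/proc/{pid}/status").read_text())
```

Logging `rss_of(vector_pid)` from cron alongside the download scripts would give a per-minute growth curve to attach to a report like this one.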
References
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 2
- Comments: 19 (10 by maintainers)
We believe this to be closed by https://github.com/vectordotdev/vector/pull/18634
I’ll close this issue since we are tracking in https://github.com/vectordotdev/vector/issues/11995 . Please follow along there.
Hi @leandrojmp ,
Gotcha, this does sound like it is the same issue as https://github.com/vectordotdev/vector/issues/11995 then, thanks for verifying! We plan to address that issue in the coming quarter so it should resolve this too.
Hi @leandrojmp ,
I think this might actually be the same issue as #11995. Do you observe `internal_metrics_cardinality_total` monotonically increasing?

Hello @jszwedko !
Let me try to give more context about this data.
data
The source of this data is logs from Cloudflare HTTP requests, which Cloudflare sends to buckets in a cloud service. Those files are downloaded to the server, a CentOS 8 VM, by a custom python script scheduled in crontab; the glob mirrors the structure of the buckets.
I have to collect logs from multiple companies, and each company can have multiple domains; the structure of the path used in the glob is like the following:
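(The concrete path example was not preserved here; as a rough illustration, the layout implied by the globs used below would look something like this, with all directory and file names being placeholders:)

```text
/var/opt/logs-http-cf/json/
├── companyA/
│   ├── domain1.example/        # one directory per domain (hypothetical)
│   │   └── 20220101/           # one directory per batch/date (hypothetical)
│   │       └── batch-0001.json
│   └── domain2.example/
└── companyB/
    └── ...
```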
configuration
Since I’m coming from a Logstash background, I tried to replicate the config file organization I had.
In logstash I used the `pipelines.yml` to configure one pipeline for each company. In this example the glob in the configuration for `companyA` would be `/var/opt/logs-http-cf/json/companyA/*/*/*.json` and for `companyB` it would be `/var/opt/logs-http-cf/json/companyB/*/*/*.json`; I'm following the same approach with vector configurations.

And since vector does not have anything like the `pipelines.yml` used by logstash, to keep using one configuration per company I've created different `.toml` files, one for each company, changing only the input and sink names and the glob, as in the example below:

companyA.toml
pipelines.yml
used by logstash, to keep using one configuration per company I’ve created different.toml
files, one for each company just changing the input and sink names and the glob, as the example below:companyA.toml
The `companyB.toml` would be the same, just changing every `companyA` to `companyB`.

I'm running just one vector instance, as a systemd service, with the following configuration in the `ExecStart`.

Inside `/etc/vector/pipelines/` I have `companyA.toml` and `companyB.toml`. It is also running the dummy configuration in `/etc/vector/vector.toml`; I tried to remove it, but vector does not start without it.

files and document
The python script runs every minute and downloads an average of ~500 files with a combined size of around ~200 MB; the files have one json document per line.
The average line size is `2.5 KB`, and this is an example of what the lines look like:

If you need more information about the format of the documents, it can be found in the cloudflare documentation.
I'm not doing any parsing with vector, just reading the lines and sending them to kafka; the parsing is still done by Logstash consuming from kafka.
At the moment I have a total of 40 different pipelines, 20 for HTTP Requests and 20 for Firewall Events, all with similar document sizes; just 2 of those pipelines are running on vector and the rest are still on logstash.
I'm planning to keep the same organization I have with logstash, one configuration file for each pipeline, so there will be at least 40 `.toml` files in the config path after I migrate everything. (Does this make any difference for vector?)

After vector reads the files and sends their content to kafka, it deletes the files from disk.
The processing of the files is really fast. When I migrated those first 2 pipelines from Logstash to Vector, I had a backlog of files matching the globs, around 50k files; vector had no issue processing them and there was no sudden memory increase. The only issue at the moment is that the memory keeps increasing over time.
debug
To run vector with `valgrind` I would need to check with the rest of the security team if we can do this test.

Also, I don't know if it would help, but I could generate a `strace` output for the vector process.

I'm running vector directly on the host; it was installed using the `x86_64` rpm package and I just updated to version `0.19.1`.

I also could migrate more of my pipelines from Logstash to Vector to see if this has any influence on the memory increase.
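One way to watch the `internal_metrics_cardinality_total` counter @jszwedko asked about is to expose vector's own metrics over HTTP. A minimal sketch, assuming the `internal_metrics` source and `prometheus_exporter` sink that ship with vector 0.19 (component names and the address are arbitrary, not part of the pipelines above):

```toml
# Hypothetical add-on config, e.g. /etc/vector/pipelines/metrics.toml
[sources.vector_metrics]
type = "internal_metrics"

[sinks.metrics_exporter]
type = "prometheus_exporter"
inputs = ["vector_metrics"]
address = "127.0.0.1:9598"
```

With this loaded, something like `curl -s 127.0.0.1:9598/metrics | grep internal_metrics_cardinality_total`, repeated over a few hours, would show whether the metric cardinality grows together with the memory.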
Hope the explanation helps!