vector: Possible Memory Leak with File source and Kafka sink

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Vector Version

vector 0.19.0 (x86_64-unknown-linux-gnu da60b55 2021-12-28)

Vector Configuration File

[sources.http_REDACTED]
data_dir = "/var/opt/vector/dbs"
type = "file"
include = ["/var/opt/logs-http-cf/json/REDACTED/*/*/*.json"]
read_from = "beginning"
ignore_older_secs = 43200
remove_after_secs = 1
oldest_first = true
ignore_checkpoints = true
fingerprint.strategy = "device_and_inode"

[sinks.kafka_REDACTED]
type = "kafka"
inputs = [ "http_REDACTED" ]
bootstrap_servers = "kafka-brokers-list"
message_timeout_ms = 300_000
socket_timeout_ms = 60_000
topic = "REDACTED"
compression = "gzip"
encoding.codec = "text"
buffer.max_events = 1000

Expected Behavior

Vector periodically releases memory.

Actual Behavior

Vector memory usage increases over time.

Additional Context

Hello, we are migrating some simple data pipelines from Logstash to Vector. These pipelines just read JSON files and send the events they contain to a Kafka topic. The files are Cloudflare HTTP request and firewall event logs, so we have somewhere around 15 to 20 pipelines.

So far we have migrated two of those pipelines using the configuration shared above, with one .toml file per source, and they work as expected. The only issue we have found is that the memory usage of the Vector process keeps increasing over time; if we leave the service running for a couple of days, it will eventually consume all the memory on the server.

The logs are collected by custom Python scripts that only download the JSON files and put them in a folder for Vector to consume. The scripts are triggered by crontab every minute, and Vector runs as a systemd service.

Vector is the only service (besides system services) running on this server, and the load and CPU usage are pretty low. The server runs on GCP with 8 vCPUs and 8 GB of RAM (7.63 GB usable). With Vector stopped, memory usage is around 900 MB; when we start Vector, memory starts increasing and is only released when we restart the Vector service. We tried reloading with kill -1 PID (SIGHUP, which triggers a config reload), but it had no effect.

[screenshot: vector-leak, showing Vector memory usage increasing over time]

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 19 (10 by maintainers)

Most upvoted comments

I’ll close this issue since we are tracking this in https://github.com/vectordotdev/vector/issues/11995. Please follow along there.

Hi @leandrojmp ,

Gotcha, this does sound like it is the same issue as https://github.com/vectordotdev/vector/issues/11995 then, thanks for verifying! We plan to address that issue in the coming quarter so it should resolve this too.

Hi @leandrojmp ,

I think this might actually be the same issue as #11995. Do you observe the internal_metrics_cardinality_total metric monotonically increasing?
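
One way to check (a minimal sketch using Vector's built-in internal_metrics source and prometheus_exporter sink; the component names and listen address are illustrative):

# Expose Vector's own telemetry so internal_metrics_cardinality_total can be observed.
[sources.vector_internal]
type = "internal_metrics"

[sinks.vector_prometheus]
type = "prometheus_exporter"
inputs = ["vector_internal"]
address = "127.0.0.1:9598"  # illustrative listen address

Scraping that endpoint periodically should show whether the cardinality counter only ever grows.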

Hello @jszwedko !

Let me try to give more context about this data.

data

The source of this data is Cloudflare HTTP request logs, which Cloudflare pushes to buckets in a cloud service. Those files are downloaded to the server, a CentOS 8 VM, by a custom Python script scheduled in the crontab; the glob mirrors the directory structure of the buckets.

I have to collect logs from multiple companies, and each company can have multiple domains. The structure of the paths matched by the glob is like the following:

/var/opt/logs-http-cf/json/companyA/domainA/YYYYMMdd/*.json
/var/opt/logs-http-cf/json/companyA/domainB/YYYYMMdd/*.json
/var/opt/logs-http-cf/json/companyA/domainC/YYYYMMdd/*.json

/var/opt/logs-http-cf/json/companyB/domainX/YYYYMMdd/*.json
/var/opt/logs-http-cf/json/companyB/domainY/YYYYMMdd/*.json

configuration

Since I’m coming from a Logstash background, I tried to replicate the config file organization I had.

In Logstash I used pipelines.yml to configure one pipeline per company; in this example the glob in the configuration for companyA would be /var/opt/logs-http-cf/json/companyA/*/*/*.json and for companyB it would be /var/opt/logs-http-cf/json/companyB/*/*/*.json. I'm following the same approach with the Vector configurations.

And since Vector does not have anything like Logstash's pipelines.yml, to keep one configuration per company I've created separate .toml files, one per company, changing only the source and sink names and the glob, as in the example below:

companyA.toml

[sources.http_companyA]
data_dir = "/var/opt/vector/dbs"
type = "file"
include = ["/var/opt/logs-http-cf/json/companyA/*/*/*.json"]
read_from = "beginning"
ignore_older_secs = 43200
remove_after_secs = 1
oldest_first = true
ignore_checkpoints = true
fingerprint.strategy = "device_and_inode"

[sinks.kafka_companyA]
type = "kafka"
inputs = [ "http_companyA" ]
bootstrap_servers = "kafka-brokers-list"
message_timeout_ms = 300_000
socket_timeout_ms = 60_000
topic = "http-companyA"
compression = "gzip"
encoding.codec = "text"
buffer.max_events = 1000

The companyB.toml would be the same, just changing every companyA to companyB.

I'm running just one Vector instance, as a systemd service, with the following ExecStart:

ExecStart=/usr/bin/vector -c /etc/vector/pipelines/*.toml -qq

Inside /etc/vector/pipelines/ I have companyA.toml and companyB.toml. Vector is also loading the dummy configuration in /etc/vector/vector.toml; I tried to remove it, but Vector does not start without it, so I keep an inert placeholder there, like the sketch below.
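
A minimal sketch of such a placeholder, assuming a do-nothing pipeline is acceptable (the component names are illustrative):

# Inert /etc/vector/vector.toml: Vector requires at least one source and one sink.
[sources.placeholder_metrics]
type = "internal_metrics"

[sinks.placeholder_null]
type = "blackhole"
inputs = ["placeholder_metrics"]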

files and documents

The Python script runs every minute and downloads an average of ~500 files with a combined size of around ~200 MB; the files have one JSON document per line.

The average line size is 2.5 KB, and this is an example of what the lines look like:

{"BotScore":12,"BotScoreSrc":"Machine Learning","BotTags":["api"],"CacheCacheStatus":"dynamic","CacheResponseBytes":5439,"CacheResponseStatus":200,"CacheTieredFill":false,"ClientASN":"REDACTED","ClientCountry":"br","ClientDeviceType":"tablet","ClientIP":"REDACTED","ClientIPClass":"noRecord","ClientMTLSAuthCertFingerprint":"","ClientMTLSAuthStatus":"unknown","ClientRequestBytes":5753,"ClientRequestHost":"REDACTED","ClientRequestMethod":"GET","ClientRequestPath":"REDACTED","ClientRequestProtocol":"HTTP/2","ClientRequestReferer":"","ClientRequestScheme":"https","ClientRequestSource":"eyeball","ClientRequestURI":"REDACTED","ClientRequestUserAgent":"REDACTED","ClientSSLCipher":"AEAD-AES128-GCM-SHA256","ClientSSLProtocol":"TLSv1.3","ClientSrcPort":39008,"ClientTCPRTTMs":24,"ClientXRequestedWith":"","EdgeCFConnectingO2O":false,"EdgeColoCode":"SSA","EdgeColoID":416,"EdgeEndTimestamp":"2022-01-26T23:59:42Z","EdgePathingOp":"wl","EdgePathingSrc":"macro","EdgePathingStatus":"nr","EdgeRateLimitAction":"","EdgeRateLimitID":0,"EdgeRequestHost":"REDACTED","EdgeResponseBodyBytes":805,"EdgeResponseBytes":1854,"EdgeResponseCompressionRatio":4.63,"EdgeResponseContentType":"application/json","EdgeResponseStatus":200,"EdgeServerIP":"REDACTED","EdgeStartTimestamp":"2022-01-26T23:59:42Z","EdgeTimeToFirstByteMs":210,"FirewallMatchesActions":[],"FirewallMatchesRuleIDs":[],"FirewallMatchesSources":[],"OriginDNSResponseTimeMs":0,"OriginIP":"REDACTED","OriginRequestHeaderSendDurationMs":0,"OriginResponseBytes":0,"OriginResponseDurationMs":201,"OriginResponseHTTPExpires":"","OriginResponseHTTPLastModified":"","OriginResponseHeaderReceiveDurationMs":201,"OriginResponseStatus":200,"OriginResponseTime":201000000,"OriginSSLProtocol":"TLSv1.2","OriginTCPHandshakeDurationMs":0,"OriginTLSHandshakeDurationMs":0,"ParentRayID":"00","RayID":"REDACTED","SecurityLevel":"med","SmartRouteColoID":0,"UpperTierColoID":0,"WAFAction":"unknown","WAFFlags":"0","WAFMatchedVar":"","WAFProfile":"unknown","WAFRuleID":"","WAFRuleMessage":"","WorkerCPUTime":0,"WorkerStatus":"unknown","WorkerSubrequest":false,"WorkerSubrequestCount":0,"ZoneID":"REDACTED","ZoneName":"REDACTEDDOMAIN"}

If you need more information about the format of the documents, it can be found in the Cloudflare documentation.

I'm not doing any parsing with Vector, just reading the lines and sending them to Kafka; the parsing is still done by Logstash consuming from Kafka.

At the moment I have a total of 40 different pipelines, 20 for HTTP requests and 20 for firewall events, with similar document formats and sizes. Just 2 of those pipelines are running on Vector; the rest are still on Logstash.

I'm planning to keep the same organization I have with Logstash, one configuration file per pipeline, so there would be at least 40 .toml files in the config path after I migrate everything. (Does this make any difference for Vector?)

After Vector reads the files and sends their content to Kafka, it deletes them from disk (via remove_after_secs = 1).

The processing of the files is really fast. When I migrated those first 2 pipelines from Logstash to Vector, I had a backlog of around 50k files matching the globs; Vector had no issue processing them, and there was no sudden memory increase. The only issue at the moment is that memory keeps increasing over time.

debug

To run Vector with valgrind I would need to check with the rest of the security team whether we can do this test.

Also, I don't know if it would help, but I could generate strace output for the Vector process.

I'm running Vector directly on the host; it was installed using the x86_64 RPM package, and I just updated to version 0.19.1.

I could also migrate more of my pipelines from Logstash to Vector to see if that has any influence on the memory increase.

Hope the explanation helps!