vector: Observed performance degradation in >= 0.13

Reported in discord: https://discord.com/channels/742820443487993987/746070591097798688/852811460412309504

A user reported performance degradation when upgrading from 0.12 to 0.13 (and 0.14):

Hi all, we’ve been using Vector to get our production logs from Kafka topics to Elasticsearch (in AWS) for quite some time now, until yesterday on version 0.11.X. At peak, Vector reads and writes about 250K msgs/sec, but after upgrading to 0.14.0 it never went past 150K/s. We downgraded to 0.13.1, but that didn’t help either. Finally, 0.12.2 does provide the performance we had before. Any ideas what might be causing this? We didn’t change anything in the config or other setup between the up/downgrades.

Config:

data_dir = "/data/vector"

# input topic
[sources.kafka-in]
type = "kafka"
bootstrap_servers = "${KAFKA_BROKERS}"
group_id = "logstash.inventory"
topics = ["topic"]

[transforms.message-extractor]
type = "json_parser"
inputs = ["kafka-in"]
drop_field = true
drop_invalid = true
field = "message"

[sinks.elastic-out]
type = "elasticsearch"
inputs = ["message-extractor"]
compression = "gzip"
healthcheck = true
host = "https://search-inventory-3i.eu-central-1.es.amazonaws.com"
index = "topic"
auth.strategy = "aws"
query.filter_path = "-took,-items.index._index,-items.index._type"



##### monitoring #######

# https://github.com/timberio/vector/issues/4148 
[ sources.metrics ]
type = "internal_metrics"

[ sinks.prometheus ]
type = "prometheus" # required
inputs = ["metrics"] # required
address = "0.0.0.0:10000" # required

We are running on OpenShift 3.9. I first deployed yesterday’s 0.14.X-alpine image, then switched to 0.13.1-alpine, then to 0.12.2-alpine. AWS Elasticsearch is 7.8.0.

Kafka 2.6.1

Also, we’re running this on 6 pods with 3 CPUs and 4G RAM each for this topic, and the performance is >= 200K msgs/s.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 20 (20 by maintainers)

Most upvoted comments

@jszwedko Thanks for opening this issue! Here are just a few more details:

[Screenshot 2021-06-11 at 18:06:43: throughput graph]

The peak just before 15:00 marks the deployment of 0.14.X-alpine, the peak just before 15:30 marks 0.13.1-alpine, and the last peak after 16:00 marks the deployment with 0.12.2-alpine - notice the throughput plateau for the newer versions…

For completeness’ sake, our config now looks like this. Having the Kafka source decode the JSON also increased performance a little bit, but not as much as the changed *malloc…

data_dir = "/data/vector"

# topic
[sources.kafka-in]
type = "kafka"
bootstrap_servers = "${KAFKA_BROKERS}"
decoding.codec = "json"
group_id = "vector.inventory"
topics = ["topic"]

[sinks.elastic-out]
type = "elasticsearch"
inputs = ["kafka-in"]
compression = "gzip"
# force response compression
request.headers.Accept-Encoding = "gzip"
request.concurrency = "adaptive"
healthcheck = true
endpoint = "https://opensearch.inventory-logging.aws.cloud" 
bulk.index = "topic"
auth.strategy = "aws"
query.filter_path = "-took,-items.index._index,-items.index._type"

##### monitoring #######

# https://github.com/timberio/vector/issues/4148 
[ sources.metrics ]
type = "internal_metrics"

[ sinks.prometheus ]
type = "prometheus" # required
inputs = ["metrics"] # required
address = "0.0.0.0:10000" # required

Hey! Unfortunately we weren’t able to reproduce this, so we couldn’t bisect down to the source of the regression. I realize this is a lot to ask, but you could consider bisecting yourself, given you have an environment that manifests the issue. This would take the form of using Vector nightly builds between Vector 0.12 and Vector 0.13 to narrow it down (March 30th to April 29th): https://packages.timber.io/vector/nightly/. Would that be something you’d be open to trying? If you narrowed it down to a day, we should be able to help bisect it down further.
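
For anyone attempting that bisect, a small helper along these lines keeps the bookkeeping straight. The nightly image tag format used below (timberio/vector:nightly-YYYY-MM-DD-alpine) is an assumption; check https://packages.timber.io/vector/nightly/ (or the published Docker tags) for the actual artifact names before deploying anything.

from datetime import date, timedelta

# Manual bisect over the nightly range suggested above (2021-03-30 .. 2021-04-29).
good = date(2021, 3, 30)   # last known-good nightly (0.12-era, ~250K msgs/s)
bad = date(2021, 4, 29)    # first known-bad nightly (0.13-era, ~150K msgs/s)

while (bad - good) > timedelta(days=1):
    # Test the nightly roughly halfway between the last good and first bad build.
    candidate = good + (bad - good) / 2
    tag = f"timberio/vector:nightly-{candidate.isoformat()}-alpine"  # assumed tag format
    answer = input(f"Deploy {tag} -- did throughput regress? [y/n] ").strip().lower()
    if answer == "y":
        bad = candidate
    else:
        good = candidate

print(f"Regression first appears in the {bad.isoformat()} nightly (last good: {good.isoformat()}).")

Narrowing it down to a single nightly would let the maintainers bisect the individual commits from that day.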