vector: Memory leak with kafka source, http sink and 429 responses

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

We have a configuration with a kafka source and an http sink, with acknowledgements enabled. It works fine as long as the http sink receives only successful responses, but once the sink starts receiving 429 responses, memory keeps growing until Vector is killed by the OOM killer.

I also tried turning on allocation tracing, and the graphs suggest there is a memory leak in the kafka source.

Some graphs:

Memory usage from the Vector node: [image]

Internal allocation tracing, all components: [image]

Internal allocation tracing, only the kafka source: [image]

Rate of http sink requests by response code: [image]

Configuration

data_dir: /tmp/vector-data-dir
api:
  enabled: true
  address: "127.0.0.1:8686"
  playground: true
log_schema:
  host_key: host
  message_key: message
  source_type_key: source_type
  timestamp_key: timestamp
sources:
  main_input_kafka_src:
    type: kafka
    bootstrap_servers: "bootstrap.brokers.kafka:443"
    group_id: vector_test_local
    auto_offset_reset: earliest
    topics:
      - test-topic
    librdkafka_options:
      fetch.message.max.bytes: "10485760" # 10 MB
      fetch.max.bytes: "104857600" # 100 MB
sinks:
  test_sink:
    type: "http"
    inputs:
      - main_input_kafka_src
    uri: "http://localhost:8000"
    method: "post"
    acknowledgements:
      enabled: true
    buffer:
      type: "memory"
      max_events: 100000
      when_full: "block"
    batch:
      max_bytes: 20971520
      timeout_secs: 10
    encoding:
      codec: "json"
    request:
      concurrency: 250

Version

0.29.1

Debug Output

2023-05-02T05:33:20.934810Z  WARN sink{component_kind="sink" component_id=test_sink component_type=http component_name=test_sink}:request{request_id=817}: vector::sinks::util::retries: Retrying after response. reason=too many requests internal_log_rate_limit=true
2023-05-02T05:33:21.419138Z  WARN sink{component_kind="sink" component_id=test_sink component_type=http component_name=test_sink}:request{request_id=636}: vector::sinks::util::retries: Internal log [Retrying after response.] is being rate limited.
2023-05-02T05:33:31.622858Z  WARN sink{component_kind="sink" component_id=test_sink component_type=http component_name=test_sink}:request{request_id=284}: vector::sinks::util::retries: Internal log [Retrying after response.] has been rate limited 33 times.
2023-05-02T05:33:31.622918Z  WARN sink{component_kind="sink" component_id=test_sink component_type=http component_name=test_sink}:request{request_id=284}: vector::sinks::util::retries: Retrying after response. reason=too many requests internal_log_rate_limit=true

Example Data

No response

Additional Context

For an MRE, I created an HTTP server for the http sink with simple logic: while there are fewer than 20 concurrent requests, it sleeps for 40 seconds and then returns success; otherwise it returns 429 immediately. The content of the kafka topic is not important, there just needs to be enough of it. Vector is running in Kubernetes.
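A minimal sketch of such a test server, using only the Python standard library, is below. This is a reconstruction of the logic described above, not the reporter's actual code; the port comes from the sink uri in the configuration, and the thresholds come from the description.

import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

MAX_IN_FLIGHT = 20       # beyond this many concurrent requests, return 429 immediately
SLOW_RESPONSE_SECS = 40  # how long a "successful" request is held open

lock = threading.Lock()
in_flight = 0

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        global in_flight
        # Drain the request body so keep-alive connections stay usable.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))

        with lock:
            accepted = in_flight < MAX_IN_FLIGHT
            if accepted:
                in_flight += 1

        if not accepted:
            # Immediate 429 exercises Vector's "too many requests" retry path.
            self.send_response(429)
            self.end_headers()
            return

        try:
            time.sleep(SLOW_RESPONSE_SECS)  # hold the request open, then succeed
            self.send_response(200)
            self.end_headers()
        finally:
            with lock:
                in_flight -= 1

    def log_message(self, *args):
        pass  # keep stdout quiet under load

if __name__ == "__main__":
    # Port matches the sink's uri ("http://localhost:8000") in the configuration above.
    ThreadingHTTPServer(("127.0.0.1", 8000), Handler).serve_forever()

Pointing the test_sink uri at a server like this should reproduce the pattern in the request-rate graph: a handful of slow 200s and a steady stream of 429s that Vector keeps retrying.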

References

No response

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 24 (20 by maintainers)

Most upvoted comments

Hi @Ilmarii,

I believe this should be resolved in the latest version (v0.34.0). We fixed a memory leak in the http sink (https://github.com/vectordotdev/vector/pull/18637) that was triggered when the downstream service returned 429, and also refactored the Kafka source to better handle acknowledgements (https://github.com/vectordotdev/vector/pull/17497). Let us know if you still experience issues after upgrading.

Well, with acknowledgements disabled there is no memory leak 😃

@spencergilbert I built and tested Vector from spencer/improve-selects and the memory leak still exists.

Fairly certain I’ve improved this - confirming locally, and should have a PR to close it later today if testing goes well.

@Ilmarii would you be able to run a nightly version to see if my changes help enough in your environment?