vector: Memory leak with kafka source, http sink and 429 responses

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

We have a configuration with a kafka source and an http sink, with acknowledgements enabled. It works fine as long as the http sink receives only successful responses, but once the sink starts receiving 429 responses, memory keeps growing until Vector is killed by the OOM killer.

I also tried turning on allocation tracing, and the graphs suggest there is a memory leak in the kafka source.

Some graphs:

Memory usage from the Vector node: [image]

Internal allocation tracing, all components: [image]

Internal allocation tracing, only the kafka source: [image]

Rate of http sink requests by response code: [image]

Configuration

data_dir: /tmp/vector-data-dir
api:
  enabled: true
  address: "127.0.0.1:8686"
  playground: true
log_schema:
  host_key: host
  message_key: message
  source_type_key: source_type
  timestamp_key: timestamp
sources:
  main_input_kafka_src:
    type: kafka
    bootstrap_servers: "bootstrap.brokers.kafka:443"
    group_id: vector_test_local
    auto_offset_reset: earliest
    topics:
      - test-topic
    librdkafka_options:
      fetch.message.max.bytes: "10485760" # 10 MB
      fetch.max.bytes: "104857600" # 100 MB
sinks:
  test_sink:
    type: "http"
    inputs:
      - main_input_kafka_src
    uri: "http://localhost:8000"
    method: "post"
    acknowledgements:
      enabled: true
    buffer:
      type: "memory"
      max_events: 100000
      when_full: "block"
    batch:
      max_bytes: 20971520
      timeout_secs: 10
    encoding:
      codec: "json"
    request:
      concurrency: 250

Version

0.29.1

Debug Output

2023-05-02T05:33:20.934810Z  WARN sink{component_kind="sink" component_id=test_sink component_type=http component_name=test_sink}:request{request_id=817}: vector::sinks::util::retries: Retrying after response. reason=too many requests internal_log_rate_limit=true
2023-05-02T05:33:21.419138Z  WARN sink{component_kind="sink" component_id=test_sink component_type=http component_name=test_sink}:request{request_id=636}: vector::sinks::util::retries: Internal log [Retrying after response.] is being rate limited.
2023-05-02T05:33:31.622858Z  WARN sink{component_kind="sink" component_id=test_sink component_type=http component_name=test_sink}:request{request_id=284}: vector::sinks::util::retries: Internal log [Retrying after response.] has been rate limited 33 times.
2023-05-02T05:33:31.622918Z  WARN sink{component_kind="sink" component_id=test_sink component_type=http component_name=test_sink}:request{request_id=284}: vector::sinks::util::retries: Retrying after response. reason=too many requests internal_log_rate_limit=true

Example Data

No response

Additional Context

For an MRE, I created an HTTP server for the http sink with simple logic: while there are fewer than 20 concurrent requests, it sleeps for 40 seconds and then returns success; otherwise it returns 429 immediately. The content of the kafka topic is not important, there just needs to be enough of it. Vector is running in Kubernetes.
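A minimal sketch of such a test server, using only the Python standard library, is below. This is a reconstruction of the logic described above, not the reporter's actual code; the port comes from the sink uri in the configuration, and the thresholds come from the description.

import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

MAX_IN_FLIGHT = 20       # beyond this many concurrent requests, return 429 immediately
SLOW_RESPONSE_SECS = 40  # how long a "successful" request is held open

lock = threading.Lock()
in_flight = 0

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        global in_flight
        # Drain the request body so keep-alive connections stay usable.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))

        with lock:
            accepted = in_flight < MAX_IN_FLIGHT
            if accepted:
                in_flight += 1

        if not accepted:
            # Immediate 429 exercises Vector's "too many requests" retry path.
            self.send_response(429)
            self.end_headers()
            return

        try:
            time.sleep(SLOW_RESPONSE_SECS)  # hold the request open, then succeed
            self.send_response(200)
            self.end_headers()
        finally:
            with lock:
                in_flight -= 1

    def log_message(self, *args):
        pass  # keep stdout quiet under load

if __name__ == "__main__":
    # Port matches the sink's uri ("http://localhost:8000") in the configuration above.
    ThreadingHTTPServer(("127.0.0.1", 8000), Handler).serve_forever()

Pointing the test_sink uri at a server like this should reproduce the pattern in the request-rate graph: a handful of slow 200s and a steady stream of 429s that Vector keeps retrying.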

References

No response

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 24 (20 by maintainers)

Most upvoted comments

Hi @Ilmarii,

I believe this should be resolved in the latest version (v0.34.0). We fixed a memory leak in the http sink (https://github.com/vectordotdev/vector/pull/18637) that was triggered when the downstream service returned 429, and also refactored the Kafka source to better handle acknowledgements (https://github.com/vectordotdev/vector/pull/17497). Let us know if you still experience issues after upgrading.

Well, with acknowledgements disabled there is no memory leak 😃

@spencergilbert I built and tested Vector from spencer/improve-selects and the memory leak still exists.

Fairly certain I’ve improved this - confirming locally, and should have a PR to close it later today if testing goes well.

@Ilmarii would you be able to run a nightly version to see if my changes help enough in your environment?