gnmic: I think the Prometheus outputs have a memory leak that crashes the server they run on.

We have noticed strange behavior with any of the prometheus outputs: the RAM usage of gnmic just keeps going up and up until the server starts to swap and becomes unresponsive.

It works fine with around 20-30 devices, but with 400+ devices we start getting problems, and that is just half of the devices we have. I tried strings-as-labels both false and true. For the scrape-based prometheus output I tried lowering the cache timer, which does not help. For the remote write output I played with buffer-size, interval, and max-time-series-per-write; nothing helps. I do not have any event-processors.

If I use the Kafka or file outputs instead, I do not see a continuous increase in RAM usage.

CPU usage is ~25% (5-minute load average). Below are graphs of RAM usage and, under that, disk usage, where you can see that RAM stops increasing once we switch to the file output. At ~17:20 RAM usage went up a little; that was when we added the Kafka output.
[image: RAM and disk usage graphs]
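To quantify the growth independently of our dashboards, here is a minimal Go sketch that samples gnmic's resident set size from /proc every 30 seconds (Linux only; pass the PID reported by pidof gnmic). This is just a diagnostic helper I wrote, not part of gnmic:

// rsswatch.go - sample a process's resident set size from /proc (Linux).
// Usage: go run rsswatch.go <pid>
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

// rssKB returns the VmRSS line value from /proc/<pid>/status, e.g. "123456 kB".
func rssKB(pid string) (string, error) {
	f, err := os.Open("/proc/" + pid + "/status")
	if err != nil {
		return "", err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		if strings.HasPrefix(s.Text(), "VmRSS:") {
			return strings.TrimSpace(strings.TrimPrefix(s.Text(), "VmRSS:")), nil
		}
	}
	return "", fmt.Errorf("VmRSS not found in /proc/%s/status", pid)
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: rsswatch <pid>")
		os.Exit(1)
	}
	for {
		rss, err := rssKB(os.Args[1])
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Printf("%s rss=%s\n", time.Now().Format(time.RFC3339), rss)
		time.Sleep(30 * time.Second)
	}
}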

insecure: true
encoding: PROTO
log: true
gzip: false
timeout: 60s
debug: false

api-server:
  address: :7890
  skip-verify: true
  enable-metrics: true


loader:
  type: http
  timeout: 60s
  enable-metrics: true
  tls:
    skip-verify: true
  url: "https://gnmic-inventory/gnmic/"


subscriptions:
  juniper:
    stream-subscriptions:
      - paths:
        - "/junos/system/linecard/interface"
        - "/interfaces/interface/ethernet/state/counters"
        stream-mode: sample
        sample-interval: 30s
      - paths:
        - "/interfaces/interface/state"
        - "/interfaces/interface/ethernet/state"
        stream-mode: sample
        sample-interval: 720s
      - paths:
        - "/interfaces/interface/state/oper-status"
        - "/interfaces/interface/state/admin-status"
        stream-mode: on-change
      - paths:
        - "/system/alarms"
        stream-mode: on-change
      - paths:
        - "/components/component"
        stream-mode: sample
        sample-interval: 60s
    encoding: proto
    mode: stream

outputs:
  prom-remote:
    type: prometheus_write
    url: https://collector-cortex/api/prom/push
    authorization:
      type: Bearer
      credentials: SECRET
    metric-prefix: gnmic
    buffer-size: 10000
    max-time-series-per-write: 1500
    interval: 10s
    max-retries: 0
    strings-as-labels: true
    enable-metrics: false
    tls:
      skip-verify: true
    debug: false

  file:
    type: file
    filename: /tmp/metrics.json
    format: event

  kafka-output:
    type: kafka
    name: gnmic
    address: REMOVED
    topic: metrics.juniper
    sasl:
      user: gnmic
      password: REMOVED
      mechanism: SCRAM-SHA-512
    tls:
      skip-verify: true
    format: event
    override-timestamps: false
    num-workers: 3
    debug: false
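As a sanity check on the prom-remote settings above, here is a rough back-of-the-envelope sketch in Go. All the inputs are assumptions: ~500 series per device per sample cycle is a pure guess, and I am reading interval/max-time-series-per-write as "at most 1500 series pushed every 10s", which may not be how the output actually batches:

// capacity.go - rough ingest vs. write-throughput estimate for the
// prom-remote settings above. All inputs are assumptions, not measurements.
package main

import "fmt"

func main() {
	const (
		devices         = 400    // from the report
		seriesPerDevice = 500.0  // pure guess; depends on interface count
		sampleInterval  = 30.0   // seconds, fastest sampled subscription above
		maxPerWrite     = 1500.0 // max-time-series-per-write
		writeInterval   = 10.0   // seconds (interval: 10s)
		bufferSize      = 10000.0
	)
	ingest := devices * seriesPerDevice / sampleInterval // series/s produced
	drain := maxPerWrite / writeInterval                 // series/s written, if one write per interval
	fmt.Printf("ingest ~%.0f series/s, drain ~%.0f series/s\n", ingest, drain)
	if ingest > drain {
		fmt.Printf("buffer of %.0f fills in ~%.1fs; anything beyond that accumulates in RAM\n",
			bufferSize, bufferSize/(ingest-drain))
	}
}

If that reading is anywhere near right, samples are produced far faster than they are written, so queued data has to pile up somewhere; that would at least be consistent with RAM growing only on the Prometheus outputs while the file and Kafka outputs keep up.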


About this issue

  • State: open
  • Created 9 months ago
  • Comments: 34 (12 by maintainers)

Most upvoted comments

We hit another issue, as you can see in the picture above: our interface is only a 1 Gb interface. We will come back with our final configuration when we solve that.

Our log is full of the following, but I believe it is because of the interface:

2023/10/04 14:34:12.474788 [prometheus_write_output:prom-remote] writing expired after 15s
2023/10/04 14:34:12.474813 [prometheus_write_output:prom-remote] writing expired after 15s
2023/10/04 14:34:12.474585 [prometheus_write_output:prom-remote] writing expired after 15s
2023/10/04 14:34:12.474970 [prometheus_write_output:prom-remote] writing expired after 15s
2023/10/04 14:34:12.474980 [prometheus_write_output:prom-remote] writing expired after 15s
2023/10/04 14:34:12.475001 [prometheus_write_output:prom-remote] writing expired after 15s
2023/10/04 14:34:12.475014 [prometheus_write_output:prom-remote] writing expired after 15s

Would it be possible to get some prometheus write output metrics, like there are for kafka, to correlate with grpc_client_msg_received_total?

e.g. number_of_prometheus_write_msgs_sent_success_total and number_of_prometheus_write_msgs_sent_fail_total.
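For the shape of it, here is a minimal sketch with prometheus/client_golang, using the metric names from the request above; the gnmic namespace and the subsystem name mirror how the kafka output metrics appear to be named, but both are my assumption, not gnmic's actual code:

// writestats.go - sketch of success/failure counters for a remote-write
// output, in the style requested above; not gnmic's actual implementation.
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	writeSuccess = prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: "gnmic",
		Subsystem: "prometheus_write_output", // assumed, by analogy with the kafka output
		Name:      "number_of_prometheus_write_msgs_sent_success_total",
		Help:      "Number of remote-write requests that succeeded.",
	}, []string{"name"}) // "name" = output name, e.g. prom-remote

	writeFail = prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: "gnmic",
		Subsystem: "prometheus_write_output", // assumed, see above
		Name:      "number_of_prometheus_write_msgs_sent_fail_total",
		Help:      "Number of remote-write requests that failed or timed out.",
	}, []string{"name", "reason"})
)

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(writeSuccess, writeFail)
	// Simulate one successful and one timed-out write for output "prom-remote";
	// in gnmic these would be incremented inside the write loop instead.
	writeSuccess.WithLabelValues("prom-remote").Inc()
	writeFail.WithLabelValues("prom-remote", "timeout").Inc()
	fmt.Println("registered gnmic_prometheus_write_output_* counters")
}

With something like this exposed on the api-server's metrics endpoint, the "writing expired" timeouts above would show up as a rising fail counter that could be plotted next to grpc_client_msg_received_total.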