gnmic: I think the Prometheus output has a memory leak that crashes the server it runs on.
We have noticed strange behavior when we use any of the Prometheus outputs: gnmic's RAM usage just keeps going up and up until the server starts to swap and becomes unresponsive.
It works fine with around 20-30 devices, but when we try 400+ devices we start getting problems, and that is only half of the devices we have. I tried strings-as-labels both false and true. For the scrape-based Prometheus output I tried lowering the cache timer, which did not help. For the remote-write output I played with buffer-size, interval, and max-time-series-per-write; nothing helps. I do not have any event-processors.
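For reference, the scrape-based variant I tested looked roughly like this. This is only a sketch: the listen port and the expiration value are placeholders, not our exact settings.

outputs:
  prom-scrape:
    type: prometheus
    listen: :9804            # placeholder port for the scrape endpoint
    path: /metrics
    metric-prefix: gnmic
    strings-as-labels: true
    expiration: 30s          # the "cache timer" I tried lowering
    debug: false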
If I use the Kafka or file outputs instead, I do not see a continuous increase in RAM usage.
CPU usage is ~25% (5 min load average).
Below are metrics of RAM usage, and under that disk usage, where you can see that once we changed to the file output the RAM no longer increases.
In the picture you can see that at ~17:20 RAM usage went up a little; that was when we added the Kafka output. Our gnmic configuration follows.
insecure: true
encoding: PROTO
log: true
gzip: false
timeout: 60s
debug: false

api-server:
  address: :7890
  skip-verify: true
  enable-metrics: true

loader:
  type: http
  timeout: 60s
  enable-metrics: true
  tls:
    skip-verify: true
  url: "https://gnmic-inventory/gnmic/"

subscriptions:
  juniper:
    stream-subscriptions:
      - paths:
          - "/junos/system/linecard/interface"
          - "/interfaces/interface/ethernet/state/counters"
        stream-mode: sample
        sample-interval: 30s
      - paths:
          - "/interfaces/interface/state"
          - "/interfaces/interface/ethernet/state"
        stream-mode: sample
        sample-interval: 720s
      - paths:
          - "/interfaces/interface/state/oper-status"
          - "/interfaces/interface/state/admin-status"
        stream-mode: on-change
      - paths:
          - "/system/alarms"
        stream-mode: on-change
      - paths:
          - "/components/component"
        stream-mode: sample
        sample-interval: 60s
    encoding: proto
    mode: stream

outputs:
  prom-remote:
    type: prometheus_write
    url: https://collector-cortex/api/prom/push
    authorization:
      type: Bearer
      credentials: SECRET
    metric-prefix: gnmic
    buffer-size: 10000
    max-time-series-per-write: 1500
    interval: 10s
    max-retries: 0
    strings-as-labels: true
    enable-metrics: false
    tls:
      skip-verify: true
    debug: false
  file:
    type: file
    filename: /tmp/metrics.json
    format: event
  kafka-output:
    type: kafka
    name: gnmic
    address: REMOVED
    topic: metrics.juniper
    sasl:
      user: gnmic
      password: REMOVED
      mechanism: SCRAM-SHA-512
    tls:
      skip-verify: true
    format: event
    override-timestamps: false
    num-workers: 3
    debug: false
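Since the api-server already runs with enable-metrics: true, we also scrape gnmic's own :7890/metrics to watch its Go runtime memory (go_memstats_*, assuming the Go runtime collectors are registered) while reproducing this. A minimal Prometheus scrape-config sketch; the job name, interval, and hostname are placeholders, and the scheme/TLS bits need to match the api-server settings:

scrape_configs:
  - job_name: gnmic-self                 # placeholder job name
    scrape_interval: 30s
    static_configs:
      - targets: ["gnmic-host:7890"]     # gnmic api-server address; hostname is a placeholder
    # metrics_path defaults to /metrics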
We hit another issue, as you can see in the picture above: our interface is only a 1 Gb interface. We will come back with our final configuration when we solve that.
Our log is full of these errors, but I believe it is because of the interface.
Would it be possible to get some Prometheus write output metrics, like there are for Kafka, to correlate with grpc_client_msg_received_total?
e.g. number_of_prometheus_write_msgs_sent_success_total and number_of_prometheus_write_msgs_sent_fail_total
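In the meantime, would it be enough to flip enable-metrics: true on the prom-remote output (we currently have it set to false), so that whatever counters the output does register show up on the api-server's /metrics next to the Kafka ones? Sketch of the change:

outputs:
  prom-remote:
    type: prometheus_write
    # ... rest unchanged ...
    enable-metrics: true   # was false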