integrations-core: kubernetes.pods.running reporting incorrectly

Output of the info page


==============
Agent (v6.4.2)
==============

  Status date: 2018-08-24 00:05:55.602398 UTC
  Pid: 352
  Python Version: 2.7.15
  Logs:
  Check Runners: 2
  Log Level: WARNING

    kubernetes_apiserver
    --------------------
      Total Runs: 53293
      Metric Samples: 0, Total: 0
      Events: 0, Total: 0
      Service Checks: 0, Total: 0
      Average Execution Time : 4ms


(a ton of unrelated and possibly sensitive stuff removed)

Additional environment details (Operating System, Cloud provider, etc): GKE - kubernetes 1.10

Steps to reproduce the issue: Have a k8s cluster monitored by Datadog where at least one pod is in a Failed state (or any phase other than Running).

Describe the results you received: The metric appears to count pods in every phase, including Failed.

Describe the results you expected: Simple fix: the metric is filtered to count only pods where status.phase == Running.
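For illustration only, a minimal sketch of the intended filtering, assuming the check has a decoded kubelet-style pod list at hand (the field names mirror the Kubernetes Pod API; the helper itself is hypothetical, not the actual check code):

```python
def count_running_pods(pod_list):
    """Count only pods whose status.phase is Running.

    `pod_list` is assumed to be the decoded JSON of a kubelet /pods
    response: {"items": [{"status": {"phase": "Running"}, ...}, ...]}.
    """
    return sum(
        1
        for pod in pod_list.get("items", [])
        if pod.get("status", {}).get("phase") == "Running"
    )
```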

Enhancement: the metric is replaced by kubernetes.pods.count with status.phase added as a tag, allowing accurate reporting of pods in, e.g., the Failed state. This would enable more useful metrics and reporting.
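A sketch of what the proposed kubernetes.pods.count enhancement could look like, assuming one gauge sample per phase with a phase tag (the metric name and tag come from the suggestion above; the `gauge` callable is a stand-in for whatever metric-submission interface the check uses, not a confirmed API):

```python
from collections import Counter


def report_pod_counts(pod_list, gauge):
    """Report one gauge sample per pod phase, tagged with that phase.

    `gauge(name, value, tags)` stands in for the check's metric
    submission method; `pod_list` is the same kubelet-style payload
    as in the previous sketch.
    """
    phases = Counter(
        pod.get("status", {}).get("phase", "Unknown")
        for pod in pod_list.get("items", [])
    )
    for phase, count in phases.items():
        gauge("kubernetes.pods.count", count, tags=["phase:%s" % phase])
```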

Note that a similar metric is exposed by kubernetes_state when it is configured, but that shouldn’t excuse the inaccuracy of the other one.

Additional information you deem important (e.g. issue happens only occasionally):

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 11
  • Comments: 16 (1 by maintainers)

Most upvoted comments

I can confirm that during a period in which we terminate some pods and then create new ones (not via replicas, simply scheduling pending pods on a node that now has free resources) we see three times the real number of pods. I believe this happens when we launch the new pods. In our case we launch one pod per namespace, but DD reports 3 for 5 minutes.

I’d also like to add that we consistently see inaccurate measurements for the pods running metric.

The numbers are off by 100% during scaling periods and can take up to 10 minutes to stabilize. Turning off interpolation in the metric graphs shows a sawtooth pattern.

@ahmed-mez this issue is not resolved by setting sum by, as I commented last year.

We face this issue too. kubernetes.pods.running shows only a single pod most of the time. Sometimes it changes to a floating-point number (up to 1.4) even when there are definitely several pods running.