node-problem-detector: Reason changes once and then stays in the wrong state with custom plugins

Hello,

I think I found another issue, though I may have done something wrong. I’m using a custom config based on the ntp example:

{
    "plugin": "custom",
    "pluginConfig": {
        "invoke_interval": "30s",
        "timeout": "5s",
        "max_output_length": 80,
        "concurrency": 3
    },
    "source": "ntp-custom-plugin-monitor",
    "conditions": [
        {
            "type": "CustomProblem",
            "reason": "CustomIsUp",
            "message": "Status of the custom service"
        }
    ],
    "rules": [
        {
            "type": "permanent",
            "condition": "CustomProblem",
            "reason": "CustomIsDown",
            "path": "/usr/bin/custom.sh",
            "timeout": "3s"
        }
    ]
}

The /usr/bin/custom.sh script is very simple: it exits with 0 or 1.

When the node problem detector starts, it sets the condition:

Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  CustomProblem    False   Wed, 05 Sep 2018 14:40:26 +0200   Wed, 05 Sep 2018 14:40:25 +0200   CustomIsUp                   Status of the custom service
  OutOfDisk        False   Wed, 05 Sep 2018 14:40:23 +0200   Thu, 30 Aug 2018 17:16:47 +0200   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Wed, 05 Sep 2018 14:40:23 +0200   Thu, 30 Aug 2018 17:16:47 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 05 Sep 2018 14:40:23 +0200   Wed, 05 Sep 2018 13:17:06 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 05 Sep 2018 14:40:23 +0200   Thu, 30 Aug 2018 17:16:47 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 05 Sep 2018 14:40:23 +0200   Thu, 30 Aug 2018 17:35:04 +0200   KubeletReady                 kubelet is posting ready status

After it runs the script (which returned 0 in this case), the Status stays False, but the Reason field changes to the value I set in the rules section:

Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  CustomProblem    False   Wed, 05 Sep 2018 14:41:56 +0200   Wed, 05 Sep 2018 14:40:55 +0200   CustomIsDown                 Status of the custom service
  OutOfDisk        False   Wed, 05 Sep 2018 14:42:23 +0200   Thu, 30 Aug 2018 17:16:47 +0200   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Wed, 05 Sep 2018 14:42:23 +0200   Thu, 30 Aug 2018 17:16:47 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 05 Sep 2018 14:42:23 +0200   Wed, 05 Sep 2018 13:17:06 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 05 Sep 2018 14:42:23 +0200   Thu, 30 Aug 2018 17:16:47 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 05 Sep 2018 14:42:23 +0200   Thu, 30 Aug 2018 17:35:04 +0200   KubeletReady                 kubelet is posting ready status

On the next run the script exited with 1. The Status is now True, and the Reason is still the same (the value I set in the rule):

Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  CustomProblem    True    Wed, 05 Sep 2018 14:43:56 +0200   Wed, 05 Sep 2018 14:43:55 +0200   CustomIsDown                 Status of the custom service
  OutOfDisk        False   Wed, 05 Sep 2018 14:43:53 +0200   Thu, 30 Aug 2018 17:16:47 +0200   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Wed, 05 Sep 2018 14:43:53 +0200   Thu, 30 Aug 2018 17:16:47 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 05 Sep 2018 14:43:53 +0200   Wed, 05 Sep 2018 13:17:06 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 05 Sep 2018 14:43:53 +0200   Thu, 30 Aug 2018 17:16:47 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 05 Sep 2018 14:43:53 +0200   Thu, 30 Aug 2018 17:35:04 +0200   KubeletReady                 kubelet is posting ready status

On the following run the script returned 0 again and the Status changed back to False, but the Reason didn’t change:

Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  CustomProblem    False   Wed, 05 Sep 2018 14:44:26 +0200   Wed, 05 Sep 2018 14:44:25 +0200   CustomIsDown                 Status of the custom service
  OutOfDisk        False   Wed, 05 Sep 2018 14:44:23 +0200   Thu, 30 Aug 2018 17:16:47 +0200   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Wed, 05 Sep 2018 14:44:23 +0200   Thu, 30 Aug 2018 17:16:47 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 05 Sep 2018 14:44:23 +0200   Wed, 05 Sep 2018 13:17:06 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 05 Sep 2018 14:44:23 +0200   Thu, 30 Aug 2018 17:16:47 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 05 Sep 2018 14:44:23 +0200   Thu, 30 Aug 2018 17:35:04 +0200   KubeletReady                 kubelet is posting ready status

As far as I can see, the code overwrites the condition’s reason with the rule’s reason, so the original reason (CustomIsUp) is lost and the node problem detector can never set it again: https://github.com/kubernetes/node-problem-detector/blob/master/pkg/custompluginmonitor/custom_plugin_monitor.go#L140 But maybe I missed something.
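
To make the suspected behaviour concrete, here is a minimal Go sketch that replays the three runs above (exit 0, 1, 0). It is only a simplified model of what I observe, not the actual node-problem-detector code: the Condition and Rule types and the functions applyOverwrite, applyWithDefault and replay are hypothetical names made up for this illustration. applyWithDefault shows one possible way to avoid the problem, assuming the default reason from the "conditions" section is kept around.

```go
package main

import "fmt"

// Condition is a simplified stand-in for a node condition
// (hypothetical type, not the one used by node-problem-detector).
type Condition struct {
	Type   string
	Status bool // true means the problem is present
	Reason string
}

// Rule mirrors the relevant fields of the "rules" entry in the config above.
type Rule struct {
	Condition string
	Reason    string
}

// applyOverwrite models the reported behaviour: the rule's reason is
// written on every run, whatever the exit status was.
func applyOverwrite(c *Condition, r Rule, exitStatus int) {
	c.Status = exitStatus != 0
	c.Reason = r.Reason
}

// applyWithDefault is one possible fix (an assumption, not the upstream
// patch): remember the configured default reason and restore it on exit 0.
func applyWithDefault(c *Condition, r Rule, defaultReason string, exitStatus int) {
	c.Status = exitStatus != 0
	if c.Status {
		c.Reason = r.Reason
	} else {
		c.Reason = defaultReason
	}
}

// replay feeds a sequence of exit statuses to both variants and
// returns the final condition of each.
func replay(exits []int) (overwrite, withDefault Condition) {
	rule := Rule{Condition: "CustomProblem", Reason: "CustomIsDown"}
	overwrite = Condition{Type: "CustomProblem", Reason: "CustomIsUp"}
	withDefault = overwrite
	for _, e := range exits {
		applyOverwrite(&overwrite, rule, e)
		applyWithDefault(&withDefault, rule, "CustomIsUp", e)
	}
	return overwrite, withDefault
}

func main() {
	// Exit statuses of the three runs described in this issue.
	a, b := replay([]int{0, 1, 0})
	fmt.Printf("overwrite:    status=%v reason=%s\n", a.Status, a.Reason)
	fmt.Printf("with default: status=%v reason=%s\n", b.Status, b.Reason)
}
```

With the overwrite variant the condition ends up with Status False and Reason CustomIsDown, which matches the last table above. With the default-restoring variant it ends up with Status False and Reason CustomIsUp, which is what I would expect.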

Thank you!

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 17 (9 by maintainers)

Most upvoted comments

The fix is now included in NPD v0.6.6. @AlexShemeshWix and @MaksymTrykur, can you double check if your problem is fixed?