kubernetes: [Probes] Incorrect application of the initialDelaySeconds

Hello,

We are experiencing an issue where pod probes are not run in sync with their configuration, which suggests that initialDelaySeconds is not correctly applied (or that some other random delay is introduced).

We have a deployment that contains a livenessProbe and a readinessProbe configured as follows:

        livenessProbe:
          exec:
            command: [ "/opt/prime/bin/prime_health_checker", "control_status" ]
          initialDelaySeconds: 30
          periodSeconds: 30
          timeoutSeconds: 10
        readinessProbe:
          exec:
            command: [ "/opt/prime/bin/prime_health_checker", "rabbit_riak_connectivity", "log_blocking_fix" ]
          initialDelaySeconds: 15
          periodSeconds: 30
          timeoutSeconds: 10

According to the configuration, the two probes should run 15 seconds apart, but this is not the case. The time difference is actually variable and random for each pod and replica.
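
For reference, here is a minimal sketch (Go, with a hypothetical container start time t0) of the schedule we would naively expect from that configuration; it illustrates our reading of the docs, not how the kubelet actually schedules probes:

    package main

    import (
        "fmt"
        "time"
    )

    // expectedTimes returns the probe times we would naively expect: the
    // first at t0 + initialDelay, then one every period after that.
    func expectedTimes(t0 time.Time, initialDelay, period time.Duration, n int) []time.Time {
        times := make([]time.Time, 0, n)
        for i := 0; i < n; i++ {
            times = append(times, t0.Add(initialDelay+time.Duration(i)*period))
        }
        return times
    }

    func main() {
        // Hypothetical container start time, for illustration only.
        t0 := time.Date(2020, 11, 13, 13, 44, 0, 0, time.UTC)
        fmt.Println("readiness:", expectedTimes(t0, 15*time.Second, 30*time.Second, 3))
        fmt.Println("liveness: ", expectedTimes(t0, 30*time.Second, 30*time.Second, 3))
        // Both probes repeat every 30s, so under this reading they would
        // always stay exactly 15s apart, which is not what we observe.
    }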

Here are some logs from the kubelet, showing only a 6-second delay between the two probes:

Nov 13 13:44:14 ip-10-213-59-169 kubelet[17547]: I1113 13:44:14.954625   17547 prober.go:162] Exec-Probe Pod: &Pod{ObjectMeta:{taskman-sandbox-0 taskman-sandbox- sandbox /api/v1/namespaces/sandbox/pods/taskman-sandbox-0 0615148c-1210-4cb5-850b-52992f9a7ac9 46134867 0 2020-11-11 04:51:13 +0000 UTC <nil> <nil> map[controller-revision-hash:taskman-sandbox-755dc74dd8 pypestream.com/app:taskman pypestream.com/component:taskman pypestream.com/env:sandbox pypestream.com/group:prime pypestream.com/namespace:sandbox spotinst.io/restrict-scale-down:true statefulset.kubernetes.io/pod-name:taskman-sandbox-0] map[cni.projectcalico.org/podIP:100.119.77.208/32 cni.projectcalico.org/podIPs:100.119.77.208/32 fluentbit.io/parser:prime_log kubernetes.io/config.seen:2020-11-13T10:26:49.001091487Z kubernetes.io/config.source:api] [{apps/v1 StatefulSet taskman-sandbox bfb661c9-4ebc-4c6f-9c2f-98deaec09e56 0xc0005e7c16 0xc0005e7c17}] []  []},Spec:PodSpec{Volumes:[]Volume{Volume{Name:taskman-keys,VolumeSource:VolumeSource{HostPath:nil,EmptyDir:nil,GCEPersistentDisk:nil,AWSElasticBlockStore:nil,GitRepo:nil,Secret:&SecretVolumeSource{SecretName:taskman-keys,Items:[]KeyToPath{},DefaultMode:*420,Optional:nil,},NFS:nil,ISCSI:nil,Glusterfs:nil,PersistentVolumeClaim:nil,RBD:nil,FlexVolume:nil,Cinder:nil,CephFS:nil,Flocker:nil,DownwardAPI:nil,FC:nil,AzureFile:nil,ConfigMap:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,PortworxVolume:nil,ScaleIO:nil,Projected:nil,StorageOS:nil,CSI:nil,},},Volume{Name:default-token-rcwqc,VolumeSource:VolumeSource{HostPath:nil,EmptyDir:nil,GCEPersistentDisk:nil,AWSElasticBlockStore:nil,GitRepo:nil,Secret:&SecretVolumeSource{SecretName:default-token-rcwqc,Items:[]KeyToPath{},DefaultMode:*420,Optional:nil,},NFS:nil,ISCSI:nil,Glusterfs:nil,PersistentVolumeClaim:nil,RBD:nil,FlexVolume:nil,Cinder:nil,CephFS:nil,Flocker:nil,DownwardAPI:nil,FC:nil,AzureFile:nil,ConfigMap:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,PortworxVolume:nil,ScaleIO:nil,Projected:nil,StorageOS:nil,CSI:nil,},},},Containers:[]Container{Contai
Nov 13 13:44:14 ip-10-213-59-169 kubelet[17547]: ,Optional:nil,},},EnvFromSource{Prefix:,ConfigMapRef:nil,SecretRef:&SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:common-aws-s3-presentation-key,},Optional:nil,},},EnvFromSource{Prefix:,ConfigMapRef:nil,SecretRef:&SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:common-freshchat-token,},Optional:nil,},},EnvFromSource{Prefix:,ConfigMapRef:&ConfigMapEnvSource{LocalObjectReference:LocalObjectReference{Name:taskman-env,},Optional:nil,},SecretRef:nil,},EnvFromSource{Prefix:,ConfigMapRef:nil,SecretRef:&SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:taskman-aws-keys,},Optional:nil,},},EnvFromSource{Prefix:,ConfigMapRef:nil,SecretRef:&SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:taskman-rabbitmq-user,},Optional:nil,},},},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,},},RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:ip-10-213-59-169.us-west-2.compute.internal,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[]Sysctl{},WindowsOptions:nil,},ImagePullSecrets:[]LocalObjectReference{},Hostname:taskman-sandbox-0,Subdomain:taskman-sandbox,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[]Container{},AutomountServiceAccountToken:nil,Tolerations:[]Toleration{Toleration{Key:node.kubernetes.io/not-ready,Operator:Exists,Value:,Effect:NoExecute,TolerationSeconds:*300,},Toleration{Key:node.kubernetes.io/unreachable,Operator:Exists,Value:,Effect:NoExecute,TolerationSeconds:*300,},},HostAliases:[]HostAlias{},PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[]PodReadinessGate{},RuntimeClassName:nil,EnableServiceLinks:*true,PreemptionPolicy:nil,Overhead:ResourceList{},TopologySpreadConstraints:[]Topolo
Nov 13 13:44:17 ip-10-213-59-169 kubelet[17547]: I1113 13:44:17.678245   17547 prober.go:129] Liveness probe for "taskman-sandbox-0_sandbox(0615148c-1210-4cb5-850b-52992f9a7ac9):taskman-sandbox" succeeded
Nov 13 13:44:20 ip-10-213-59-169 kubelet[17547]: I1113 13:44:20.547366   17547 prober.go:162] Exec-Probe Pod: &Pod{ObjectMeta:{taskman-sandbox-0 taskman-sandbox- sandbox /api/v1/namespaces/sandbox/pods/taskman-sandbox-0 0615148c-1210-4cb5-850b-52992f9a7ac9 46134867 0 2020-11-11 04:51:13 +0000 UTC <nil> <nil> map[controller-revision-hash:taskman-sandbox-755dc74dd8 pypestream.com/app:taskman pypestream.com/component:taskman pypestream.com/env:sandbox pypestream.com/group:prime pypestream.com/namespace:sandbox spotinst.io/restrict-scale-down:true statefulset.kubernetes.io/pod-name:taskman-sandbox-0] map[cni.projectcalico.org/podIP:100.119.77.208/32 cni.projectcalico.org/podIPs:100.119.77.208/32 fluentbit.io/parser:prime_log kubernetes.io/config.seen:2020-11-13T10:26:49.001091487Z kubernetes.io/config.source:api] [{apps/v1 StatefulSet taskman-sandbox bfb661c9-4ebc-4c6f-9c2f-98deaec09e56 0xc0005e7c16 0xc0005e7c17}] []  []},Spec:PodSpec{Volumes:[]Volume{Volume{Name:taskman-keys,VolumeSource:VolumeSource{HostPath:nil,EmptyDir:nil,GCEPersistentDisk:nil,AWSElasticBlockStore:nil,GitRepo:nil,Secret:&SecretVolumeSource{SecretName:taskman-keys,Items:[]KeyToPath{},DefaultMode:*420,Optional:nil,},NFS:nil,ISCSI:nil,Glusterfs:nil,PersistentVolumeClaim:nil,RBD:nil,FlexVolume:nil,Cinder:nil,CephFS:nil,Flocker:nil,DownwardAPI:nil,FC:nil,AzureFile:nil,ConfigMap:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,PortworxVolume:nil,ScaleIO:nil,Projected:nil,StorageOS:nil,CSI:nil,},},Volume{Name:default-token-rcwqc,VolumeSource:VolumeSource{HostPath:nil,EmptyDir:nil,GCEPersistentDisk:nil,AWSElasticBlockStore:nil,GitRepo:nil,Secret:&SecretVolumeSource{SecretName:default-token-rcwqc,Items:[]KeyToPath{},DefaultMode:*420,Optional:nil,},NFS:nil,ISCSI:nil,Glusterfs:nil,PersistentVolumeClaim:nil,RBD:nil,FlexVolume:nil,Cinder:nil,CephFS:nil,Flocker:nil,DownwardAPI:nil,FC:nil,AzureFile:nil,ConfigMap:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,PortworxVolume:nil,ScaleIO:nil,Projected:nil,StorageOS:nil,CSI:nil,},},},Containers:[]Container{Contai
Nov 13 13:44:20 ip-10-213-59-169 kubelet[17547]: ,Optional:nil,},},EnvFromSource{Prefix:,ConfigMapRef:nil,SecretRef:&SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:common-aws-s3-presentation-key,},Optional:nil,},},EnvFromSource{Prefix:,ConfigMapRef:nil,SecretRef:&SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:common-freshchat-token,},Optional:nil,},},EnvFromSource{Prefix:,ConfigMapRef:&ConfigMapEnvSource{LocalObjectReference:LocalObjectReference{Name:taskman-env,},Optional:nil,},SecretRef:nil,},EnvFromSource{Prefix:,ConfigMapRef:nil,SecretRef:&SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:taskman-aws-keys,},Optional:nil,},},EnvFromSource{Prefix:,ConfigMapRef:nil,SecretRef:&SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:taskman-rabbitmq-user,},Optional:nil,},},},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,},},RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:ip-10-213-59-169.us-west-2.compute.internal,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[]Sysctl{},WindowsOptions:nil,},ImagePullSecrets:[]LocalObjectReference{},Hostname:taskman-sandbox-0,Subdomain:taskman-sandbox,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[]Container{},AutomountServiceAccountToken:nil,Tolerations:[]Toleration{Toleration{Key:node.kubernetes.io/not-ready,Operator:Exists,Value:,Effect:NoExecute,TolerationSeconds:*300,},Toleration{Key:node.kubernetes.io/unreachable,Operator:Exists,Value:,Effect:NoExecute,TolerationSeconds:*300,},},HostAliases:[]HostAlias{},PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[]PodReadinessGate{},RuntimeClassName:nil,EnableServiceLinks:*true,PreemptionPolicy:nil,Overhead:ResourceList{},TopologySpreadConstraints:[]Topolo
Nov 13 13:44:23 ip-10-213-59-169 kubelet[17547]: I1113 13:44:23.377094   17547 prober.go:129] Readiness probe for "taskman-sandbox-0_sandbox(0615148c-1210-4cb5-850b-52992f9a7ac9):taskman-sandbox" succeeded

Here is what we noticed:

  • The time difference stays unchanged if the pod is restarted (by Kubernetes, after liveness probe failures).
  • If the pod is re-created, the timings change.
  • Timings are random and “unique” per replica.
  • This issue is not caused by clock drift, as it is observable as soon as the pod is created.

Environment:

  • K8s 1.16.10
  • Kops 1.16.3
  • Debian Stretch
  • Linux 4.9.0-13-amd64 #1 SMP Debian 4.9.228-1 (2020-07-05) x86_64 GNU/Linux

Thanks for the help and support.


Most upvoted comments

I believe that this is expected behavior. I’ll report my findings and we’ll see if someone who knows this better than me can confirm. InitialDelaySeconds only guarantees that a probe never runs before the delay has passed. So I would imagine that you’re at least seeing that the probes aren’t occurring before that initial delay.

https://github.com/kubernetes/kubernetes/blob/b2ecd1b3a3192fbbe2b9e348e095326f51dc43dd/pkg/kubelet/prober/worker.go#L226-L228
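
Paraphrasing that check as a small, self-contained sketch (a rough model of the linked lines, not the actual kubelet source): the worker simply skips probing while the container has been running for less than initialDelaySeconds; nothing schedules the first probe at exactly startedAt + initialDelaySeconds.

    package main

    import (
        "fmt"
        "time"
    )

    // probeAllowed mirrors the gist of the linked check: probing is
    // refused while the container has been running for less than
    // initialDelaySeconds, and that is all the delay guarantees.
    func probeAllowed(startedAt time.Time, initialDelaySeconds int32) bool {
        return int32(time.Since(startedAt).Seconds()) >= initialDelaySeconds
    }

    func main() {
        startedAt := time.Now().Add(-20 * time.Second) // container started 20s ago
        fmt.Println(probeAllowed(startedAt, 30))       // false: still inside the delay
        fmt.Println(probeAllowed(startedAt, 15))       // true: the delay has elapsed
    }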

What you’re probably seeing is a product of this

https://github.com/kubernetes/kubernetes/blob/b2ecd1b3a3192fbbe2b9e348e095326f51dc43dd/pkg/kubelet/prober/worker.go#L128-L133

which indicates that the actual first probe time can fall anywhere after InitialDelaySeconds and before InitialDelaySeconds + PeriodSeconds. Given the code comment, the idea behind this “random timing” is to avoid a flood of probes from a kubelet that has just started/restarted; we’re basically getting some “jitter” here.
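
A minimal sketch of that startup jitter (a simplified model of the linked lines, not the exact kubelet source):

    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    func main() {
        // Each probe worker sleeps a random fraction of periodSeconds
        // before starting its ticker, so its first probe can land
        // anywhere in [initialDelaySeconds, initialDelaySeconds + periodSeconds).
        period := 30 * time.Second
        jitter := time.Duration(rand.Float64() * float64(period))
        fmt.Printf("this worker waits an extra %v before its first probe attempt\n",
            jitter.Round(time.Second))

        ticker := time.NewTicker(period)
        defer ticker.Stop()
        // ... the probe loop would run here, driven by ticker.C ...
    }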

Again, I didn’t write this but based on your description and what I’m seeing in the code, I believe what you’re seeing is expected behavior.

Also I believe this falls under sig/node.

I see 3 problems with predictable probe invocation.

  1. initialDelaySeconds doesn’t work as the spec says: “Number of seconds after the container has started before probes are initiated.” Intuitively, that means the first invocation of the probe happens at containerStatuses.state.running.startedAt + initialDelaySeconds. The implementation is far from that, as shown in https://github.com/kubernetes/kubernetes/issues/96614#issuecomment-731654129.

  2. The jitters given to the probes of the same container are different, as shown in https://github.com/kubernetes/kubernetes/issues/96614#issuecomment-775925747 (see the sketch after this list). https://github.com/kubernetes/kubernetes/pull/102064 is trying to solve this.

  3. Since 1.21, the jitters are only added when the kubelet has recently been started or restarted, as shown in https://github.com/kubernetes/kubernetes/issues/96614#issuecomment-1236258089.
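
To make problem 2 concrete, here is a minimal sketch (a simplified model of the behavior described above, not the kubelet code itself) showing why two probes of the same container end up a random distance apart:

    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    func main() {
        period := 30 * time.Second

        // The liveness and readiness workers of the same container each
        // draw their own independent startup jitter, so the gap between
        // their first probes is essentially random rather than the
        // configured 15s.
        livenessJitter := time.Duration(rand.Float64() * float64(period))
        readinessJitter := time.Duration(rand.Float64() * float64(period))

        fmt.Println("liveness jitter: ", livenessJitter.Round(time.Second))
        fmt.Println("readiness jitter:", readinessJitter.Round(time.Second))
        fmt.Println("gap between them:", (livenessJitter - readinessJitter).Round(time.Second))
    }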