kubernetes: [Probes] Incorrect application of the initialDelaySeconds
Hello,
We are experiencing an issue where pod probes are not run in sync with their configuration, which suggests that initialDelaySeconds is not applied correctly (or that some other random delay is introduced).
We have a deployment that contains a livenessProbe and a readinessProbe configured as follows:
livenessProbe:
  exec:
    command: [ "/opt/prime/bin/prime_health_checker", "control_status" ]
  initialDelaySeconds: 30
  periodSeconds: 30
  timeoutSeconds: 10
readinessProbe:
  exec:
    command: [ "/opt/prime/bin/prime_health_checker", "rabbit_riak_connectivity", "log_blocking_fix" ]
  initialDelaySeconds: 15
  periodSeconds: 30
  timeoutSeconds: 10
According to this configuration, the two probes should run 15 seconds apart from one another, but this is not the case. The time difference is actually variable and random for each pod and replica.
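For reference, under the intuitive reading of the spec (first probe exactly initialDelaySeconds after container start, then one every periodSeconds), the timeline relative to container start would be:

```
t=15s  readiness #1     t=30s  liveness #1
t=45s  readiness #2     t=60s  liveness #2
t=75s  readiness #3     t=90s  liveness #3
...
```

i.e. a constant 15-second gap between the two probes.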
Here are some kubelet logs showing only a 6-second gap between the two probes:
Nov 13 13:44:14 ip-10-213-59-169 kubelet[17547]: I1113 13:44:14.954625 17547 prober.go:162] Exec-Probe Pod: &Pod{ObjectMeta:{taskman-sandbox-0 taskman-sandbox- sandbox /api/v1/namespaces/sandbox/pods/taskman-sandbox-0 0615148c-1210-4cb5-850b-52992f9a7ac9 46134867 0 2020-11-11 04:51:13 +0000 UTC <nil> <nil> map[controller-revision-hash:taskman-sandbox-755dc74dd8 pypestream.com/app:taskman pypestream.com/component:taskman pypestream.com/env:sandbox pypestream.com/group:prime pypestream.com/namespace:sandbox spotinst.io/restrict-scale-down:true statefulset.kubernetes.io/pod-name:taskman-sandbox-0] map[cni.projectcalico.org/podIP:100.119.77.208/32 cni.projectcalico.org/podIPs:100.119.77.208/32 fluentbit.io/parser:prime_log kubernetes.io/config.seen:2020-11-13T10:26:49.001091487Z kubernetes.io/config.source:api] [{apps/v1 StatefulSet taskman-sandbox bfb661c9-4ebc-4c6f-9c2f-98deaec09e56 0xc0005e7c16 0xc0005e7c17}] [] []},Spec:PodSpec{Volumes:[]Volume{Volume{Name:taskman-keys,VolumeSource:VolumeSource{HostPath:nil,EmptyDir:nil,GCEPersistentDisk:nil,AWSElasticBlockStore:nil,GitRepo:nil,Secret:&SecretVolumeSource{SecretName:taskman-keys,Items:[]KeyToPath{},DefaultMode:*420,Optional:nil,},NFS:nil,ISCSI:nil,Glusterfs:nil,PersistentVolumeClaim:nil,RBD:nil,FlexVolume:nil,Cinder:nil,CephFS:nil,Flocker:nil,DownwardAPI:nil,FC:nil,AzureFile:nil,ConfigMap:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,PortworxVolume:nil,ScaleIO:nil,Projected:nil,StorageOS:nil,CSI:nil,},},Volume{Name:default-token-rcwqc,VolumeSource:VolumeSource{HostPath:nil,EmptyDir:nil,GCEPersistentDisk:nil,AWSElasticBlockStore:nil,GitRepo:nil,Secret:&SecretVolumeSource{SecretName:default-token-rcwqc,Items:[]KeyToPath{},DefaultMode:*420,Optional:nil,},NFS:nil,ISCSI:nil,Glusterfs:nil,PersistentVolumeClaim:nil,RBD:nil,FlexVolume:nil,Cinder:nil,CephFS:nil,Flocker:nil,DownwardAPI:nil,FC:nil,AzureFile:nil,ConfigMap:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,PortworxVolume:nil,ScaleIO:nil,Projected:nil,StorageOS:nil,CSI:nil,},},},Containers:[]Container{Contai
Nov 13 13:44:14 ip-10-213-59-169 kubelet[17547]: ,Optional:nil,},},EnvFromSource{Prefix:,ConfigMapRef:nil,SecretRef:&SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:common-aws-s3-presentation-key,},Optional:nil,},},EnvFromSource{Prefix:,ConfigMapRef:nil,SecretRef:&SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:common-freshchat-token,},Optional:nil,},},EnvFromSource{Prefix:,ConfigMapRef:&ConfigMapEnvSource{LocalObjectReference:LocalObjectReference{Name:taskman-env,},Optional:nil,},SecretRef:nil,},EnvFromSource{Prefix:,ConfigMapRef:nil,SecretRef:&SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:taskman-aws-keys,},Optional:nil,},},EnvFromSource{Prefix:,ConfigMapRef:nil,SecretRef:&SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:taskman-rabbitmq-user,},Optional:nil,},},},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,},},RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:ip-10-213-59-169.us-west-2.compute.internal,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[]Sysctl{},WindowsOptions:nil,},ImagePullSecrets:[]LocalObjectReference{},Hostname:taskman-sandbox-0,Subdomain:taskman-sandbox,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[]Container{},AutomountServiceAccountToken:nil,Tolerations:[]Toleration{Toleration{Key:node.kubernetes.io/not-ready,Operator:Exists,Value:,Effect:NoExecute,TolerationSeconds:*300,},Toleration{Key:node.kubernetes.io/unreachable,Operator:Exists,Value:,Effect:NoExecute,TolerationSeconds:*300,},},HostAliases:[]HostAlias{},PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[]PodReadinessGate{},RuntimeClassName:nil,EnableServiceLinks:*true,PreemptionPolicy:nil,Overhead:ResourceList{},TopologySpreadConstraints:[]Topolo
Nov 13 13:44:17 ip-10-213-59-169 kubelet[17547]: I1113 13:44:17.678245 17547 prober.go:129] Liveness probe for "taskman-sandbox-0_sandbox(0615148c-1210-4cb5-850b-52992f9a7ac9):taskman-sandbox" succeeded
Nov 13 13:44:20 ip-10-213-59-169 kubelet[17547]: I1113 13:44:20.547366 17547 prober.go:162] Exec-Probe Pod: &Pod{ObjectMeta:{taskman-sandbox-0 taskman-sandbox- sandbox /api/v1/namespaces/sandbox/pods/taskman-sandbox-0 0615148c-1210-4cb5-850b-52992f9a7ac9 46134867 0 2020-11-11 04:51:13 +0000 UTC <nil> <nil> map[controller-revision-hash:taskman-sandbox-755dc74dd8 pypestream.com/app:taskman pypestream.com/component:taskman pypestream.com/env:sandbox pypestream.com/group:prime pypestream.com/namespace:sandbox spotinst.io/restrict-scale-down:true statefulset.kubernetes.io/pod-name:taskman-sandbox-0] map[cni.projectcalico.org/podIP:100.119.77.208/32 cni.projectcalico.org/podIPs:100.119.77.208/32 fluentbit.io/parser:prime_log kubernetes.io/config.seen:2020-11-13T10:26:49.001091487Z kubernetes.io/config.source:api] [{apps/v1 StatefulSet taskman-sandbox bfb661c9-4ebc-4c6f-9c2f-98deaec09e56 0xc0005e7c16 0xc0005e7c17}] [] []},Spec:PodSpec{Volumes:[]Volume{Volume{Name:taskman-keys,VolumeSource:VolumeSource{HostPath:nil,EmptyDir:nil,GCEPersistentDisk:nil,AWSElasticBlockStore:nil,GitRepo:nil,Secret:&SecretVolumeSource{SecretName:taskman-keys,Items:[]KeyToPath{},DefaultMode:*420,Optional:nil,},NFS:nil,ISCSI:nil,Glusterfs:nil,PersistentVolumeClaim:nil,RBD:nil,FlexVolume:nil,Cinder:nil,CephFS:nil,Flocker:nil,DownwardAPI:nil,FC:nil,AzureFile:nil,ConfigMap:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,PortworxVolume:nil,ScaleIO:nil,Projected:nil,StorageOS:nil,CSI:nil,},},Volume{Name:default-token-rcwqc,VolumeSource:VolumeSource{HostPath:nil,EmptyDir:nil,GCEPersistentDisk:nil,AWSElasticBlockStore:nil,GitRepo:nil,Secret:&SecretVolumeSource{SecretName:default-token-rcwqc,Items:[]KeyToPath{},DefaultMode:*420,Optional:nil,},NFS:nil,ISCSI:nil,Glusterfs:nil,PersistentVolumeClaim:nil,RBD:nil,FlexVolume:nil,Cinder:nil,CephFS:nil,Flocker:nil,DownwardAPI:nil,FC:nil,AzureFile:nil,ConfigMap:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,PortworxVolume:nil,ScaleIO:nil,Projected:nil,StorageOS:nil,CSI:nil,},},},Containers:[]Container{Contai
Nov 13 13:44:20 ip-10-213-59-169 kubelet[17547]: ,Optional:nil,},},EnvFromSource{Prefix:,ConfigMapRef:nil,SecretRef:&SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:common-aws-s3-presentation-key,},Optional:nil,},},EnvFromSource{Prefix:,ConfigMapRef:nil,SecretRef:&SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:common-freshchat-token,},Optional:nil,},},EnvFromSource{Prefix:,ConfigMapRef:&ConfigMapEnvSource{LocalObjectReference:LocalObjectReference{Name:taskman-env,},Optional:nil,},SecretRef:nil,},EnvFromSource{Prefix:,ConfigMapRef:nil,SecretRef:&SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:taskman-aws-keys,},Optional:nil,},},EnvFromSource{Prefix:,ConfigMapRef:nil,SecretRef:&SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:taskman-rabbitmq-user,},Optional:nil,},},},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,},},RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:ip-10-213-59-169.us-west-2.compute.internal,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[]Sysctl{},WindowsOptions:nil,},ImagePullSecrets:[]LocalObjectReference{},Hostname:taskman-sandbox-0,Subdomain:taskman-sandbox,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[]Container{},AutomountServiceAccountToken:nil,Tolerations:[]Toleration{Toleration{Key:node.kubernetes.io/not-ready,Operator:Exists,Value:,Effect:NoExecute,TolerationSeconds:*300,},Toleration{Key:node.kubernetes.io/unreachable,Operator:Exists,Value:,Effect:NoExecute,TolerationSeconds:*300,},},HostAliases:[]HostAlias{},PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[]PodReadinessGate{},RuntimeClassName:nil,EnableServiceLinks:*true,PreemptionPolicy:nil,Overhead:ResourceList{},TopologySpreadConstraints:[]Topolo
Nov 13 13:44:23 ip-10-213-59-169 kubelet[17547]: I1113 13:44:23.377094 17547 prober.go:129] Readiness probe for "taskman-sandbox-0_sandbox(0615148c-1210-4cb5-850b-52992f9a7ac9):taskman-sandbox" succeeded
Here is what we noticed:
- The time difference stays unchanged if the pod is restarted (by k8s, after some liveness probes fail).
- If the pod is re-created, the timings change.
- Timings are random and “unique” per replica.
- The issue doesn’t come from time drift, as it’s observable as soon as the pod is created.
Environment:
- K8s 1.16.10
- Kops 1.16.3
- Debian Stretch
- Linux 4.9.0-13-amd64 #1 SMP Debian 4.9.228-1 (2020-07-05) x86_64 GNU/Linux
Thanks for the help and support.
I believe that this is expected behavior. I’ll report my findings, and we’ll see if someone who knows this better than I do can confirm.

InitialDelaySeconds only prevents a probe from running before that delay has passed:

https://github.com/kubernetes/kubernetes/blob/b2ecd1b3a3192fbbe2b9e348e095326f51dc43dd/pkg/kubelet/prober/worker.go#L226-L228

So at the very least you should be seeing that the probes don’t occur before that initial delay. What you’re probably seeing is a product of this:

https://github.com/kubernetes/kubernetes/blob/b2ecd1b3a3192fbbe2b9e348e095326f51dc43dd/pkg/kubelet/prober/worker.go#L128-L133

which indicates that the actual first probe time can be anywhere after InitialDelaySeconds and before InitialDelaySeconds + PeriodSeconds. Given the code comment, the idea behind this “random timing” is to avoid a flood of probes when a kubelet has just started/restarted; we’re basically adding some “jitter” here.

Again, I didn’t write this, but based on your description and what I’m seeing in the code, I believe what you’re seeing is expected behavior.
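To make the effect concrete, here is a minimal, self-contained Go sketch of a jittered probe loop in that spirit. This is illustrative only, not the actual worker.go code; the function name and values are made up:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// runProbeWorker is a toy version of a jittered probe loop. Each worker
// sleeps a random fraction of its period before it starts probing, then
// attempts a probe immediately and on every subsequent tick. Attempts that
// fall inside the initial delay window are skipped, so the first real probe
// lands somewhere in [initialDelay, initialDelay+period) after container start.
func runProbeWorker(name string, initialDelay, period time.Duration, containerStart time.Time) {
	time.Sleep(time.Duration(rand.Float64() * float64(period))) // the "jitter"

	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		if elapsed := time.Since(containerStart); elapsed >= initialDelay {
			fmt.Printf("%-9s probe at t=%v\n", name, elapsed.Round(time.Second))
		}
		<-ticker.C
	}
}

func main() {
	start := time.Now()
	// Each worker draws its own jitter, so the offset between the liveness
	// and readiness probes of the same container is effectively random.
	go runProbeWorker("liveness", 30*time.Second, 30*time.Second, start)
	go runProbeWorker("readiness", 15*time.Second, 30*time.Second, start)
	time.Sleep(2 * time.Minute)
}
```

In a loop like this the steady-state offset between the two probes ends up being an arbitrary value in [0, periodSeconds) rather than the fixed 15 seconds one might expect from the configuration, which is consistent with the ~6-second gap in the logs above.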
Also I believe this falls under sig/node.
I see three problems with predictable probe invocation:

1. initialDelaySeconds doesn’t work as the spec says: “Number of seconds after the container has started before probes are initiated.” Intuitively, this simply means the first invocation of the probe is containerStatuses.state.running.startedAt + initialDelaySeconds. The implementation is far from that, as shown in https://github.com/kubernetes/kubernetes/issues/96614#issuecomment-731654129.
2. The jitters given to the probes in the same container are different, as shown in https://github.com/kubernetes/kubernetes/issues/96614#issuecomment-775925747. https://github.com/kubernetes/kubernetes/pull/102064 is trying to solve it.
3. Since 1.21, the jitters are only added when the kubelet has recently started/restarted, as shown in https://github.com/kubernetes/kubernetes/issues/96614#issuecomment-1236258089.
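For problem 1, here is a minimal sketch of the gap between the spec-intuitive first-probe time and the window a jittered implementation can actually produce. This is illustrative only; the function names and the example timestamp are made up and this is not kubelet code:

```go
package main

import (
	"fmt"
	"time"
)

// specFirstProbe is the intuitive reading of the API spec: the first probe
// fires exactly initialDelaySeconds after the container started.
func specFirstProbe(startedAt time.Time, initialDelay time.Duration) time.Time {
	return startedAt.Add(initialDelay)
}

// jitteredFirstProbeWindow is what a jittered implementation can produce:
// anywhere from initialDelay up to (but not including) initialDelay+period
// after the container started, depending on the worker's random offset.
func jitteredFirstProbeWindow(startedAt time.Time, initialDelay, period time.Duration) (time.Time, time.Time) {
	return startedAt.Add(initialDelay), startedAt.Add(initialDelay + period)
}

func main() {
	// Example container start time (illustrative value), i.e. what you would
	// read from containerStatuses.state.running.startedAt.
	startedAt := time.Date(2020, 11, 11, 4, 51, 13, 0, time.UTC)

	fmt.Println("spec-intuitive first liveness probe:", specFirstProbe(startedAt, 30*time.Second))
	lo, hi := jitteredFirstProbeWindow(startedAt, 30*time.Second, 30*time.Second)
	fmt.Printf("implementation may fire anywhere in [%v, %v)\n", lo, hi)
}
```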