serving: Service gets scaled down as TimedOut before the configured revision timeout elapses?
In what area(s)?
/area serving
What version of Knative?
0.9
Expected Behavior
Deployed a service with the default revision timeout (300s); the revision should be given the full 300s to become ready. However, the revision gets marked as TimedOut after ~2m28s, causing the deployment to be scaled down to 0 before the pod has a chance to come up.
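For context, a rough sketch of what was deployed, reconstructed from the Revision spec dumped further down. The Service shape and name are assumptions for reproduction purposes only; the real objects here are generated by KFServing (note the serving.kubeflow.org/kfservice label and the empty serving.knative.dev/service label), so only the image, args, port, resources, timeout, and autoscaling annotations are taken from the Revision:

# Stand-alone approximation of the deployed workload (name is a placeholder;
# on 0.9 the exact API shape may differ slightly).
kubectl apply -n petrodeg-mds5-1 -f - <<EOF
apiVersion: serving.knative.dev/v1alpha1
kind: Service
metadata:
  name: petrodeg-ms5-default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/minScale: "1"
        autoscaling.knative.dev/maxScale: "1"
        autoscaling.knative.dev/target: "1"
        autoscaling.knative.dev/targetBurstCapacity: "0"
    spec:
      timeoutSeconds: 300
      containers:
        - name: user-container
          image: kcorer/jump
          args: ["http"]
          ports:
            - containerPort: 8888
          resources:
            requests:
              cpu: 400m
              memory: 400Mi
            limits:
              cpu: 400m
              memory: 400Mi
EOF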
Actual Behavior
NAME READY STATUS RESTARTS AGE
petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 0/3 Pending 0 105s
petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 0/3 Pending 0 106s
petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 0/3 Init:0/1 0 111s
petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 0/3 Init:0/1 0 2m28s
petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 0/3 Terminating 0 2m28s
petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 0/3 Terminating 0 2m28s
Events:
$ k get events -n petrodeg-mds5-1
LAST SEEN TYPE REASON OBJECT MESSAGE
16s Normal Synced azurekeyvaultsecret/mir-int-westus2-tenant-viennadroptestkey AzureKeyVaultSecret synced successfully
24m Warning FailedScheduling pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 0/14 nodes are available: 14 node(s) didn't match node selector, 3 Insufficient memory.
24m Normal NotTriggerScaleUp pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) didn't match node selector
23m Warning FailedScheduling pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 0/15 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 14 node(s) didn't match node selector, 3 Insufficient memory.
23m Normal Scheduled pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 Successfully assigned petrodeg-mds5-1/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 to k8s-petrodeg-mds5-1-42413502-vmss000001
23m Normal Pulling pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 Pulling image "docker.io/istio/proxy_init:1.3.0"
23m Normal Pulled pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 Successfully pulled image "docker.io/istio/proxy_init:1.3.0"
23m Normal Created pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 Created container istio-init
23m Normal Started pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 Started container istio-init
25m Normal SuccessfulCreate replicaset/petrodeg-ms5-default-v9rdc-deployment-766bf565cc Created pod: petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5
23m Normal SuccessfulDelete replicaset/petrodeg-ms5-default-v9rdc-deployment-766bf565cc Deleted pod: petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5
25m Normal ScalingReplicaSet deployment/petrodeg-ms5-default-v9rdc-deployment Scaled up replica set petrodeg-ms5-default-v9rdc-deployment-766bf565cc to 1
23m Normal ScalingReplicaSet deployment/petrodeg-ms5-default-v9rdc-deployment Scaled down replica set petrodeg-ms5-default-v9rdc-deployment-766bf565cc to 0
25m Normal Updated metric/petrodeg-ms5-default-v9rdc Successfully updated metric status petrodeg-mds5-1/petrodeg-ms5-default-v9rdc
23m Normal Updated serverlessservice/petrodeg-ms5-default-v9rdc Successfully updated ServerlessService "petrodeg-mds5-1/petrodeg-ms5-default-v9rdc"
23m Warning InternalError revision/petrodeg-ms5-default-v9rdc Operation cannot be fulfilled on deployments.apps "petrodeg-ms5-default-v9rdc-deployment": the object has been modified; please apply your changes to the latest version and try again
23m Warning UpdateFailed podautoscaler/petrodeg-ms5-default-v9rdc Failed to update status for PA "petrodeg-ms5-default-v9rdc": Operation cannot be fulfilled on podautoscalers.autoscaling.internal.knative.dev "petrodeg-ms5-default-v9rdc": the object has been modified; please apply your changes to the latest version and try again
25m Normal Created configuration/petrodeg-ms5-default Created Revision "petrodeg-ms5-default-v9rdc"
23m Warning LatestCreatedFailed configuration/petrodeg-ms5-default Latest created revision "petrodeg-ms5-default-v9rdc" has failed
Revision:
$ k describe revision petrodeg-ms5-default-v9rdc -n petrodeg-mds5-1
Name:          petrodeg-ms5-default-v9rdc
Namespace:     petrodeg-mds5-1
Labels:        serving.knative.dev/configuration=petrodeg-ms5-default
               serving.knative.dev/configurationGeneration=1
               serving.knative.dev/service=
               serving.kubeflow.org/kfservice=petrodeg-ms5
Annotations:   autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
               autoscaling.knative.dev/maxScale: 1
               autoscaling.knative.dev/minScale: 1
               autoscaling.knative.dev/target: 1
               autoscaling.knative.dev/targetBurstCapacity: 0
               networking.knative.dev/customExternalIPGatewayName: petrodeg-mds5-1/petrodeg-ms5-1
API Version:   serving.knative.dev/v1alpha1
Kind:          Revision
Metadata:
  Creation Timestamp:  2019-12-11T17:55:08Z
  Generate Name:       petrodeg-ms5-default-
  Generation:          1
  Owner References:
    API Version:           serving.knative.dev/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Configuration
    Name:                  petrodeg-ms5-default
    UID:                   5ed02123-1c3f-11ea-b469-000d3a067a44
  Resource Version:  28603988
  Self Link:         /apis/serving.knative.dev/v1alpha1/namespaces/petrodeg-mds5-1/revisions/petrodeg-ms5-default-v9rdc
  UID:               5ed5e000-1c3f-11ea-9e0d-000d3a067133
Spec:
  Container Concurrency:  0
  Containers:
    Args:
      http
    Image:  kcorer/jump
    Name:   user-container
    Ports:
      Container Port:  8888
      Protocol:        TCP
    Readiness Probe:
      Success Threshold:  1
      Tcp Socket:
        Port:  0
    Resources:
      Limits:
        Cpu:     400m
        Memory:  400Mi
      Requests:
        Cpu:     400m
        Memory:  400Mi
  Timeout Seconds:  300
Status:
  Conditions:
    Last Transition Time:  2019-12-11T17:57:37Z
    Message:               The target could not be activated.
    Reason:                TimedOut
    Severity:              Info
    Status:                False
    Type:                  Active
    Last Transition Time:  2019-12-11T17:55:09Z
    Reason:                Deploying
    Status:                Unknown
    Type:                  ContainerHealthy
    Last Transition Time:  2019-12-11T17:57:10Z
    Reason:                PodInitializing
    Status:                False
    Type:                  Ready
    Last Transition Time:  2019-12-11T17:57:10Z
    Reason:                PodInitializing
    Status:                False
    Type:                  ResourcesAvailable
  Image Digest:         index.docker.io/kcorer/jump@sha256:4bd27b8fcad2b575b4c30fd0e8c360375020dc4e28923e3d8540779c1f72748b
  Log URL:              http://localhost:8001/api/v1/namespaces/knative-monitoring/services/kibana-logging/proxy/app/kibana#/discover?_a=(query:(match:(kubernetes.labels.knative-dev%2FrevisionUID:(query:'5ed5e000-1c3f-11ea-9e0d-000d3a067133',type:phrase))))
  Observed Generation:  1
  Service Name:         petrodeg-ms5-default-v9rdc
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning InternalError 24m (x2 over 24m) revision-controller Operation cannot be fulfilled on deployments.apps "petrodeg-ms5-default-v9rdc-deployment": the object has been modified; please apply your changes to the latest version and try again
Steps to Reproduce the Problem
1. Create a KN service in a namespace that does not yet have any schedulable nodes (using taints/tolerations).
2. Wait a bit.
3. Remove a taint on a node to allow the pod to be scheduled (see the sketch below).
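A rough sketch of those steps; the node name and taint key/value are hypothetical, and the service is the one from the sketch under "Expected Behavior":

# Taint the node(s) backing the namespace so the revision pod stays Pending.
kubectl taint nodes k8s-petrodeg-mds5-1-vmss000001 dedicated=petrodeg:NoSchedule

# Deploy the service, then wait past ~2m30s; the revision gets marked
# TimedOut and the deployment is scaled down to 0.
sleep 180

# Remove the taint so the pod can finally be scheduled; by now the
# deployment has already been scaled back down.
kubectl taint nodes k8s-petrodeg-mds5-1-vmss000001 dedicated=petrodeg:NoSchedule-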
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 17 (7 by maintainers)
Commits related to this issue
- Add a new config-default parameter this will guide how long are we allowing the deployment to take before we consider it failed. See #6201 for details. Also make lots of various clean ups in the con... — committed to vagababov/serving by vagababov 4 years ago
- Add a new config-deployment parameter (#7649) * Add a new config-default parameter this will guide how long are we allowing the deployment to take before we consider it failed. See #6201 for details... — committed to knative/serving by vagababov 4 years ago
- Add deployment-config CM to the autoscaler. I thought this is going to be 2 line change, but it turned out to be quite big. This is for #6201 — committed to vagababov/serving by vagababov 4 years ago
- Add deployment-config CM to the autoscaler. (#7668) I thought this is going to be 2 line change, but it turned out to be quite big. This is for #6201 — committed to knative/serving by vagababov 4 years ago
@vagababov @mattmoor Is it OK to increase the constant to 300s? KFServing downloads ML models from cloud storage before starting up the server; for large models this can take 1-2 minutes, and we also need to account for ~30s of pod start-up time, so I think 5 minutes is reasonable. Thoughts?
https://github.com/knative/serving/blob/af63abe0d4aaf9ef9999dd46dc8e933d59b980b3/pkg/reconciler/revision/resources/constants.go#L36
@vagababov any thoughts on this? This is definitely a blocker that we hit so many times.
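For reference, the commits linked above made this deadline configurable via the config-deployment ConfigMap on later releases; it is the deployment progress deadline, which is separate from the revision's timeoutSeconds request timeout. A hedged sketch of how it can be raised (the key spelling has varied across releases, progressDeadline in some and progress-deadline in newer ones, so check the config-deployment ConfigMap shipped with your version):

# Sketch: raise the cluster-wide deployment progress deadline so slow-starting
# pods (e.g. large model downloads) are not marked TimedOut early.
# Assumes the newer dash-separated key; older releases used progressDeadline.
kubectl patch configmap/config-deployment \
  -n knative-serving \
  --type merge \
  -p '{"data":{"progress-deadline":"300s"}}'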