serving: Service gets scaled down as TimedOut before the configured revision timeout elapses?
In what area(s)?
/area serving
What version of Knative?
0.9
Expected Behavior
Deployed a service with the default revision timeout (300s); the revision should be given the full 300s to become ready. However, the revision gets marked as TimedOut after ~2m28s, causing the deployment to be scaled down to 0 before the pod has a chance to come up.
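For context, a rough sketch of what was deployed, reconstructed from the Revision spec dumped further down. The Service shape and name are assumptions for reproduction purposes only; the real objects here are generated by KFServing (note the serving.kubeflow.org/kfservice label and the empty serving.knative.dev/service label), so only the image, args, port, resources, timeout, and autoscaling annotations are taken from the Revision:

# Stand-alone approximation of the deployed workload (name is a placeholder;
# on 0.9 the exact API shape may differ slightly).
kubectl apply -n petrodeg-mds5-1 -f - <<EOF
apiVersion: serving.knative.dev/v1alpha1
kind: Service
metadata:
  name: petrodeg-ms5-default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/minScale: "1"
        autoscaling.knative.dev/maxScale: "1"
        autoscaling.knative.dev/target: "1"
        autoscaling.knative.dev/targetBurstCapacity: "0"
    spec:
      timeoutSeconds: 300
      containers:
        - name: user-container
          image: kcorer/jump
          args: ["http"]
          ports:
            - containerPort: 8888
          resources:
            requests:
              cpu: 400m
              memory: 400Mi
            limits:
              cpu: 400m
              memory: 400Mi
EOF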
Actual Behavior
NAME READY STATUS RESTARTS AGE
petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 0/3 Pending 0 105s
petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 0/3 Pending 0 106s
petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 0/3 Init:0/1 0 111s
petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 0/3 Init:0/1 0 2m28s
petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 0/3 Terminating 0 2m28s
petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 0/3 Terminating 0 2m28s
Events:
$ k get events -n petrodeg-mds5-1
LAST SEEN TYPE REASON OBJECT MESSAGE
16s Normal Synced azurekeyvaultsecret/mir-int-westus2-tenant-viennadroptestkey AzureKeyVaultSecret synced successfully
24m Warning FailedScheduling pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 0/14 nodes are available: 14 node(s) didn't match node selector, 3 Insufficient memory.
24m Normal NotTriggerScaleUp pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) didn't match node selector
23m Warning FailedScheduling pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 0/15 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 14 node(s) didn't match node selector, 3 Insufficient memory.
23m Normal Scheduled pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 Successfully assigned petrodeg-mds5-1/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 to k8s-petrodeg-mds5-1-42413502-vmss000001
23m Normal Pulling pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 Pulling image "docker.io/istio/proxy_init:1.3.0"
23m Normal Pulled pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 Successfully pulled image "docker.io/istio/proxy_init:1.3.0"
23m Normal Created pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 Created container istio-init
23m Normal Started pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 Started container istio-init
25m Normal SuccessfulCreate replicaset/petrodeg-ms5-default-v9rdc-deployment-766bf565cc Created pod: petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5
23m Normal SuccessfulDelete replicaset/petrodeg-ms5-default-v9rdc-deployment-766bf565cc Deleted pod: petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5
25m Normal ScalingReplicaSet deployment/petrodeg-ms5-default-v9rdc-deployment Scaled up replica set petrodeg-ms5-default-v9rdc-deployment-766bf565cc to 1
23m Normal ScalingReplicaSet deployment/petrodeg-ms5-default-v9rdc-deployment Scaled down replica set petrodeg-ms5-default-v9rdc-deployment-766bf565cc to 0
25m Normal Updated metric/petrodeg-ms5-default-v9rdc Successfully updated metric status petrodeg-mds5-1/petrodeg-ms5-default-v9rdc
23m Normal Updated serverlessservice/petrodeg-ms5-default-v9rdc Successfully updated ServerlessService "petrodeg-mds5-1/petrodeg-ms5-default-v9rdc"
23m Warning InternalError revision/petrodeg-ms5-default-v9rdc Operation cannot be fulfilled on deployments.apps "petrodeg-ms5-default-v9rdc-deployment": the object has been modified; please apply your changes to the latest version and try again
23m Warning UpdateFailed podautoscaler/petrodeg-ms5-default-v9rdc Failed to update status for PA "petrodeg-ms5-default-v9rdc": Operation cannot be fulfilled on podautoscalers.autoscaling.internal.knative.dev "petrodeg-ms5-default-v9rdc": the object has been modified; please apply your changes to the latest version and try again
25m Normal Created configuration/petrodeg-ms5-default Created Revision "petrodeg-ms5-default-v9rdc"
23m Warning LatestCreatedFailed configuration/petrodeg-ms5-default Latest created revision "petrodeg-ms5-default-v9rdc" has failed
Revision:
$ k describe revision petrodeg-ms5-default-v9rdc -n petrodeg-mds5-1
Name:          petrodeg-ms5-default-v9rdc
Namespace:     petrodeg-mds5-1
Labels:        serving.knative.dev/configuration=petrodeg-ms5-default
               serving.knative.dev/configurationGeneration=1
               serving.knative.dev/service=
               serving.kubeflow.org/kfservice=petrodeg-ms5
Annotations:   autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
               autoscaling.knative.dev/maxScale: 1
               autoscaling.knative.dev/minScale: 1
               autoscaling.knative.dev/target: 1
               autoscaling.knative.dev/targetBurstCapacity: 0
               networking.knative.dev/customExternalIPGatewayName: petrodeg-mds5-1/petrodeg-ms5-1
API Version:   serving.knative.dev/v1alpha1
Kind:          Revision
Metadata:
  Creation Timestamp:  2019-12-11T17:55:08Z
  Generate Name:       petrodeg-ms5-default-
  Generation:          1
  Owner References:
    API Version:           serving.knative.dev/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Configuration
    Name:                  petrodeg-ms5-default
    UID:                   5ed02123-1c3f-11ea-b469-000d3a067a44
  Resource Version:  28603988
  Self Link:         /apis/serving.knative.dev/v1alpha1/namespaces/petrodeg-mds5-1/revisions/petrodeg-ms5-default-v9rdc
  UID:               5ed5e000-1c3f-11ea-9e0d-000d3a067133
Spec:
  Container Concurrency:  0
  Containers:
    Args:
      http
    Image:  kcorer/jump
    Name:   user-container
    Ports:
      Container Port:  8888
      Protocol:        TCP
    Readiness Probe:
      Success Threshold:  1
      Tcp Socket:
        Port:  0
    Resources:
      Limits:
        Cpu:     400m
        Memory:  400Mi
      Requests:
        Cpu:     400m
        Memory:  400Mi
  Timeout Seconds:  300
Status:
  Conditions:
    Last Transition Time:  2019-12-11T17:57:37Z
    Message:               The target could not be activated.
    Reason:                TimedOut
    Severity:              Info
    Status:                False
    Type:                  Active
    Last Transition Time:  2019-12-11T17:55:09Z
    Reason:                Deploying
    Status:                Unknown
    Type:                  ContainerHealthy
    Last Transition Time:  2019-12-11T17:57:10Z
    Reason:                PodInitializing
    Status:                False
    Type:                  Ready
    Last Transition Time:  2019-12-11T17:57:10Z
    Reason:                PodInitializing
    Status:                False
    Type:                  ResourcesAvailable
  Image Digest:         index.docker.io/kcorer/jump@sha256:4bd27b8fcad2b575b4c30fd0e8c360375020dc4e28923e3d8540779c1f72748b
  Log URL:              http://localhost:8001/api/v1/namespaces/knative-monitoring/services/kibana-logging/proxy/app/kibana#/discover?_a=(query:(match:(kubernetes.labels.knative-dev%2FrevisionUID:(query:'5ed5e000-1c3f-11ea-9e0d-000d3a067133',type:phrase))))
  Observed Generation:  1
  Service Name:         petrodeg-ms5-default-v9rdc
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning InternalError 24m (x2 over 24m) revision-controller Operation cannot be fulfilled on deployments.apps "petrodeg-ms5-default-v9rdc-deployment": the object has been modified; please apply your changes to the latest version and try again
Steps to Reproduce the Problem
1. Create a KN service in a namespace that does not yet have any schedulable nodes (using taints/tolerations).
2. Wait a bit.
3. Remove a taint on a node to allow the pod to be scheduled (see the sketch below).
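A rough sketch of those steps; the node name and taint key/value are hypothetical, and the service is the one from the sketch under "Expected Behavior":

# Taint the node(s) backing the namespace so the revision pod stays Pending.
kubectl taint nodes k8s-petrodeg-mds5-1-vmss000001 dedicated=petrodeg:NoSchedule

# Deploy the service, then wait past ~2m30s; the revision gets marked
# TimedOut and the deployment is scaled down to 0.
sleep 180

# Remove the taint so the pod can finally be scheduled; by now the
# deployment has already been scaled back down.
kubectl taint nodes k8s-petrodeg-mds5-1-vmss000001 dedicated=petrodeg:NoSchedule-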
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 17 (7 by maintainers)
Commits related to this issue
- Add a new config-default parameter this will guide how long are we allowing the deployment to take before we consider it failed. See #6201 for details. Also make lots of various clean ups in the con... — committed to vagababov/serving by vagababov 4 years ago
- Add a new config-deployment parameter (#7649) * Add a new config-default parameter this will guide how long are we allowing the deployment to take before we consider it failed. See #6201 for details... — committed to knative/serving by vagababov 4 years ago
- Add deployment-config CM to the autoscaler. I thought this is going to be 2 line change, but it turned out to be quite big. This is for #6201 — committed to vagababov/serving by vagababov 4 years ago
- Add deployment-config CM to the autoscaler. (#7668) I thought this is going to be 2 line change, but it turned out to be quite big. This is for #6201 — committed to knative/serving by vagababov 4 years ago
@vagababov @mattmoor Is it OK to increase the constant to 300s? KFServing downloads ML models from cloud storage before starting up the server; for large models this can take 1-2 minutes, and we also need to account for ~30s of pod start-up time, so I think 5 minutes is reasonable. Thoughts?
https://github.com/knative/serving/blob/af63abe0d4aaf9ef9999dd46dc8e933d59b980b3/pkg/reconciler/revision/resources/constants.go#L36
@vagababov any thoughts on this? This is definitely a blocker that we hit so many times.
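For reference, the commits linked above made this deadline configurable via the config-deployment ConfigMap on later releases; it is the deployment progress deadline, which is separate from the revision's timeoutSeconds request timeout. A hedged sketch of how it can be raised (the key spelling has varied across releases, progressDeadline in some and progress-deadline in newer ones, so check the config-deployment ConfigMap shipped with your version):

# Sketch: raise the cluster-wide deployment progress deadline so slow-starting
# pods (e.g. large model downloads) are not marked TimedOut early.
# Assumes the newer dash-separated key; older releases used progressDeadline.
kubectl patch configmap/config-deployment \
  -n knative-serving \
  --type merge \
  -p '{"data":{"progress-deadline":"300s"}}'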