serving: Activator pod failing with `CrashLoopBackOff` on GKE

In what area(s)?

/area API
/area autoscale

What version of Knative?

0.5.x

Expected Behavior

All pods in the knative-serving namespace on GKE are running.

Actual Behavior

The activator-service doesn’t have minimum availability.

➜  ~ kubectl get pods --namespace knative-serving
NAME                          READY   STATUS    RESTARTS   AGE
activator-7ffc4764bd-d9mgx    1/2     Running   7          8m
autoscaler-6b67f6ffc7-qxt8m   2/2     Running   0          11m
controller-b54c5f66-tp9hz     1/1     Running   0          11m
webhook-5dff4764d-d2rss       1/1     Running   0          11m

Describing the pod shows that the liveness and readiness probes are failing.

➜  ~ kubectl describe pod activator-7ffc4764bd-d9mgx -n knative-serving
Name:               activator-7ffc4764bd-d9mgx
Namespace:          knative-serving
Priority:           0
PriorityClassName:  <none>
Node:               gke-euwe2-preemptible-2-6fb4f449-g2t5/10.31.252.108
Start Time:         Tue, 23 Apr 2019 11:10:38 +0200
Labels:             app=activator
                    pod-template-hash=3997032068
                    role=activator
                    serving.knative.dev/release=devel
Annotations:        cni.projectcalico.org/podIP: 10.30.7.38/32
                    sidecar.istio.io/inject: true
                    sidecar.istio.io/status:
                      {"version":"bdaecb667bc51926819c154cd5e5f8922a3c562816a4f163d67ad95223aa7612","initContainers":["istio-init"],"containers":["istio-proxy"]...
Status:             Running
IP:                 10.30.7.38
Controlled By:      ReplicaSet/activator-7ffc4764bd
Init Containers:
  istio-init:
    Container ID:  docker://c45f7ae9c692ab5231ad0d7c3418ed47249f66117efb345336054dfe9b42087e
    Image:         gcr.io/istio-release/proxy_init:release-1.1-latest-daily
    Image ID:      docker-pullable://gcr.io/istio-release/proxy_init@sha256:8622455bb6825341b83ed191233357286bbd6f88ba7e28949d8056ec7d906842
    Port:          <none>
    Host Port:     <none>
    Args:
      -p
      15001
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x

      -b
      8080,8081,9090
      -d
      15020
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 23 Apr 2019 11:10:39 +0200
      Finished:     Tue, 23 Apr 2019 11:10:40 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  50Mi
    Requests:
      cpu:        10m
      memory:     10Mi
    Environment:  <none>
    Mounts:       <none>
Containers:
  activator:
    Container ID:  docker://87bfebc2fc8dff0b1d2b29a4e9b9d1499e9a338c4b40675c637e8580ceb93934
    Image:         gcr.io/knative-releases/github.com/knative/serving/cmd/activator@sha256:bc61131f2d3f48fb97a1a98a3be8a34bb65905428dbbedeabe24eb949d6b53bc
    Image ID:      docker-pullable://gcr.io/knative-releases/github.com/knative/serving/cmd/activator@sha256:bc61131f2d3f48fb97a1a98a3be8a34bb65905428dbbedeabe24eb949d6b53bc
    Ports:         8080/TCP, 8081/TCP, 9090/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      -logtostderr=false
      -stderrthreshold=FATAL
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 23 Apr 2019 11:15:38 +0200
      Finished:     Tue, 23 Apr 2019 11:16:05 +0200
    Ready:          False
    Restart Count:  6
    Limits:
      cpu:     200m
      memory:  600Mi
    Requests:
      cpu:      20m
      memory:   60Mi
    Liveness:   http-get http://:8080/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:8080/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:             activator-7ffc4764bd-d9mgx (v1:metadata.name)
      SYSTEM_NAMESPACE:     knative-serving (v1:metadata.namespace)
      CONFIG_LOGGING_NAME:  config-logging
    Mounts:
      /etc/config-logging from config-logging (rw)
      /etc/config-observability from config-observability (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from controller-token-6gbsg (ro)
  istio-proxy:
    Container ID:  docker://787997a397c65d0e01476ba029f6fb3f5df49adfa33e1152e5031e6adbfcf16a
    Image:         gcr.io/istio-release/proxyv2:release-1.1-latest-daily
    Image ID:      docker-pullable://gcr.io/istio-release/proxyv2@sha256:23483db628fcf57ce6131c5b86decd97bf2f9a1aa67b13b5e23b05c8f3cf1e33
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
      proxy
      sidecar
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --configPath
      /etc/istio/proxy
      --binaryPath
      /usr/local/bin/envoy
      --serviceCluster
      activator.$(POD_NAMESPACE)
      --drainDuration
      45s
      --parentShutdownDuration
      1m0s
      --discoveryAddress
      istio-pilot.istio-system:15011
      --zipkinAddress
      zipkin.istio-system:9411
      --connectTimeout
      10s
      --proxyAdminPort
      15000
      --concurrency
      2
      --controlPlaneAuthPolicy
      MUTUAL_TLS
      --statusPort
      15020
      --applicationPorts
      8080,8081,9090
    State:          Running
      Started:      Tue, 23 Apr 2019 11:10:41 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  128Mi
    Requests:
      cpu:      100m
      memory:   128Mi
    Readiness:  http-get http://:15020/healthz/ready delay=1s timeout=1s period=2s #success=1 #failure=30
    Environment:
      POD_NAME:                      activator-7ffc4764bd-d9mgx (v1:metadata.name)
      POD_NAMESPACE:                 knative-serving (v1:metadata.namespace)
      INSTANCE_IP:                    (v1:status.podIP)
      ISTIO_META_POD_NAME:           activator-7ffc4764bd-d9mgx (v1:metadata.name)
      ISTIO_META_CONFIG_NAMESPACE:   knative-serving (v1:metadata.namespace)
      ISTIO_META_INTERCEPTION_MODE:  REDIRECT
      ISTIO_METAJSON_ANNOTATIONS:    {"sidecar.istio.io/inject":"true"}

      ISTIO_METAJSON_LABELS:         {"app":"activator","pod-template-hash":"3997032068","role":"activator","serving.knative.dev/release":"devel"}

    Mounts:
      /etc/certs/ from istio-certs (ro)
      /etc/istio/proxy from istio-envoy (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from controller-token-6gbsg (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config-logging:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      config-logging
    Optional:  false
  config-observability:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      config-observability
    Optional:  false
  controller-token-6gbsg:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  controller-token-6gbsg
    Optional:    false
  istio-envoy:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:  Memory
  istio-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  istio.controller
    Optional:    true
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                    From                                            Message
  ----     ------            ----                   ----                                            -------
  Normal   Scheduled         8m1s                   default-scheduler                               Successfully assigned knative-serving/activator-7ffc4764bd-d9mgx to gke-euwe2-preemptible-2-6fb4f449-g2t5
  Normal   Pulled            8m                     kubelet, gke-euwe2-preemptible-2-6fb4f449-g2t5  Container image "gcr.io/istio-release/proxy_init:release-1.1-latest-daily" already present on machine
  Normal   Created           8m                     kubelet, gke-euwe2-preemptible-2-6fb4f449-g2t5  Created container
  Normal   Started           8m                     kubelet, gke-euwe2-preemptible-2-6fb4f449-g2t5  Started container
  Normal   Pulling           7m59s                  kubelet, gke-euwe2-preemptible-2-6fb4f449-g2t5  pulling image "gcr.io/knative-releases/github.com/knative/serving/cmd/activator@sha256:bc61131f2d3f48fb97a1a98a3be8a34bb65905428dbbedeabe24eb949d6b53bc"
  Normal   Started           7m58s                  kubelet, gke-euwe2-preemptible-2-6fb4f449-g2t5  Started container
  Normal   Pulled            7m58s                  kubelet, gke-euwe2-preemptible-2-6fb4f449-g2t5  Successfully pulled image "gcr.io/knative-releases/github.com/knative/serving/cmd/activator@sha256:bc61131f2d3f48fb97a1a98a3be8a34bb65905428dbbedeabe24eb949d6b53bc"
  Normal   Created           7m58s                  kubelet, gke-euwe2-preemptible-2-6fb4f449-g2t5  Created container
  Normal   Started           7m58s                  kubelet, gke-euwe2-preemptible-2-6fb4f449-g2t5  Started container
  Normal   Pulled            7m58s                  kubelet, gke-euwe2-preemptible-2-6fb4f449-g2t5  Container image "gcr.io/istio-release/proxyv2:release-1.1-latest-daily" already present on machine
  Warning  Unhealthy         7m54s                  kubelet, gke-euwe2-preemptible-2-6fb4f449-g2t5  Liveness probe failed: Get http://10.30.7.38:8080/healthz: read tcp 10.31.252.108:44088->10.30.7.38:8080: read: connection reset by peer
  Warning  Unhealthy         7m47s                  kubelet, gke-euwe2-preemptible-2-6fb4f449-g2t5  Readiness probe failed: Get http://10.30.7.38:8080/healthz: read tcp 10.31.252.108:44140->10.30.7.38:8080: read: connection reset by peer
  Warning  Unhealthy         7m44s                  kubelet, gke-euwe2-preemptible-2-6fb4f449-g2t5  Liveness probe failed: Get http://10.30.7.38:8080/healthz: read tcp 10.31.252.108:44172->10.30.7.38:8080: read: connection reset by peer
  Warning  Unhealthy         7m37s                  kubelet, gke-euwe2-preemptible-2-6fb4f449-g2t5  Readiness probe failed: Get http://10.30.7.38:8080/healthz: read tcp 10.31.252.108:44230->10.30.7.38:8080: read: connection reset by peer
  Warning  Unhealthy         7m34s                  kubelet, gke-euwe2-preemptible-2-6fb4f449-g2t5  Liveness probe failed: Get http://10.30.7.38:8080/healthz: read tcp 10.31.252.108:44262->10.30.7.38:8080: read: connection reset by peer
  Normal   Killing           7m34s                  kubelet, gke-euwe2-preemptible-2-6fb4f449-g2t5  Killing container with id docker://activator:Container failed liveness probe.. Container will be killed and recreated.
  Normal   Pulled            7m34s                  kubelet, gke-euwe2-preemptible-2-6fb4f449-g2t5  Container image "gcr.io/knative-releases/github.com/knative/serving/cmd/activator@sha256:bc61131f2d3f48fb97a1a98a3be8a34bb65905428dbbedeabe24eb949d6b53bc" already present on machine
  Normal   Created           7m33s (x2 over 7m58s)  kubelet, gke-euwe2-preemptible-2-6fb4f449-g2t5  Created container
  Warning  DNSConfigForming  3m1s (x32 over 8m1s)   kubelet, gke-euwe2-preemptible-2-6fb4f449-g2t5  Search Line limits were exceeded, some search paths have been omitted, the applied search line is: knative-serving.svc.cluster.local svc.cluster.local cluster.local c.mb-k8s-tst.internal google.internal global

Steps to Reproduce the Problem

  1. Install Knative on an existing cluster that already has an Istio installation (version > 1.1):
kubectl apply --selector knative.dev/crd-install=true \
--filename https://github.com/knative/serving/releases/download/v0.5.0/serving.yaml \
--filename https://github.com/knative/build/releases/download/v0.5.0/build.yaml \
--filename https://github.com/knative/eventing/releases/download/v0.5.0/release.yaml \
--filename https://github.com/knative/eventing-sources/releases/download/v0.5.0/eventing-sources.yaml \
--filename https://github.com/knative/serving/releases/download/v0.5.0/monitoring.yaml \
--filename https://raw.githubusercontent.com/knative/serving/v0.5.0/third_party/config/build/clusterrole.yaml

kubectl apply --filename https://github.com/knative/serving/releases/download/v0.5.0/serving.yaml \
--filename https://github.com/knative/build/releases/download/v0.5.0/build.yaml \
--filename https://github.com/knative/eventing/releases/download/v0.5.0/release.yaml \
--filename https://github.com/knative/eventing-sources/releases/download/v0.5.0/eventing-sources.yaml \
--filename https://github.com/knative/serving/releases/download/v0.5.0/monitoring.yaml \
--filename https://raw.githubusercontent.com/knative/serving/v0.5.0/third_party/config/build/clusterrole.yaml
  2. The activator service in knative-serving fails to reach minimum availability.

Additional info

Logs from the affected pod don’t show many errors, only that the pod gets restarted several times.

➜  ~ kubectl logs activator-7ffc4764bd-d9mgx -n knative-serving -c activator
{"level":"info","ts":"2019-04-23T09:24:56.125Z","caller":"logging/config.go:96","msg":"Successfully created the logger.","knative.dev/jsonconfig":"{\n  \"level\": \"info\",\n  \"development\": false,\n  \"outputPaths\": [\"stdout\"],\n  \"errorOutputPaths\": [\"stderr\"],\n  \"encoding\": \"json\",\n  \"encoderConfig\": {\n    \"timeKey\": \"ts\",\n    \"levelKey\": \"level\",\n    \"nameKey\": \"logger\",\n    \"callerKey\": \"caller\",\n    \"messageKey\": \"msg\",\n    \"stacktraceKey\": \"stacktrace\",\n    \"lineEnding\": \"\",\n    \"levelEncoder\": \"\",\n    \"timeEncoder\": \"iso8601\",\n    \"durationEncoder\": \"\",\n    \"callerEncoder\": \"\"\n  }\n}"}
{"level":"info","ts":"2019-04-23T09:24:56.126Z","caller":"logging/config.go:97","msg":"Logging level set to info"}
{"level":"warn","ts":"2019-04-23T09:24:56.126Z","caller":"logging/config.go:65","msg":"Fetch GitHub commit ID from kodata failed: \"ref: refs/heads/upstream/release-0.5\" is not a valid GitHub commit ID"}
{"level":"info","ts":"2019-04-23T09:24:56.126Z","logger":"activator","caller":"activator/main.go:134","msg":"Starting the knative activator","knative.dev/controller":"activator"}
{"level":"info","ts":"2019-04-23T09:24:56.223Z","logger":"activator","caller":"activator/main.go:183","msg":"Waiting for informer caches to sync","knative.dev/controller":"activator"}
{"level":"info","ts":"2019-04-23T09:24:56.524Z","logger":"activator","caller":"activator/main.go:246","msg":"Connecting to autoscaler atws://autoscaler.knative-serving.svc.cluster.local:8080","knative.dev/controller":"activator"}
{"level":"info","ts":"2019-04-23T09:24:56.524Z","logger":"activator","caller":"util/env.go:33","msg":"POD_NAME=activator-7ffc4764bd-d9mgx","knative.dev/controller":"activator"}
{"level":"info","ts":"2019-04-23T09:24:56.524Z","logger":"activator","caller":"websocket/connection.go:128","msg":"Connecting to \"ws://autoscaler.knative-serving.svc.cluster.local:8080\"","knative.dev/controller":"activator"}
{"level":"info","ts":"2019-04-23T09:24:56.608Z","logger":"activator","caller":"metrics/prometheus_exporter.go:37","msg":"Created Opencensus Prometheus exporter with config: &{knative.dev/serving activator prometheus 5000000000 9090  false false  }. Start the server for Prometheus exporter.","knative.dev/controller":"activator"}
{"level":"info","ts":"2019-04-23T09:24:56.608Z","logger":"activator","caller":"metrics/config.go:236","msg":"Successfully updated the metrics exporter; old config: <nil>; new config &{knative.dev/serving activator prometheus 5000000000 9090  false false  }","knative.dev/controller":"activator"}
{"level":"info","ts":"2019-04-23T09:24:56.615Z","logger":"activator","caller":"websocket/connection.go:133","msg":"Connected to \"ws://autoscaler.knative-serving.svc.cluster.local:8080\"","knative.dev/controller":"activator"}

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 25 (13 by maintainers)

Most upvoted comments

@kbruner I used your workaround for my cluster; the activator did not CrashLoopBackOff, but its logs still show errors:

kubectl logs -n knative-serving activator-578fd6567f-g2j85  activator
.....
{"level":"error","ts":"2019-06-12T01:52:27.026Z","logger":"activator","caller":"websocket/connection.go:158","msg":"Failed to send ping message","knative.dev/controller":"activator","error":"connection has not yet been established","stacktrace":"github.com/knative/serving/vendor/github.com/knative/pkg/websocket.NewDurableConnection.func3\n\t/go/src/github.com/knative/serving/vendor/github.com/knative/pkg/websocket/connection.go:158"}
{"level":"error","ts":"2019-06-12T01:52:30.332Z","logger":"activator","caller":"websocket/connection.go:158","msg":"Failed to send ping message","knative.dev/controller":"activator","error":"connection has not yet been established","stacktrace":"github.com/knative/serving/vendor/github.com/knative/pkg/websocket.NewDurableConnection.func3\n\t/go/src/github.com/knative/serving/vendor/github.com/knative/pkg/websocket/connection.go:158"}

Do you face this problem?

It looks like recent commits in this repo are attempting to address the probe issues on Istio. For those who need knative-serving 0.5.2 to work now, though:

Having the same issue on GKE, knative-serving 0.5.2, Istio 1.1.6. Since mTLS enforcement is an Istio best practice and likely already enabled on many clusters that run Istio for security reasons, and because Istio’s own recommended options for making HTTP health-check probes work with mTLS are very much not one-size-fits-all, this will likely bite a lot of people.

Additionally, activator tries to make network connections to github.com on start. In Istio clusters where cluster egress is whitelist-only, which was the default before Istio 1.1, that’s not possible unless the cluster has an appropriate ServiceEntry resource (also a VirtualService resource in Istio versions before 1.1). This does not seem to break anything when running production release images, though.
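For reference, a minimal ServiceEntry for that egress might look like the following sketch (the resource name and namespace are illustrative, and this assumes Istio 1.1-style networking, so treat it as a starting point rather than a tested manifest):

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: github-egress        # illustrative name
  namespace: knative-serving
spec:
  hosts:
  - github.com
  location: MESH_EXTERNAL
  ports:
  - number: 443
    name: https
    protocol: HTTPS
  resolution: DNS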

I would also recommend giving the readinessProbe a longer timeout. There’s a race condition on pod start with the istio-proxy sidecar container. Ingress won’t work until the proxy has started accepting network connections, but it needs to pull its configuration from istio-pilot before it will start listening. If the configuration is large or istio-pilot is slow, that can conceivably take longer than 30 seconds.
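As a sketch of that (assuming the activator container is at index 0 in the pod spec; the values here are illustrative, not tuned recommendations), a JSON patch against the deployment might look like:

kubectl patch deployment activator -n knative-serving --type=json -p '[
  {"op": "add", "path": "/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds", "value": 30},
  {"op": "add", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 10}
]'

Note that in JSON Patch, "add" also overwrites a field that already exists, which is convenient here since delay=0s may mean the field is absent from the spec.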

Pod egress is also not possible with some Istio configurations until the istio-proxy container is live and healthy. This means that when activator tries to pull from github.com immediately on process start and without retrying on initial failures, there’s a good chance the network connections will fail even if egress is allowed. Frankly, there’s no good workaround for that. In containers that have nc or curl installed, you can use that to “wait for” the istio-proxy port, or if /bin/sleep is present, you can wait before starting the activator command (after bumping up the readinessProbe timeout), but the activator container image doesn’t even have a shell installed. However, that doesn’t seem to break anything that I can tell with the release images.
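For images that do have a shell and curl, the “wait for istio-proxy” pattern looks roughly like this (15020 is the sidecar’s status port, matching the readiness probe shown above; /path/to/your-app is a placeholder for the real entrypoint):

command: ["/bin/sh", "-c"]
args:
- |
  # Block until the istio-proxy status endpoint reports ready, then exec the real process.
  until curl -fsS http://localhost:15020/healthz/ready > /dev/null; do sleep 1; done
  exec /path/to/your-app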

Here’s my workaround for the probe issues with activator, which might only work if you have global.disablePolicyChecks=false (true is the default in Istio 1.1; I believe this option was added in 1.1 and previously, policy checks were always on, but I’m not 100% certain). This policy disables the mTLS requirement for activator-service, so the istio-proxy sidecar will accept peer connections with or without mTLS certificates. Don’t do this in production if you have strict zero-trust network requirements:

apiVersion: authentication.istio.io/v1alpha1
kind: Policy
metadata:
  name: activator-service
  namespace: knative-serving
spec:
  targets:
  - name: activator-service
  peers:
  - mtls:
      mode: PERMISSIVE
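Apply it like any other manifest (the filename is arbitrary):

kubectl apply -f activator-service-permissive.yaml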

If you need policy checks to stay disabled, an unpleasant short-term workaround would be to delete the livenessProbe and readinessProbe from the activator deployment.
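For example, that could be done with a JSON patch (again assuming the activator container is at index 0 in the pod spec):

kubectl patch deployment activator -n knative-serving --type=json -p '[
  {"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"},
  {"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}
]'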