serving: Activator pod `CrashLoopBackOff` if tracing configmap is malformed on start

In what area(s)?

/area monitoring

What version of Knative?

0.6.0 HEAD

Expected Behavior

If the user provides a malformed `config-tracing` ConfigMap, e.g. sets `enable` to `true` without configuring `zipkin-endpoint`, or sets `sample-rate` to a value that is not a valid float, the activator should disable tracing and continue, or fall back to a default value. Tracing is an observability feature, and misconfiguring it shouldn't cause the activator to stop working.
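A minimal sketch of the kind of lenient parsing this implies is below. The package, function, and field names are hypothetical illustrations, not the actual knative/pkg tracing config API; the point is that invalid or incomplete values degrade to a disabled/default configuration instead of returning an error that kills the process.

```go
// Hypothetical lenient parser for the config-tracing ConfigMap.
// Instead of returning an error on bad input, it logs and falls back
// to safe defaults so the activator can keep serving traffic.
package tracingconfig

import (
	"log"
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// Config is a sketch of the parsed tracing settings.
type Config struct {
	Enable         bool
	ZipkinEndpoint string
	SampleRate     float64
}

func defaultConfig() *Config {
	return &Config{Enable: false, SampleRate: 0.1}
}

// NewConfigFromConfigMap never fails: malformed values degrade to defaults.
func NewConfigFromConfigMap(cm *corev1.ConfigMap) *Config {
	cfg := defaultConfig()
	if cm == nil {
		return cfg
	}
	if v, ok := cm.Data["enable"]; ok {
		enable, err := strconv.ParseBool(v)
		if err != nil {
			log.Printf("invalid enable %q, tracing disabled: %v", v, err)
			return defaultConfig()
		}
		cfg.Enable = enable
	}
	cfg.ZipkinEndpoint = cm.Data["zipkin-endpoint"]
	if cfg.Enable && cfg.ZipkinEndpoint == "" {
		log.Print("enable is true but zipkin-endpoint is empty, tracing disabled")
		cfg.Enable = false
	}
	if v, ok := cm.Data["sample-rate"]; ok {
		rate, err := strconv.ParseFloat(v, 64)
		if err != nil || rate < 0 || rate > 1 {
			log.Printf("invalid sample-rate %q, using default %v", v, cfg.SampleRate)
		} else {
			cfg.SampleRate = rate
		}
	}
	return cfg
}
```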

Actual Behavior

The activator goes into `CrashLoopBackOff` and keeps restarting when a bad `config-tracing` ConfigMap is applied.

Knative Eventing has a similar issue, #1417. Eventing uses the same tracing library, so both issues share the same root cause.

Steps to Reproduce the Problem

  1. Install Knative Serving.
  2. Create a bad `config-tracing` ConfigMap and apply it, e.g.

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-tracing
  namespace: knative-serving
data:
  enable: "true"

  3. Delete the activator pod to force a restart, so that it picks up the new ConfigMap.
  4. The newly started activator pod goes into `CrashLoopBackOff`.

$ k get pods -n knative-serving
NAME                             READY     STATUS    RESTARTS   AGE
activator-7587c7475b-lwbw2       1/2       Running   2          27s
autoscaler-5bf6cfd9bc-b9fwc      2/2       Running   0          9m16s
controller-dc64767cf-k55wj       1/1       Running   0          9m
networking-istio-fc9c659-zj7sf   1/1       Running   0          8m59s
webhook-5fdcd4499d-jkvpv         1/1       Running   0          8m59s

$ k get pods activator-7587c7475b-lwbw2 -n knative-serving
NAME                         READY     STATUS             RESTARTS   AGE
activator-7587c7475b-lwbw2   1/2       CrashLoopBackOff   2          52s

$ k get pods activator-7587c7475b-lwbw2 -n knative-serving -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: ibm-privileged-psp
    sidecar.istio.io/inject: "true"
    sidecar.istio.io/status: '{"version":"e08692ac44064480d18c557aa7dcaf719ad65ed6225e8937d5bff806605a1cef","initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-certs"],"imagePullSecrets":null}'
  creationTimestamp: 2019-07-03T14:03:52Z
  generateName: activator-7587c7475b-
  labels:
    app: activator
    pod-template-hash: 7587c7475b
    role: activator
    serving.knative.dev/release: v0.6.0
  name: activator-7587c7475b-lwbw2
  namespace: knative-serving
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: activator-7587c7475b
    uid: 26f10ba9-9d9a-11e9-89a4-06f3e4907b6a
  resourceVersion: "338585"
  selfLink: /api/v1/namespaces/knative-serving/pods/activator-7587c7475b-lwbw2
  uid: 639b16de-9d9b-11e9-89a4-06f3e4907b6a
spec:
  containers:
  - args:
    - -logtostderr=false
    - -stderrthreshold=FATAL
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: SYSTEM_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: CONFIG_LOGGING_NAME
      value: config-logging
    image: gcr.io/knative-releases/github.com/knative/serving/cmd/activator@sha256:f553b6cb7599f2f71190ddc93024952e22f2f55e97a3f38519d4d622fc751651
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        httpHeaders:
        - name: k-kubelet-probe
          value: activator
        path: /healthz
        port: 8012
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: activator
    ports:
    - containerPort: 8012
      name: http1-port
      protocol: TCP
    - containerPort: 8013
      name: h2c-port
      protocol: TCP
    - containerPort: 9090
      name: metrics-port
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        httpHeaders:
        - name: k-kubelet-probe
          value: activator
        path: /healthz
        port: 8012
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        cpu: 200m
        memory: 600Mi
      requests:
        cpu: 20m
        memory: 60Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/config-logging
      name: config-logging
    - mountPath: /etc/config-observability
      name: config-observability
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: controller-token-2d485
      readOnly: true
  - args:
    - proxy
    - sidecar
    - --domain
    - $(POD_NAMESPACE).svc.cluster.local
    - --configPath
    - /etc/istio/proxy
    - --binaryPath
    - /usr/local/bin/envoy
    - --serviceCluster
    - activator.$(POD_NAMESPACE)
    - --drainDuration
    - 45s
    - --parentShutdownDuration
    - 1m0s
    - --discoveryAddress
    - istio-pilot.istio-system:15010
    - --zipkinAddress
    - zipkin.istio-system:9411
    - --connectTimeout
    - 10s
    - --proxyAdminPort
    - "15000"
    - --concurrency
    - "2"
    - --controlPlaneAuthPolicy
    - NONE
    - --statusPort
    - "15020"
    - --applicationPorts
    - 8012,8013,9090
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: INSTANCE_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: ISTIO_META_POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: ISTIO_META_CONFIG_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: ISTIO_META_INTERCEPTION_MODE
      value: REDIRECT
    - name: ISTIO_METAJSON_ANNOTATIONS
      value: |
        {"kubernetes.io/psp":"ibm-privileged-psp","sidecar.istio.io/inject":"true"}
    - name: ISTIO_METAJSON_LABELS
      value: |
        {"app":"activator","pod-template-hash":"7587c7475b","role":"activator","serving.knative.dev/release":"v0.6.0"}
    image: icr.io/ext/istio/proxyv2:1.1.7
    imagePullPolicy: IfNotPresent
    name: istio-proxy
    ports:
    - containerPort: 15090
      name: http-envoy-prom
      protocol: TCP
    readinessProbe:
      failureThreshold: 30
      httpGet:
        path: /healthz/ready
        port: 15020
        scheme: HTTP
      initialDelaySeconds: 1
      periodSeconds: 2
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 128Mi
    securityContext:
      procMount: Default
      readOnlyRootFilesystem: true
      runAsUser: 1337
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/istio/proxy
      name: istio-envoy
    - mountPath: /etc/certs/
      name: istio-certs
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: controller-token-2d485
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  initContainers:
  - args:
    - -p
    - "15001"
    - -u
    - "1337"
    - -m
    - REDIRECT
    - -i
    - '*'
    - -x
    - ""
    - -b
    - 8012,8013,9090
    - -d
    - "15020"
    image: icr.io/ext/istio/proxy_init:1.1.7
    imagePullPolicy: IfNotPresent
    name: istio-init
    resources:
      limits:
        cpu: 100m
        memory: 50Mi
      requests:
        cpu: 10m
        memory: 10Mi
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
      procMount: Default
      runAsNonRoot: false
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  nodeName: 10.138.173.69
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: controller
  serviceAccountName: controller
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - configMap:
      defaultMode: 420
      name: config-logging
    name: config-logging
  - configMap:
      defaultMode: 420
      name: config-observability
    name: config-observability
  - name: controller-token-2d485
    secret:
      defaultMode: 420
      secretName: controller-token-2d485
  - emptyDir:
      medium: Memory
    name: istio-envoy
  - name: istio-certs
    secret:
      defaultMode: 420
      optional: true
      secretName: istio.controller
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2019-07-03T14:03:56Z
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2019-07-03T14:03:52Z
    message: 'containers with unready status: [activator]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: 2019-07-03T14:03:52Z
    message: 'containers with unready status: [activator]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: 2019-07-03T14:03:52Z
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://4d53f35b4b81f716e860456b5065abc2739562c5d5ec9540f537bcd0f8b4c1e9
    image: sha256:bf56e322a8bfa210170c2bd500bca29258ed66dadc702dc12a6d97ef48b23286
    imageID: gcr.io/knative-releases/github.com/knative/serving/cmd/activator@sha256:f553b6cb7599f2f71190ddc93024952e22f2f55e97a3f38519d4d622fc751651
    lastState:
      terminated:
        containerID: containerd://4d53f35b4b81f716e860456b5065abc2739562c5d5ec9540f537bcd0f8b4c1e9
        exitCode: 1
        finishedAt: 2019-07-03T14:04:53Z
        reason: Error
        startedAt: 2019-07-03T14:04:53Z
    name: activator
    ready: false
    restartCount: 3
    state:
      waiting:
        message: Back-off 40s restarting failed container=activator pod=activator-7587c7475b-lwbw2_knative-serving(639b16de-9d9b-11e9-89a4-06f3e4907b6a)
        reason: CrashLoopBackOff
  - containerID: containerd://401bb101a73e778e1a6efe9fffffc2965d3837b3130dd5e6f703ea61e053f5dc
    image: icr.io/ext/istio/proxyv2:1.1.7
    imageID: icr.io/ext/istio/proxyv2@sha256:e6f039115c7d5ef9c8f6b049866fbf9b6f5e2255d3a733bb8756b36927749822
    lastState: {}
    name: istio-proxy
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: 2019-07-03T14:03:57Z
  hostIP: 10.138.173.69
  initContainerStatuses:
  - containerID: containerd://b6b62a4ef4c9a4df595d24318bdac434356ff617374d27d25e7697226df9f190
    image: icr.io/ext/istio/proxy_init:1.1.7
    imageID: icr.io/ext/istio/proxy_init@sha256:9056ebb0757be99006ad568331e11bee99ae0daaa4459e7c15dfaf0e0cba2f48
    lastState: {}
    name: istio-init
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://b6b62a4ef4c9a4df595d24318bdac434356ff617374d27d25e7697226df9f190
        exitCode: 0
        finishedAt: 2019-07-03T14:03:55Z
        reason: Completed
        startedAt: 2019-07-03T14:03:54Z
  phase: Running
  podIP: 172.30.158.33
  qosClass: Burstable
  startTime: 2019-07-03T14:03:52Z

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 18 (16 by maintainers)

Most upvoted comments

Not failing in the presence of malformed configmaps seems like useful resilience, but I think I’d rather see us solve the first-order problem first, which is getting better synchronous validation of our configmaps.

I forget who I was talking to about this recently, but the tl;dr was: Let's extend our webhook to validate our configmaps.

Today the key configuration our webhook takes is this map here: https://github.com/knative/serving/blob/c6f7bc48faa2c882539d56719d5b23fdd653ea5c/cmd/webhook/main.go#L108-L123

… which, combined with apis.Validatable and apis.Defaultable, forms the core of our webhook logic.
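For readers without the linked file open, the apis.Validatable and apis.Defaultable interfaces look approximately like the sketch below; this is paraphrased from memory rather than copied from the repository, so the exact signatures may differ between releases.

```go
// Approximate shape of the knative.dev/pkg/apis interfaces mentioned above;
// paraphrased, not copied from the repository, so treat it as illustrative.
package apis

import "context"

// FieldError is the structured validation error the webhook reports; the real
// type carries field paths and messages (elided here).
type FieldError struct{}

// Validatable is implemented by types the webhook can validate synchronously.
type Validatable interface {
	Validate(ctx context.Context) *FieldError
}

// Defaultable is implemented by types the webhook can default before validating.
type Defaultable interface {
	SetDefaults(ctx context.Context)
}
```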

I would propose that, to synchronously validate ConfigMaps, we also allow webhook consumers to provide a mapping like the ones used in our controllers' respective stores, e.g. here: https://github.com/knative/serving/blob/c6f7bc48faa2c882539d56719d5b23fdd653ea5c/pkg/reconciler/revision/config/store.go#L64-L71

The shared webhook infrastructure would then (when these are provided) register a validating webhook on ConfigMaps in system.Namespace() and dispatch on the ConfigMap name to invoke the appropriate constructor. If the constructor returns an error, we would return that error and reject the ConfigMap update synchronously.
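A hypothetical sketch of that dispatch is below; the Constructors type and the example constructor are placeholders of my own, not existing Knative APIs, but they show how rejecting a bad ConfigMap synchronously would look.

```go
// Hypothetical sketch of the proposed ConfigMap validation dispatch: webhook
// consumers register a map from ConfigMap name to the same kind of constructor
// their config Store already uses, and the webhook rejects any update whose
// constructor returns an error. Names here are illustrative placeholders.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// Constructors maps a ConfigMap name (e.g. "config-tracing") to a parser that
// errors on invalid contents.
type Constructors map[string]func(*corev1.ConfigMap) (interface{}, error)

// validateConfigMap is what the shared webhook would call for admission
// requests on ConfigMaps in system.Namespace(): dispatch on the name and
// reject the update synchronously if the constructor fails.
func validateConfigMap(ctors Constructors, cm *corev1.ConfigMap) error {
	ctor, ok := ctors[cm.Name]
	if !ok {
		return nil // not a ConfigMap we know how to parse; admit it
	}
	if _, err := ctor(cm); err != nil {
		return fmt.Errorf("invalid ConfigMap %q: %v", cm.Name, err)
	}
	return nil
}

func main() {
	ctors := Constructors{
		"config-tracing": func(cm *corev1.ConfigMap) (interface{}, error) {
			if cm.Data["enable"] == "true" && cm.Data["zipkin-endpoint"] == "" {
				return nil, fmt.Errorf("enable is true but zipkin-endpoint is missing")
			}
			return cm.Data, nil
		},
	}

	bad := &corev1.ConfigMap{Data: map[string]string{"enable": "true"}}
	bad.Name = "config-tracing"
	// Prints a validation error; the webhook would return this and the apply
	// of the bad ConfigMap would be rejected before the activator ever sees it.
	fmt.Println(validateConfigMap(ctors, bad))
}
```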

I think with such a defense in place, it will be much harder to get into the bad state discussed here, and much more obvious to folks who hit these guard rails what the problem is. @nimakaviani would you be up for tackling this? I think it'd be extremely useful.