serving: Activator pod `CrashLoopBackOff` if tracing configmap is malformed on start
In what area(s)?
/area monitoring
What version of Knative?
0.6.0 HEAD
Expected Behavior
If a user applies a malformed `config-tracing` ConfigMap (e.g. `enable` set to `"true"` without a `zipkin-endpoint` configured, or a `sample-rate` that is not a valid float), the activator should disable tracing and keep running, or fall back to a default value. Tracing is an observability feature, and misconfiguring it should not stop the activator from working.
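A minimal sketch of the behavior suggested above (illustrative only: `TracingConfig`, `newTracingConfigFromMap`, and the key handling below are hypothetical stand-ins, not the actual Knative tracing package), in which bad values are logged and replaced with safe defaults instead of aborting:

```go
package main

import (
	"fmt"
	"log"
	"strconv"
)

// TracingConfig is a simplified stand-in for the activator's tracing settings.
type TracingConfig struct {
	Enable         bool
	ZipkinEndpoint string
	SampleRate     float64
}

// defaultTracingConfig keeps tracing off with a conservative sample rate.
func defaultTracingConfig() *TracingConfig {
	return &TracingConfig{Enable: false, SampleRate: 0.1}
}

// newTracingConfigFromMap builds a config from ConfigMap data, falling back to
// safe defaults (and logging a warning) instead of returning a fatal error.
func newTracingConfigFromMap(data map[string]string) *TracingConfig {
	cfg := defaultTracingConfig()

	if v, ok := data["enable"]; ok {
		enable, err := strconv.ParseBool(v)
		if err != nil {
			log.Printf("config-tracing: invalid enable %q, leaving tracing disabled", v)
			return defaultTracingConfig()
		}
		cfg.Enable = enable
	}

	if v, ok := data["sample-rate"]; ok {
		rate, err := strconv.ParseFloat(v, 64)
		if err != nil || rate < 0 || rate > 1 {
			log.Printf("config-tracing: invalid sample-rate %q, using default %v", v, cfg.SampleRate)
		} else {
			cfg.SampleRate = rate
		}
	}

	cfg.ZipkinEndpoint = data["zipkin-endpoint"]
	if cfg.Enable && cfg.ZipkinEndpoint == "" {
		// The case from this issue: tracing enabled without an endpoint.
		// Disable tracing instead of crashing the activator.
		log.Printf("config-tracing: enable is true but zipkin-endpoint is empty; disabling tracing")
		cfg.Enable = false
	}
	return cfg
}

func main() {
	// The malformed ConfigMap from the reproduction steps: only enable is set.
	cfg := newTracingConfigFromMap(map[string]string{"enable": "true"})
	fmt.Printf("tracing config: %+v\n", cfg) // the activator keeps serving with tracing disabled
}
```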
Actual Behavior
The activator goes into `CrashLoopBackOff` and keeps restarting when a bad `config-tracing` ConfigMap is applied.
Knative Eventing has a similar issue (#1417); it uses the same tracing library, so both failures have the same root cause.
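For context, a minimal illustration of the failure mode (hypothetical names, not the actual activator code): a strict config constructor rejects the malformed ConfigMap, startup treats that error as fatal, the process exits with a non-zero code, and the kubelet keeps restarting the container until the pod lands in `CrashLoopBackOff`:

```go
package main

import (
	"errors"
	"log"
)

// newTracingConfig stands in for a strict config constructor that rejects a
// ConfigMap which enables tracing without a zipkin endpoint.
func newTracingConfig(data map[string]string) (map[string]string, error) {
	if data["enable"] == "true" && data["zipkin-endpoint"] == "" {
		return nil, errors.New("tracing enabled but no zipkin-endpoint configured")
	}
	return data, nil
}

func main() {
	// Simulates process startup with the ConfigMap from this issue.
	if _, err := newTracingConfig(map[string]string{"enable": "true"}); err != nil {
		// Treating an observability misconfiguration as fatal exits the process,
		// so the kubelet restarts the container until it backs off.
		log.Fatalf("Error processing config-tracing: %v", err)
	}
	log.Print("activator started")
}
```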
Steps to Reproduce the Problem
1. Install Knative Serving.
2. Create a bad `config-tracing` ConfigMap, e.g.

   ```yaml
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: config-tracing
     namespace: knative-serving
   data:
     enable: "true"
   ```

   and apply it.
3. Delete the activator pod to force a restart so that it picks up the new ConfigMap.
4. The newly started activator pod goes into `CrashLoopBackOff`.
```
$ k get pods -n knative-serving
NAME                             READY   STATUS    RESTARTS   AGE
activator-7587c7475b-lwbw2       1/2     Running   2          27s
autoscaler-5bf6cfd9bc-b9fwc      2/2     Running   0          9m16s
controller-dc64767cf-k55wj       1/1     Running   0          9m
networking-istio-fc9c659-zj7sf   1/1     Running   0          8m59s
webhook-5fdcd4499d-jkvpv         1/1     Running   0          8m59s
```
```
$ k get pods activator-7587c7475b-lwbw2 -n knative-serving
NAME                         READY   STATUS             RESTARTS   AGE
activator-7587c7475b-lwbw2   1/2     CrashLoopBackOff   2          52s
```
```
$ k get pods activator-7587c7475b-lwbw2 -n knative-serving -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
kubernetes.io/psp: ibm-privileged-psp
sidecar.istio.io/inject: "true"
sidecar.istio.io/status: '{"version":"e08692ac44064480d18c557aa7dcaf719ad65ed6225e8937d5bff806605a1cef","initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-certs"],"imagePullSecrets":null}'
creationTimestamp: 2019-07-03T14:03:52Z
generateName: activator-7587c7475b-
labels:
app: activator
pod-template-hash: 7587c7475b
role: activator
serving.knative.dev/release: v0.6.0
name: activator-7587c7475b-lwbw2
namespace: knative-serving
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: activator-7587c7475b
uid: 26f10ba9-9d9a-11e9-89a4-06f3e4907b6a
resourceVersion: "338585"
selfLink: /api/v1/namespaces/knative-serving/pods/activator-7587c7475b-lwbw2
uid: 639b16de-9d9b-11e9-89a4-06f3e4907b6a
spec:
containers:
- args:
- -logtostderr=false
- -stderrthreshold=FATAL
env:
- name: POD_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
- name: SYSTEM_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
- name: CONFIG_LOGGING_NAME
value: config-logging
image: gcr.io/knative-releases/github.com/knative/serving/cmd/activator@sha256:f553b6cb7599f2f71190ddc93024952e22f2f55e97a3f38519d4d622fc751651
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
httpGet:
httpHeaders:
- name: k-kubelet-probe
value: activator
path: /healthz
port: 8012
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
name: activator
ports:
- containerPort: 8012
name: http1-port
protocol: TCP
- containerPort: 8013
name: h2c-port
protocol: TCP
- containerPort: 9090
name: metrics-port
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
httpHeaders:
- name: k-kubelet-probe
value: activator
path: /healthz
port: 8012
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
cpu: 200m
memory: 600Mi
requests:
cpu: 20m
memory: 60Mi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /etc/config-logging
name: config-logging
- mountPath: /etc/config-observability
name: config-observability
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: controller-token-2d485
readOnly: true
- args:
- proxy
- sidecar
- --domain
- $(POD_NAMESPACE).svc.cluster.local
- --configPath
- /etc/istio/proxy
- --binaryPath
- /usr/local/bin/envoy
- --serviceCluster
- activator.$(POD_NAMESPACE)
- --drainDuration
- 45s
- --parentShutdownDuration
- 1m0s
- --discoveryAddress
- istio-pilot.istio-system:15010
- --zipkinAddress
- zipkin.istio-system:9411
- --connectTimeout
- 10s
- --proxyAdminPort
- "15000"
- --concurrency
- "2"
- --controlPlaneAuthPolicy
- NONE
- --statusPort
- "15020"
- --applicationPorts
- 8012,8013,9090
env:
- name: POD_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
- name: INSTANCE_IP
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: status.podIP
- name: ISTIO_META_POD_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
- name: ISTIO_META_CONFIG_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
- name: ISTIO_META_INTERCEPTION_MODE
value: REDIRECT
- name: ISTIO_METAJSON_ANNOTATIONS
value: |
{"kubernetes.io/psp":"ibm-privileged-psp","sidecar.istio.io/inject":"true"}
- name: ISTIO_METAJSON_LABELS
value: |
{"app":"activator","pod-template-hash":"7587c7475b","role":"activator","serving.knative.dev/release":"v0.6.0"}
image: icr.io/ext/istio/proxyv2:1.1.7
imagePullPolicy: IfNotPresent
name: istio-proxy
ports:
- containerPort: 15090
name: http-envoy-prom
protocol: TCP
readinessProbe:
failureThreshold: 30
httpGet:
path: /healthz/ready
port: 15020
scheme: HTTP
initialDelaySeconds: 1
periodSeconds: 2
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
cpu: "2"
memory: 1Gi
requests:
cpu: 100m
memory: 128Mi
securityContext:
procMount: Default
readOnlyRootFilesystem: true
runAsUser: 1337
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /etc/istio/proxy
name: istio-envoy
- mountPath: /etc/certs/
name: istio-certs
readOnly: true
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: controller-token-2d485
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
initContainers:
- args:
- -p
- "15001"
- -u
- "1337"
- -m
- REDIRECT
- -i
- '*'
- -x
- ""
- -b
- 8012,8013,9090
- -d
- "15020"
image: icr.io/ext/istio/proxy_init:1.1.7
imagePullPolicy: IfNotPresent
name: istio-init
resources:
limits:
cpu: 100m
memory: 50Mi
requests:
cpu: 10m
memory: 10Mi
securityContext:
capabilities:
add:
- NET_ADMIN
procMount: Default
runAsNonRoot: false
runAsUser: 0
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
nodeName: 10.138.173.69
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: controller
serviceAccountName: controller
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- configMap:
defaultMode: 420
name: config-logging
name: config-logging
- configMap:
defaultMode: 420
name: config-observability
name: config-observability
- name: controller-token-2d485
secret:
defaultMode: 420
secretName: controller-token-2d485
- emptyDir:
medium: Memory
name: istio-envoy
- name: istio-certs
secret:
defaultMode: 420
optional: true
secretName: istio.controller
status:
conditions:
- lastProbeTime: null
lastTransitionTime: 2019-07-03T14:03:56Z
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: 2019-07-03T14:03:52Z
message: 'containers with unready status: [activator]'
reason: ContainersNotReady
status: "False"
type: Ready
- lastProbeTime: null
lastTransitionTime: 2019-07-03T14:03:52Z
message: 'containers with unready status: [activator]'
reason: ContainersNotReady
status: "False"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: 2019-07-03T14:03:52Z
status: "True"
type: PodScheduled
containerStatuses:
- containerID: containerd://4d53f35b4b81f716e860456b5065abc2739562c5d5ec9540f537bcd0f8b4c1e9
image: sha256:bf56e322a8bfa210170c2bd500bca29258ed66dadc702dc12a6d97ef48b23286
imageID: gcr.io/knative-releases/github.com/knative/serving/cmd/activator@sha256:f553b6cb7599f2f71190ddc93024952e22f2f55e97a3f38519d4d622fc751651
lastState:
terminated:
containerID: containerd://4d53f35b4b81f716e860456b5065abc2739562c5d5ec9540f537bcd0f8b4c1e9
exitCode: 1
finishedAt: 2019-07-03T14:04:53Z
reason: Error
startedAt: 2019-07-03T14:04:53Z
name: activator
ready: false
restartCount: 3
state:
waiting:
message: Back-off 40s restarting failed container=activator pod=activator-7587c7475b-lwbw2_knative-serving(639b16de-9d9b-11e9-89a4-06f3e4907b6a)
reason: CrashLoopBackOff
- containerID: containerd://401bb101a73e778e1a6efe9fffffc2965d3837b3130dd5e6f703ea61e053f5dc
image: icr.io/ext/istio/proxyv2:1.1.7
imageID: icr.io/ext/istio/proxyv2@sha256:e6f039115c7d5ef9c8f6b049866fbf9b6f5e2255d3a733bb8756b36927749822
lastState: {}
name: istio-proxy
ready: true
restartCount: 0
state:
running:
startedAt: 2019-07-03T14:03:57Z
hostIP: 10.138.173.69
initContainerStatuses:
- containerID: containerd://b6b62a4ef4c9a4df595d24318bdac434356ff617374d27d25e7697226df9f190
image: icr.io/ext/istio/proxy_init:1.1.7
imageID: icr.io/ext/istio/proxy_init@sha256:9056ebb0757be99006ad568331e11bee99ae0daaa4459e7c15dfaf0e0cba2f48
lastState: {}
name: istio-init
ready: true
restartCount: 0
state:
terminated:
containerID: containerd://b6b62a4ef4c9a4df595d24318bdac434356ff617374d27d25e7697226df9f190
exitCode: 0
finishedAt: 2019-07-03T14:03:55Z
reason: Completed
startedAt: 2019-07-03T14:03:54Z
phase: Running
podIP: 172.30.158.33
qosClass: Burstable
startTime: 2019-07-03T14:03:52Z
```
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 18 (16 by maintainers)
Not failing in the presence of malformed configmaps seems like useful resilience, but I think I’d rather see us solve the first-order problem first, which is getting better synchronous validation of our configmaps.
I forget who I was talking to about this recently, but the tl;dr was: Let’s extend our webhook to validate our configmaps.
Today the key configuration our webhook takes is this map here: https://github.com/knative/serving/blob/c6f7bc48faa2c882539d56719d5b23fdd653ea5c/cmd/webhook/main.go#L108-L123
… which, combined with `apis.Validatable` and `apis.Defaultable`, forms the core of our webhook logic.

I would propose that, to synchronously validate ConfigMaps, we also allow webhook consumers to provide a mapping like those used in our controllers’ respective stores, e.g. here: https://github.com/knative/serving/blob/c6f7bc48faa2c882539d56719d5b23fdd653ea5c/pkg/reconciler/revision/config/store.go#L64-L71

The shared webhook infrastructure would then (when these are provided) register a validating webhook on ConfigMaps in `system.Namespace()` and dispatch on the ConfigMap name to invoke the appropriate constructor. If the constructor errors, we would return the error and reject the ConfigMap update synchronously.

I think with such a defense in place, it will be much harder to get into the bad state discussed here, and much more obvious to folks who hit these guard rails what the problem is. @nimakaviani would you be up for tackling this? I think it’d be extremely useful.
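To make the proposal concrete, a rough sketch under the assumption that webhook consumers hand the shared infrastructure a ConfigMap-name-to-constructor map (all type and function names below are illustrative, not the API that the shared webhook code actually exposes):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// Constructor mirrors the store-style constructors referenced above, reduced
// here to just the validation error.
type Constructor func(*corev1.ConfigMap) error

// configValidator holds the ConfigMap-name-to-constructor mapping that webhook
// consumers would provide, analogous to the resource map in cmd/webhook/main.go.
type configValidator struct {
	constructors map[string]Constructor
}

// Validate is what the admission webhook would call for CREATE/UPDATE of
// ConfigMaps in the system namespace; names without a constructor pass through.
func (v *configValidator) Validate(cm *corev1.ConfigMap) error {
	ctor, ok := v.constructors[cm.Name]
	if !ok {
		return nil
	}
	if err := ctor(cm); err != nil {
		return fmt.Errorf("invalid ConfigMap %q: %v", cm.Name, err)
	}
	return nil
}

func main() {
	v := &configValidator{constructors: map[string]Constructor{
		// Ideally the same parser the activator uses at startup, so the webhook
		// and the running components agree on what "valid" means.
		"config-tracing": func(cm *corev1.ConfigMap) error {
			if cm.Data["enable"] == "true" && cm.Data["zipkin-endpoint"] == "" {
				return fmt.Errorf("tracing enabled but zipkin-endpoint is missing")
			}
			return nil
		},
	}}

	bad := &corev1.ConfigMap{Data: map[string]string{"enable": "true"}}
	bad.Name = "config-tracing"
	fmt.Println(v.Validate(bad)) // rejected synchronously at apply time
}
```

Because the webhook would reuse the same constructors the components’ config stores use, an update that would otherwise crash the activator gets rejected at apply time with a readable error.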