operator-lifecycle-manager: operatorhubio-catalog in CrashLoopBackOff without a clear error message

Bug Report

I quite frequently find the operatorhubio-catalog pod in CrashLoopBackOff, with no useful information for debugging how it got there. It would help to have more diagnostic information about why this pod ends up in that state.

What did you do?

Deployed OLM using the provided Helm template and waited for some time.

olm                     operatorhubio-catalog-t66gt                              0/1     CrashLoopBackOff             3848       8d

What did you expect to see?

I would expect the process to stay healthy unless there is an actual problem. When it does hit a problem, I would expect the operatorhubio-catalog log to say what is wrong with the pod, instead of containing only the single gRPC server startup message.

Environment

  • operator-lifecycle-manager version:
  • Kubernetes version information:
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.0", GitCommit:"2bd9643cee5b3b3a5ecbd3af49d09018f0773c77", GitTreeState:"clean", BuildDate:"2019-09-18T14:36:53Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:05:50Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster kind: OpenStack, bare metal, and AWS

Possible Solution

Additional context

Name:         operatorhubio-catalog-5jtrd
Namespace:    olm
Priority:     0
Node:         admin-kcp-primary-0/192.168.200.9
Start Time:   Fri, 25 Oct 2019 09:51:57 -0500
Labels:       olm.catalogSource=operatorhubio-catalog
Annotations:  cni.projectcalico.org/podIP: 10.42.1.2/32
              kubernetes.io/psp: default-psp
Status:       Running
IP:           10.42.1.2
IPs:          <none>
Containers:
  registry-server:
    Container ID:   docker://37233b694ec50f4963d23cd9447fd458a19cb3f36013ca53521a500e1fceba4d
    Image:          quay.io/operator-framework/upstream-community-operators:latest
    Image ID:       docker-pullable://quay.io/operator-framework/upstream-community-operators@sha256:95a59849ea594e97742264d66b80dcc2a8ac3515ff22cf64538b21101f345111
    Port:           50051/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Fri, 25 Oct 2019 09:55:03 -0500
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Fri, 25 Oct 2019 09:54:23 -0500
      Finished:     Fri, 25 Oct 2019 09:55:01 -0500
    Ready:          True
    Restart Count:  4
    Requests:
      cpu:      10m
      memory:   50Mi
    Liveness:   exec [grpc_health_probe -addr=localhost:50051] delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:  exec [grpc_health_probe -addr=localhost:50051] delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-5hwc7 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  default-token-5hwc7:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-5hwc7
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     
Events:
  Type     Reason     Age                    From                                Message
  ----     ------     ----                   ----                                -------
  Normal   Scheduled  4m1s                   default-scheduler                   Successfully assigned olm/operatorhubio-catalog-5jtrd to admin-kcp-primary-0
  Normal   Started    2m54s (x2 over 3m49s)  kubelet, admin-kcp-primary-0  Started container registry-server
  Warning  Unhealthy  2m20s (x6 over 3m20s)  kubelet, admin-kcp-primary-0  Readiness probe failed: timeout: failed to connect service "localhost:50051" within 1s
  Normal   Killing    2m17s (x2 over 2m57s)  kubelet, admin-kcp-primary-0  Container registry-server failed liveness probe, will be restarted
  Warning  Unhealthy  2m17s (x6 over 3m17s)  kubelet, admin-kcp-primary-0  Liveness probe failed: timeout: failed to connect service "localhost:50051" within 1s
  Normal   Pulling    2m16s (x3 over 3m55s)  kubelet, admin-kcp-primary-0  Pulling image "quay.io/operator-framework/upstream-community-operators:latest"
  Normal   Pulled     2m15s (x3 over 3m51s)  kubelet, admin-kcp-primary-0  Successfully pulled image "quay.io/operator-framework/upstream-community-operators:latest"
  Normal   Created    2m15s (x3 over 3m50s)  kubelet, admin-kcp-primary-0  Created container registry-server
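
The probe settings and the `Last State` timestamps in the output above can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the original report. It assumes a simplified model of the kubelet: the first liveness probe runs somewhere within one period after the initial delay, every probe fails, and the container is killed on the third consecutive failure. The function name `liveness_kill_window` is hypothetical, and the model ignores jitter and the termination grace period.

```python
from email.utils import parsedate_to_datetime

# Values copied from the Liveness line of the `kubectl describe` output.
LIVENESS_DELAY_S = 10    # delay=10s
LIVENESS_PERIOD_S = 10   # period=10s
LIVENESS_FAILURES = 3    # #failure=3


def liveness_kill_window(delay_s, period_s, failures):
    """Return (earliest, latest) seconds after container start at which the
    final consecutive failed probe can occur, assuming every probe fails.

    Simplified model (an assumption, not kubelet source): the first probe
    lands between `delay_s` and `delay_s + period_s` after start.
    """
    earliest = delay_s + (failures - 1) * period_s
    latest = earliest + period_s
    return earliest, latest


# Timestamps copied from the "Last State: Terminated" section.
started = parsedate_to_datetime("Fri, 25 Oct 2019 09:54:23 -0500")
finished = parsedate_to_datetime("Fri, 25 Oct 2019 09:55:01 -0500")
observed_s = int((finished - started).total_seconds())

lower_s, upper_s = liveness_kill_window(
    LIVENESS_DELAY_S, LIVENESS_PERIOD_S, LIVENESS_FAILURES
)

print(f"container lived {observed_s}s; modeled kill window {lower_s}-{upper_s}s")
print("consistent with a liveness-probe kill:", lower_s <= observed_s <= upper_s)
```

Under these assumptions, the previous container's lifetime falls inside the modeled window, which matches the `Killing` event: the server never answered on port 50051 before the liveness probe gave up, rather than crashing on its own.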

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 2
  • Comments: 20 (2 by maintainers)

Most upvoted comments

No. We wound up uninstalling OLM and working with the upstream Helm charts instead. Sorry.

@ecordell: here they are

$ kubectl logs -f -n olm operatorhubio-catalog-5jtrd 
time="2019-10-28T16:35:43Z" level=info msg="serving registry" database=/bundles.db port=50051
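
`kubectl logs -f` follows the container that is currently running, so output from the instance that was just restarted does not appear in it. As a hedged sketch (not something posted in the thread), the small helper below builds the equivalent command with kubectl's `--previous` flag to read the last terminated container instead. The helper names `previous_logs_cmd` and `fetch_previous_logs` are hypothetical. The namespace, pod, and container names are taken from the output earlier in this issue.

```python
import subprocess


def previous_logs_cmd(namespace, pod, container=None):
    """Build the kubectl argv that prints logs of the previously terminated
    container instance in a pod."""
    cmd = ["kubectl", "logs", "--previous", "-n", namespace, pod]
    if container:
        cmd += ["-c", container]
    return cmd


def fetch_previous_logs(namespace, pod, container=None):
    """Run the command and return stdout plus stderr as one string.
    Requires kubectl on PATH and access to the cluster."""
    result = subprocess.run(
        previous_logs_cmd(namespace, pod, container),
        capture_output=True,
        text=True,
        check=False,  # kubectl errors are returned in the text, not raised
    )
    return result.stdout + result.stderr


# Show the command only; nothing is executed against a cluster here.
print(" ".join(previous_logs_cmd("olm", "operatorhubio-catalog-5jtrd", "registry-server")))
```

If the previous instance logged nothing beyond the same startup line, that would itself support the complaint in this report that the registry server gives no indication of why it is not answering health checks.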