katib: MetricsUnavailable for random example experiment

/kind bug

What steps did you take and what happened: Installed Kubeflow on a clean EKS cluster using this guide: https://www.kubeflow.org/docs/aws/deploy/install-kubeflow/

Submitted random-experiment from UI.

After an hour there is still no status, logs or metrics.

The pods spawned by the Trial’s Job do produce logs.

What did you expect to happen: Metrics to show on Katib UI. Experiment to finish.

Anything else you would like to add:

Pod logs:

INFO:root:Epoch[19] Batch [100]	Speed: 27953.82 samples/sec	accuracy=0.116646
INFO:root:Epoch[19] Batch [200]	Speed: 24880.15 samples/sec	accuracy=0.111406
INFO:root:Epoch[19] Batch [300]	Speed: 23859.95 samples/sec	accuracy=0.112344
INFO:root:Epoch[19] Batch [400]	Speed: 27594.30 samples/sec	accuracy=0.115937
INFO:root:Epoch[19] Batch [500]	Speed: 18158.40 samples/sec	accuracy=0.115312
INFO:root:Epoch[19] Batch [600]	Speed: 26611.55 samples/sec	accuracy=0.102188
INFO:root:Epoch[19] Batch [700]	Speed: 27180.25 samples/sec	accuracy=0.114687
INFO:root:Epoch[19] Batch [800]	Speed: 27309.44 samples/sec	accuracy=0.113906
INFO:root:Epoch[19] Batch [900]	Speed: 26656.13 samples/sec	accuracy=0.105313
INFO:root:Epoch[19] Train-accuracy=0.122044
INFO:root:Epoch[19] Time cost=2.383
INFO:root:Epoch[19] Validation-accuracy=0.113854

Description of a Trial:

Name:         hptest-bt8nsw2z
Namespace:    kubeflow
Labels:       experiment=hptest
Annotations:  <none>
API Version:  kubeflow.org/v1alpha3
Kind:         Trial
Metadata:
  Creation Timestamp:  2020-01-28T16:18:39Z
  Finalizers:
    clean-metrics-in-db
  Generation:  1
  Owner References:
    API Version:           kubeflow.org/v1alpha3
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Experiment
    Name:                  hptest
    UID:                   ab0e3811-41e9-11ea-a0cf-0a9ff0751f4a
  Resource Version:        127230
  Self Link:               /apis/kubeflow.org/v1alpha3/namespaces/kubeflow/trials/hptest-bt8nsw2z
  UID:                     d810a019-41e9-11ea-a0cf-0a9ff0751f4a
Spec:
  Metrics Collector:
  Objective:
    Additional Metric Names:
      accuracy
    Goal:                   0.99
    Objective Metric Name:  Validation-accuracy
    Type:                   maximize
  Parameter Assignments:
    Name:    --lr
    Value:   0.020744080613308936
    Name:    --num-layers
    Value:   3
    Name:    --optimizer
    Value:   sgd
  Run Spec:  apiVersion: batch/v1
kind: Job
metadata:
  name: hptest-bt8nsw2z
  namespace: kubeflow
spec:
  template:
    spec:
      containers:
      - name: hptest-bt8nsw2z
        image: docker.io/katib/mxnet-mnist-example
        command:
        - "python"
        - "/mxnet/example/image-classification/train_mnist.py"
        - "--batch-size=64"
        - "--lr=0.020744080613308936"
        - "--num-layers=3"
        - "--optimizer=sgd"
      restartPolicy: Never
Status:
  Conditions:
    Last Transition Time:  2020-01-28T16:18:39Z
    Last Update Time:      2020-01-28T16:18:39Z
    Message:               Trial is created
    Reason:                TrialCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2020-01-28T16:19:43Z
    Last Update Time:      2020-01-28T16:19:43Z
    Message:               Trial is running
    Reason:                TrialRunning
    Status:                False
    Type:                  Running
    Last Transition Time:  2020-01-28T16:19:43Z
    Last Update Time:      2020-01-28T16:19:43Z
    Message:               Metrics are not available
    Reason:                MetricsUnavailable
    Status:                False
    Type:                  Succeeded
  Start Time:              2020-01-28T16:18:39Z
Events:
  Type     Reason              Age                 From              Message
  ----     ------              ----                ----              -------
  Warning  MetricsUnavailable  31m (x2 over 139m)  trial-controller  Metrics are not available for Job hptest-bt8nsw2z

Environment:

Kubeflow version: 0.7.1 from https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_aws.0.7.1.yaml
Minikube version: N/A. Deployed on EKS.
Kubernetes version: (use kubectl version): version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.9-eks-c0eccc", GitCommit:"c0eccca51d7500bb03b2f163dd8d534ffeb2f7a2", GitTreeState:"clean", BuildDate:"2019-12-22T23:14:11Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
OS (e.g. from /etc/os-release): Amazon Linux (?)

About this issue

State: closed
Created 4 years ago
Reactions: 2
Comments: 28 (9 by maintainers)

Most upvoted comments

Can you try to do this:

kubectl edit namespace kubeflow
Delete label control-plane=kubeflow
Save changes
Run examples again

This may fix the problem with the metrics collector. You don’t need to make any changes to the Katib examples.

andreyvelich on Feb 4, 2020
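
As an alternative to editing the namespace interactively, the label removal suggested above could be scripted. The following is a minimal sketch, not something taken from the thread: it builds a `kubectl patch` command that uses a JSON merge patch, where setting a key to null deletes it. The helper name `build_remove_label_command` is hypothetical, and the sketch only prints the command rather than running it.

```python
import json
import shlex
from typing import List


def build_remove_label_command(namespace: str, label_key: str) -> List[str]:
    """Return a kubectl argv that removes one label from a namespace.

    Hypothetical helper for illustration only. In a JSON merge patch,
    a key whose value is null is deleted from the target object.
    """
    patch = {"metadata": {"labels": {label_key: None}}}
    return [
        "kubectl", "patch", "namespace", namespace,
        "--type=merge",
        "-p", json.dumps(patch),
    ]


# Namespace and label key are the ones named in the comment above.
cmd = build_remove_label_command("kubeflow", "control-plane")

# Print a copy-pasteable shell command instead of executing it.
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would apply the patch against whatever cluster the current kubectl context points to, so it is worth reviewing the printed command first.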

Sorry @andreyvelich, I misspecified the objective metric name, and that was what caused the issue. Thanks for the help and the rapid responses.

jimmy-hawkfish on Aug 21, 2020
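
A misspecified objective metric name like the one described above can be caught before submitting an Experiment by checking the names against a sample of the training output. The sketch below assumes the metrics appear in the logs as `name=value` pairs, as in the pod logs in the original report. The regular expression is an illustrative assumption, not necessarily the filter Katib itself applies, and the function names are hypothetical.

```python
import re
from typing import Dict, Iterable, List

# Assumed format: "<name>=<number>", e.g. "Validation-accuracy=0.113854".
# This pattern is an illustration and may differ from Katib's own filter.
METRIC_PATTERN = re.compile(r"([\w-]+)\s*=\s*(-?\d+(?:\.\d+)?)")


def collect_metrics(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Map each metric name found in the log lines to its reported values."""
    found: Dict[str, List[float]] = {}
    for line in lines:
        for name, value in METRIC_PATTERN.findall(line):
            found.setdefault(name, []).append(float(value))
    return found


def missing_metric_names(lines: Iterable[str], wanted: Iterable[str]) -> List[str]:
    """Return the wanted metric names that never appear in the logs (exact match)."""
    found = collect_metrics(lines)
    return [name for name in wanted if name not in found]


# Log lines copied from the pod logs in the original report.
sample_logs = [
    "INFO:root:Epoch[19] Batch [900]\tSpeed: 26656.13 samples/sec\taccuracy=0.105313",
    "INFO:root:Epoch[19] Train-accuracy=0.122044",
    "INFO:root:Epoch[19] Time cost=2.383",
    "INFO:root:Epoch[19] Validation-accuracy=0.113854",
]

print(missing_metric_names(sample_logs, ["Validation-accuracy", "accuracy"]))  # []
print(missing_metric_names(sample_logs, ["validation_accuracy"]))  # ['validation_accuracy']
```

The comparison is an exact string match, so a name that differs only in case or in `-` versus `_` is reported as missing.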

Does the namespace have the label katib-metricscollector-injection: enabled?

It seems that the metrics collector is not injected successfully.

/cc @hougangliu @johnugeorge

gaocegege on Feb 4, 2020
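
The two label-related suggestions in this thread can be combined into one small check. The sketch below is an illustration under assumptions: it takes the namespace labels as a plain dictionary (for example the `metadata.labels` object from `kubectl get namespace kubeflow -o json`) and reports anything the comments above flag as suspicious. The function name and the message wording are hypothetical, and whether the `control-plane` label actually interferes with injection is only the hypothesis raised in this thread.

```python
from typing import Dict, List

# Label names as they appear in the comments of this thread.
INJECTION_LABEL = "katib-metricscollector-injection"
SUSPECT_LABEL = "control-plane"


def check_namespace_labels(labels: Dict[str, str]) -> List[str]:
    """Return human-readable findings for a namespace's labels."""
    findings: List[str] = []
    if labels.get(INJECTION_LABEL) != "enabled":
        findings.append(f"label {INJECTION_LABEL}=enabled is not set")
    if SUSPECT_LABEL in labels:
        findings.append(
            f"label {SUSPECT_LABEL}={labels[SUSPECT_LABEL]} is set; "
            "the thread suggests removing it"
        )
    return findings


# Example with made-up label sets for illustration.
print(check_namespace_labels({"control-plane": "kubeflow"}))
print(check_namespace_labels({"katib-metricscollector-injection": "enabled"}))  # []
```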

@andreyvelich

Katib controller should add StdOut Metrics collector if you didn’t specify it in the Experiment yaml file.

I very much doubt this is the case, as all the example Experiment YAML files were broken for me until I added

  metricsCollectorSpec:
    collector:
      kind: StdOut

You should consider fixing the issue or updating the example manifests.

timothyjlaurent on Feb 4, 2020
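
The workaround above, adding an explicit StdOut collector to each example, can be applied programmatically to an Experiment that has already been loaded into a dictionary. This is a minimal sketch under that assumption. The field names come from the snippet in the comment, while the helper name `with_stdout_collector` is hypothetical. An Experiment that already declares a collector is left unchanged.

```python
import copy
from typing import Any, Dict


def with_stdout_collector(experiment: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the Experiment with an explicit StdOut collector.

    Hypothetical helper: only fills in spec.metricsCollectorSpec when the
    Experiment does not define one, and never mutates its input.
    """
    result = copy.deepcopy(experiment)
    spec = result.setdefault("spec", {})
    spec.setdefault("metricsCollectorSpec", {"collector": {"kind": "StdOut"}})
    return result


# Trimmed-down Experiment using values that appear in this thread.
experiment = {
    "apiVersion": "kubeflow.org/v1alpha3",
    "kind": "Experiment",
    "metadata": {"name": "hptest", "namespace": "kubeflow"},
    "spec": {
        "objective": {
            "type": "maximize",
            "goal": 0.99,
            "objectiveMetricName": "Validation-accuracy",
        },
    },
}

patched = with_stdout_collector(experiment)
print(patched["spec"]["metricsCollectorSpec"])
```

Returning a copy keeps the original manifest intact, which makes it easy to diff the two before applying the patched version.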

Hello, the same behavior is happening with the Bayesian optimization example. Metrics aren’t being reported, so the experiment gets stuck after running the first three parallel trials.

Experiment Description:

Name:         bayesianoptimization-example
Namespace:    kubeflow
Labels:       controller-tools.k8s.io=1.0
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"kubeflow.org/v1alpha3","kind":"Experiment","metadata":{"annotations":{},"labels":{"controller-tools.k8s.io":"1.0"},"name":"...
API Version:  kubeflow.org/v1alpha3
Kind:         Experiment
Metadata:
  Creation Timestamp:  2020-01-30T17:43:22Z
  Finalizers:
    update-prometheus-metrics
  Generation:        1
  Resource Version:  21203
  Self Link:         /apis/kubeflow.org/v1alpha3/namespaces/kubeflow/experiments/bayesianoptimization-example
  UID:               02c79849-4388-11ea-9d91-42010a8e0025
Spec:
  Algorithm:
    Algorithm Name:  bayesianoptimization
    Algorithm Settings:
      Name:                random_state
      Value:               10
  Max Failed Trial Count:  3
  Max Trial Count:         12
  Objective:
    Additional Metric Names:
      Train-accuracy
    Goal:                   0.99
    Objective Metric Name:  Validation-accuracy
    Type:                   maximize
  Parallel Trial Count:     3
  Parameters:
    Feasible Space:
      Max:           0.03
      Min:           0.01
    Name:            --lr
    Parameter Type:  double
    Feasible Space:
      Max:           5
      Min:           2
    Name:            --num-layers
    Parameter Type:  int
    Feasible Space:
      List:
        sgd
        adam
        ftrl
    Name:            --optimizer
    Parameter Type:  categorical
  Trial Template:
    Go Template:
      Raw Template:  apiVersion: batch/v1
kind: Job
metadata:
  name: {{.Trial}}
  namespace: {{.NameSpace}}
spec:
  template:
    spec:
      containers:
      - name: {{.Trial}}
        image: docker.io/kubeflowkatib/mxnet-mnist
        command:
        - "python3"
        - "/opt/mxnet-mnist/mnist.py"
        - "--batch-size=64"
        {{- with .HyperParameters}}
        {{- range .}}
        - "{{.Name}}={{.Value}}"
        {{- end}}
        {{- end}}
      restartPolicy: Never
Status:
  Conditions:
    Last Transition Time:  2020-01-30T17:43:22Z
    Last Update Time:      2020-01-30T17:43:22Z
    Message:               Experiment is created
    Reason:                ExperimentCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2020-01-30T17:45:10Z
    Last Update Time:      2020-01-30T17:45:10Z
    Message:               Experiment is running
    Reason:                ExperimentRunning
    Status:                True
    Type:                  Running
  Current Optimal Trial:
    Observation:
      Metrics:              <nil>
    Parameter Assignments:  <nil>
  Start Time:               2020-01-30T17:43:22Z
  Trials:                   3
  Trials Pending:           3

Pod Description:

kubectl describe pod bayesianoptimization-example-mpn7zqlq-vflb7 -n kubeflow

Name:               bayesianoptimization-example-mpn7zqlq-vflb7
Namespace:          kubeflow
Priority:           0
PriorityClassName:  <none>
Node:               gke-kubeflow-app-e88-kubeflow-app-e88-8584f473-0r1h/10.142.15.208
Start Time:         Thu, 30 Jan 2020 10:45:10 -0700
Labels:             controller-uid=42fd0894-4388-11ea-9d91-42010a8e0025
                    job-name=bayesianoptimization-example-mpn7zqlq
Annotations:        <none>
Status:             Succeeded
IP:                 10.12.0.35
Controlled By:      Job/bayesianoptimization-example-mpn7zqlq
Containers:
  bayesianoptimization-example-mpn7zqlq:
    Container ID:  docker://95d4137b7a872329a98eb09bede346722c0591a8606defa98f3b8e974c5963f8
    Image:         docker.io/kubeflowkatib/mxnet-mnist
    Image ID:      docker-pullable://kubeflowkatib/mxnet-mnist@sha256:85e62e489033dd327e5db7322b636db15b7fe6b380c5846093926c66afb39d8a
    Port:          <none>
    Host Port:     <none>
    Command:
      python3
      /opt/mxnet-mnist/mnist.py
      --batch-size=64
      --lr=0.013884981186857928
      --num-layers=3
      --optimizer=adam
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 30 Jan 2020 10:45:27 -0700
      Finished:     Thu, 30 Jan 2020 10:47:46 -0700
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-c8fcw (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-c8fcw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-c8fcw
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age   From                                                          Message
  ----    ------     ----  ----                                                          -------
  Normal  Scheduled  10m   default-scheduler                                             Successfully assigned kubeflow/bayesianoptimization-example-mpn7zqlq-vflb7 to gke-kubeflow-app-e88-kubeflow-app-e88-8584f473-0r1h
  Normal  Pulling    10m   kubelet, gke-kubeflow-app-e88-kubeflow-app-e88-8584f473-0r1h  Pulling image "docker.io/kubeflowkatib/mxnet-mnist"
  Normal  Pulled     10m   kubelet, gke-kubeflow-app-e88-kubeflow-app-e88-8584f473-0r1h  Successfully pulled image "docker.io/kubeflowkatib/mxnet-mnist"
  Normal  Created    10m   kubelet, gke-kubeflow-app-e88-kubeflow-app-e88-8584f473-0r1h  Created container bayesianoptimization-example-mpn7zqlq
  Normal  Started    10m   kubelet, gke-kubeflow-app-e88-kubeflow-app-e88-8584f473-0r1h  Started container bayesianoptimization-example-mpn7zqlq

martham93 on Jan 30, 2020