katib: MetricsUnavailable for random example experiment
/kind bug
What steps did you take and what happened:
Installed Kubeflow on a clean EKS cluster using this guide: https://www.kubeflow.org/docs/aws/deploy/install-kubeflow/
Submitted random-experiment from the UI.
After an hour there is still no status, logs, or metrics.
The Pods that were spawned by the Trial’s Job do have logs.
What did you expect to happen: Metrics to show on Katib UI. Experiment to finish.
Anything else you would like to add:
Pod logs:
INFO:root:Epoch[19] Batch [100] Speed: 27953.82 samples/sec accuracy=0.116646
INFO:root:Epoch[19] Batch [200] Speed: 24880.15 samples/sec accuracy=0.111406
INFO:root:Epoch[19] Batch [300] Speed: 23859.95 samples/sec accuracy=0.112344
INFO:root:Epoch[19] Batch [400] Speed: 27594.30 samples/sec accuracy=0.115937
INFO:root:Epoch[19] Batch [500] Speed: 18158.40 samples/sec accuracy=0.115312
INFO:root:Epoch[19] Batch [600] Speed: 26611.55 samples/sec accuracy=0.102188
INFO:root:Epoch[19] Batch [700] Speed: 27180.25 samples/sec accuracy=0.114687
INFO:root:Epoch[19] Batch [800] Speed: 27309.44 samples/sec accuracy=0.113906
INFO:root:Epoch[19] Batch [900] Speed: 26656.13 samples/sec accuracy=0.105313
INFO:root:Epoch[19] Train-accuracy=0.122044
INFO:root:Epoch[19] Time cost=2.383
INFO:root:Epoch[19] Validation-accuracy=0.113854
Description of a Trial:
Name: hptest-bt8nsw2z
Namespace: kubeflow
Labels: experiment=hptest
Annotations: <none>
API Version: kubeflow.org/v1alpha3
Kind: Trial
Metadata:
  Creation Timestamp: 2020-01-28T16:18:39Z
  Finalizers:
    clean-metrics-in-db
  Generation: 1
  Owner References:
    API Version: kubeflow.org/v1alpha3
    Block Owner Deletion: true
    Controller: true
    Kind: Experiment
    Name: hptest
    UID: ab0e3811-41e9-11ea-a0cf-0a9ff0751f4a
  Resource Version: 127230
  Self Link: /apis/kubeflow.org/v1alpha3/namespaces/kubeflow/trials/hptest-bt8nsw2z
  UID: d810a019-41e9-11ea-a0cf-0a9ff0751f4a
Spec:
  Metrics Collector:
  Objective:
    Additional Metric Names:
      accuracy
    Goal: 0.99
    Objective Metric Name: Validation-accuracy
    Type: maximize
  Parameter Assignments:
    Name: --lr
    Value: 0.020744080613308936
    Name: --num-layers
    Value: 3
    Name: --optimizer
    Value: sgd
  Run Spec: apiVersion: batch/v1
kind: Job
metadata:
  name: hptest-bt8nsw2z
  namespace: kubeflow
spec:
  template:
    spec:
      containers:
      - name: hptest-bt8nsw2z
        image: docker.io/katib/mxnet-mnist-example
        command:
        - "python"
        - "/mxnet/example/image-classification/train_mnist.py"
        - "--batch-size=64"
        - "--lr=0.020744080613308936"
        - "--num-layers=3"
        - "--optimizer=sgd"
      restartPolicy: Never
Status:
  Conditions:
    Last Transition Time: 2020-01-28T16:18:39Z
    Last Update Time: 2020-01-28T16:18:39Z
    Message: Trial is created
    Reason: TrialCreated
    Status: True
    Type: Created
    Last Transition Time: 2020-01-28T16:19:43Z
    Last Update Time: 2020-01-28T16:19:43Z
    Message: Trial is running
    Reason: TrialRunning
    Status: False
    Type: Running
    Last Transition Time: 2020-01-28T16:19:43Z
    Last Update Time: 2020-01-28T16:19:43Z
    Message: Metrics are not available
    Reason: MetricsUnavailable
    Status: False
    Type: Succeeded
  Start Time: 2020-01-28T16:18:39Z
Events:
  Type     Reason              Age                 From              Message
  ----     ------              ----                ----              -------
  Warning  MetricsUnavailable  31m (x2 over 139m)  trial-controller  Metrics are not available for Job hptest-bt8nsw2z
Environment:
- Kubeflow version: 0.7.1, from https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_aws.0.7.1.yaml
- Minikube version: N/A. Deployed on EKS.
- Kubernetes version (use kubectl version): version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.9-eks-c0eccc", GitCommit:"c0eccca51d7500bb03b2f163dd8d534ffeb2f7a2", GitTreeState:"clean", BuildDate:"2019-12-22T23:14:11Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
- OS (e.g. from /etc/os-release): Amazon Linux (?)
About this issue
- State: closed
- Created 4 years ago
- Reactions: 2
- Comments: 28 (9 by maintainers)
Can you try to do this:
kubectl edit namespace kubeflow
control-plane=kubeflow
It may fix the problem with the MetricsCollector. You don’t need to make any changes to the Katib examples.
Sorry @andreyvelich, I misspecified the objective metric name, and that was what caused the issue. Thanks for the help and the rapid responses.
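A mismatch like this can be caught by comparing the objective metric name in the Experiment with the names the training container actually prints. The following is a minimal sketch, not Katib's own implementation: it assumes metrics show up in the logs as name=value pairs, as in the pod logs above, and the regular expression and the function names metric_names_in_logs and missing_metrics are illustrative assumptions.

```python
import re

# Assumed log format: "<name>=<number>", e.g. "Validation-accuracy=0.113854".
# This pattern is an illustration, not necessarily the collector's actual filter.
METRIC_PATTERN = re.compile(
    r"([\w-]+)\s*=\s*([+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)"
)


def metric_names_in_logs(log_lines):
    """Return the set of metric names that appear as name=value in the logs."""
    names = set()
    for line in log_lines:
        for name, _value in METRIC_PATTERN.findall(line):
            names.add(name)
    return names


def missing_metrics(log_lines, wanted):
    """Return the wanted metric names that never appear in the logs."""
    found = metric_names_in_logs(log_lines)
    return [name for name in wanted if name not in found]


# A few lines copied from the pod logs in this issue.
sample_logs = [
    "INFO:root:Epoch[19] Batch [900] Speed: 26656.13 samples/sec accuracy=0.105313",
    "INFO:root:Epoch[19] Train-accuracy=0.122044",
    "INFO:root:Epoch[19] Time cost=2.383",
    "INFO:root:Epoch[19] Validation-accuracy=0.113854",
]

# Names from the Trial spec are checked first, then a deliberately misspelled one.
print(missing_metrics(sample_logs, ["Validation-accuracy", "accuracy"]))
print(missing_metrics(sample_logs, ["validation_accuracy"]))
```

An empty list means every requested name was printed at least once. Matching is case-sensitive here, so a name that differs only in case or punctuation is reported as missing.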
Does the namespace have the label katib-metricscollector-injection: enabled?
It seems that the metrics collector is not injected successfully.
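One way to check this is to inspect the namespace's labels. The sketch below assumes the namespace object has been exported as JSON (for example with kubectl get namespace kubeflow -o json) and looks for the label named above. The function name has_injection_label and the sample object are hypothetical and only illustrate the check.

```python
import json

LABEL_KEY = "katib-metricscollector-injection"


def has_injection_label(namespace_json, expected="enabled"):
    """Return True if the namespace object carries the injection label."""
    namespace = json.loads(namespace_json)
    # "labels" may be absent or null on a namespace without labels.
    labels = namespace.get("metadata", {}).get("labels") or {}
    return labels.get(LABEL_KEY) == expected


# Hypothetical namespace object, trimmed to the fields this check reads.
sample = json.dumps({
    "apiVersion": "v1",
    "kind": "Namespace",
    "metadata": {"name": "kubeflow", "labels": {LABEL_KEY: "enabled"}},
})

print(has_injection_label(sample))
```

In practice the JSON string would come from the kubectl output rather than from a hand-written sample.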
/cc @hougangliu @johnugeorge
@andreyvelich I very much doubt this is the case, as all the example YAML Experiments were broken for me until I added
You should consider fixing the issue or updating the example manifests.
Hello, the same behavior occurs with the Bayesian optimization example: metrics aren’t being reported, so the experiment gets stuck after running the first three parallel trials.
Experiment Description:
Pod Description: