kubeflow: Cannot launch TensorFlow Serving because AVX is not available on the VM

Hi all,

I followed the user guide to serve-a-model-using-tensorflow-serving, but the tf-serving pod never becomes ready: it keeps restarting and ends up in CrashLoopBackOff. I set up Kubeflow on my own Kubernetes cluster, not on GKE or minikube. How can I fix this error? Because there are no logs in the pod, I cannot provide any other useful information. Thanks.
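
In case it helps others reproduce this: kubectl logs <pod> --previous is the usual way to see output from a crashed container attempt; in my case it printed nothing:

root@vagrant:~# kubectl logs inception-858476d4c4-cr49s --previous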

root@vagrant:~# kubectl get pod
NAME                              READY     STATUS             RESTARTS   AGE
ambassador-6ccb864c46-268xt       2/2       Running            0          10m
ambassador-6ccb864c46-9hrsh       2/2       Running            0          10m
ambassador-6ccb864c46-m7gc7       2/2       Running            0          10m
inception-858476d4c4-cr49s        0/1       CrashLoopBackOff   6          10m
tf-hub-0                          1/1       Running            0          10m
tf-job-operator-78757955b-nk457   1/1       Running            0          10m

root@vagrant:~# kubectl describe pod inception-858476d4c4-cr49s
Name:           inception-858476d4c4-cr49s
Namespace:      default
Node:           192.168.2.21/192.168.2.21
Start Time:     Wed, 14 Mar 2018 18:33:26 +0000
Labels:         app=inception
                pod-template-hash=4140328070
Annotations:    <none>
Status:         Running
IP:             172.20.0.152
Controlled By:  ReplicaSet/inception-858476d4c4
Containers:
  inception:
    Container ID:  docker://f59a9db5a3cf6f2e33671ba30ff71d3e34e32e1835ab908c4250f3c7f93f8c75
    Image:         gcr.io/kubeflow-images-staging/tf-model-server:v20180227-master
    Image ID:      docker-pullable://gcr.io/kubeflow-images-staging/tf-model-server@sha256:07ded66bc3a8e5ca582c3dc1871c2636a4ebc06b082e3afd76a68c45f1953385
    Port:          9000/TCP
    Args:
      /usr/bin/tensorflow_model_server
      --port=9000
      --model_name=inception
      --model_base_path=gs://kubeflow-models/inception
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    132
      Started:      Wed, 14 Mar 2018 18:39:24 +0000
      Finished:     Wed, 14 Mar 2018 18:39:24 +0000
    Ready:          False
    Restart Count:  6
    Limits:
      cpu:     4
      memory:  4Gi
    Requests:
      cpu:        1
      memory:     1Gi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-pb8qz (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  default-token-pb8qz:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-pb8qz
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     <none>
Events:
  Type     Reason                 Age               From                   Message
  ----     ------                 ----              ----                   -------
  Normal   Scheduled              9m                default-scheduler      Successfully assigned inception-858476d4c4-cr49s to 192.168.2.21
  Normal   SuccessfulMountVolume  9m                kubelet, 192.168.2.21  MountVolume.SetUp succeeded for volume "default-token-pb8qz"
  Normal   Pulled                 7m (x5 over 9m)   kubelet, 192.168.2.21  Container image "gcr.io/kubeflow-images-staging/tf-model-server:v20180227-master" already present on machine
  Normal   Created                7m (x5 over 9m)   kubelet, 192.168.2.21  Created container
  Normal   Started                7m (x5 over 9m)   kubelet, 192.168.2.21  Started container
  Warning  BackOff                4m (x24 over 9m)  kubelet, 192.168.2.21  Back-off restarting failed container
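
Note the exit code 132 above: on Linux that is 128 + 4, meaning the process was killed by signal 4 (SIGILL, illegal instruction). That is the typical symptom of a binary built with CPU extensions (such as AVX) running on a CPU that lacks them. You can confirm the signal name in bash:

root@vagrant:~# kill -l 132
ILL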

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 16 (14 by maintainers)

Most upvoted comments

Thanks for your useful suggestion.

After reading the tf-serving Dockerfile, I noticed that it installs tensorflow-model-server. The tf-serving install guide also contains this note:

Note: In the above commands, replace tensorflow-model-server with tensorflow-model-server-universal if your processor does not support AVX instructions.

I found that my VM does not support the AVX instruction set, so I replaced tensorflow-model-server with tensorflow-model-server-universal in the Dockerfile and rebuilt the image. After that, the tf-serving Pod starts successfully.
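
For anyone hitting the same issue, a quick way to check whether the CPU (or the hypervisor's virtual CPU) exposes AVX is to look for the avx flag in /proc/cpuinfo; if the command below prints nothing, AVX is unavailable:

root@vagrant:~# grep -o 'avx[^ ]*' /proc/cpuinfo | sort -u

The relevant change in the Dockerfile: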

...
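# Install the universal build, which does not require AVX, instead of tensorflow-model-server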
RUN apt-get update && apt-get install -y \
        tensorflow-model-server-universal  && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
...
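
Rebuilding and pointing the Deployment at the new image looked roughly like this; the registry and tag are placeholders, and I am assuming the Deployment is named inception, as the ReplicaSet name above suggests:

root@vagrant:~# docker build -t <your-registry>/tf-model-server-universal:v1 .
root@vagrant:~# docker push <your-registry>/tf-model-server-universal:v1
root@vagrant:~# kubectl set image deployment/inception inception=<your-registry>/tf-model-server-universal:v1

After that the Pod comes up: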
root@vagrant:~/my-kubeflow# kubectl get pod
NAME                         READY     STATUS    RESTARTS   AGE
inception-7f78798dd8-7rlnc   1/1       Running   0          3s