spark-on-k8s-operator: Cannot launch driver after driver CPU value changed to int32

Hey there!

Firstly, thank you for everything you're trying to achieve. For those of us carving our own path, projects like this are a great way for newcomers to the data engineering space to get started running clusters and building great data experiences 😃

Here’s the problem…

I’ve noticed that the SparkApplication specification has changed the driver’s number of CPUs from a float to an integer. This was introduced with the latest API version (v1beta2) in https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/pull/578.
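
If I’m reading the PR correctly (this is just my interpretation, not something stated in the docs), a fractional value that used to validate under v1beta1 can no longer be expressed under v1beta2. The coreLimit value below is the one I recall from the bundled spark-pi example, so treat it as illustrative:

    # v1beta1 — cores was a float, sub-core values were accepted
    spec:
      driver:
        cores: 0.1
        coreLimit: "200m"

    # v1beta2 — cores is an int32, so the smallest expressible request is 1
    spec:
      driver:
        cores: 1
        coreLimit: "200m"   # unchanged, and now smaller than the request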

I’m fairly new to Kubernetes, but this seems to create a conflict when the operator submits the driver pod, where I’m seeing the following:

Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/spark-operator/pods. Message: Pod "spark-pi-driver" is invalid: spec.containers[0].resources.requests: Invalid value: "1": must be less than or equal to cpu limit. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=spec.containers[0].resources.requests, message=Invalid value: "1": must be less than or equal to cpu limit, reason=FieldValueInvalid, additionalProperties={})], group=null, kind=Pod, name=spark-pi-driver, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Pod "spark-pi-driver" is invalid: spec.containers[0].resources.requests: Invalid value: "1": must be less than or equal to cpu limit, metadata=ListMeta(_continue=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).

The key part is likely Invalid value: "1": must be less than or equal to cpu limit. With the new integer type, 1 appears to be both the minimum and the default value for the driver cores, while the CPU limit on the generated pod is apparently still lower than that, so the request exceeds the limit and the pod is rejected. Perhaps this is related to https://github.com/kubernetes/kubernetes/issues/51430. I’m not sure how to resolve this; it may be my Kubernetes configuration at fault here more than anything.
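
In case it helps anyone else reproduce or work around this: my guess (and it is only a guess, I haven’t confirmed the operator’s internals) is that the request now comes from the integer cores value while the limit still comes from a sub-core coreLimit, so raising coreLimit to at least match cores should satisfy the admission check. A minimal sketch of what I mean, assuming a driver block like the examples:

    spec:
      driver:
        cores: 1
        coreLimit: "1200m"   # any value >= the cores request, e.g. "1" or "1200m"
        memory: "512m"
      executor:
        cores: 1
        coreLimit: "1200m"
        memory: "512m"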

Environment

Here is the environment I’m running (from the pod)…

λ kubectl describe pod spark-sparkoperator-7c6d6f9cfd-6257n
Name:               spark-sparkoperator-7c6d6f9cfd-6257n
Namespace:          spark-operator
Priority:           0
PriorityClassName:  <none>
Node:               docker-desktop/192.168.65.3
Start Time:         Sun, 08 Dec 2019 01:10:57 +0000
Labels:             app.kubernetes.io/name=sparkoperator
                    app.kubernetes.io/version=v1beta2-1.0.1-2.4.4
                    pod-template-hash=7c6d6f9cfd
Annotations:        prometheus.io/path: /metrics
                    prometheus.io/port: 10254
                    prometheus.io/scrape: true
Status:             Running
IP:                 10.1.0.9
Controlled By:      ReplicaSet/spark-sparkoperator-7c6d6f9cfd
Containers:
  sparkoperator:
    Container ID:  docker://96d7a6908bad62e35fcfd530ca5337073a27602c468f2b7580f65cce4c48fd38
    Image:         gcr.io/spark-operator/spark-operator:v1beta2-1.0.1-2.4.4
    Image ID:      docker-pullable://gcr.io/spark-operator/spark-operator@sha256:ce769e5c6a5d8fa78ceb1a0abaf961fb2424767f9535c97baac04a18169654bd
    Port:          10254/TCP
    Host Port:     0/TCP
    Args:
      -v=2
      -namespace=
      -ingress-url-format=
      -controller-threads=10
      -resync-interval=30
      -logtostderr
      -enable-metrics=true
      -metrics-labels=app_type
      -metrics-port=10254
      -metrics-endpoint=/metrics
      -metrics-prefix=
    State:          Running
      Started:      Sun, 08 Dec 2019 01:10:58 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from spark-sparkoperator-token-w7dmr (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  spark-sparkoperator-token-w7dmr:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  spark-sparkoperator-token-w7dmr
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age    From                     Message
  ----    ------     ----   ----                     -------
  Normal  Scheduled  3m11s  default-scheduler        Successfully assigned spark-operator/spark-sparkoperator-7c6d6f9cfd-6257n to docker-desktop
  Normal  Pulled     3m10s  kubelet, docker-desktop  Container image "gcr.io/spark-operator/spark-operator:v1beta2-1.0.1-2.4.4" already present on machine
  Normal  Created    3m10s  kubelet, docker-desktop  Created container sparkoperator
  Normal  Started    3m10s  kubelet, docker-desktop  Started container sparkoperator

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 17 (8 by maintainers)

Most upvoted comments

This is due to a recent change in https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/pull/578 that introduced the v1beta2 version of the API. The type of .spec.driver.cores was changed to an integer to be consistent with the Spark config property spark.driver.cores, which is an integer. Spark 3.0 will have a new config property, spark.kubernetes.driver.request.cores, for setting the CPU request of the driver pod independently from spark.driver.cores (which is what .spec.driver.cores maps to). It supports Kubernetes-conformant values such as 0.1 and 100m. We will add support for it soon.
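
For reference, once running Spark 3.0 the new property could presumably be passed through the SparkApplication’s sparkConf map until the operator grows first-class support for it. This is only a sketch under that assumption; I haven’t verified that the operator forwards it unchanged:

    spec:
      sparkConf:
        "spark.kubernetes.driver.request.cores": "0.1"   # fractional CPU request for the driver pod
      driver:
        cores: 1          # still the integer value mapped to spark.driver.cores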