spark-on-k8s-operator: Spark operator times out during pod creation after job submission on GKE

Summary

The Spark operator times out during pod creation when a job is submitted on a GKE cluster (1.12.7-gke.24).

Steps to reproduce

Install the operator:

helm install incubator/sparkoperator --namespace spark --name spark-operator --set sparkJobNamespace=spark --set enableMetrics=true --set enableWebhook=true
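Before submitting anything, it is worth confirming that the operator pod and its webhook server came up. The deployment name below is a guess based on the release name and the resource names shown further down:

➜ kubectl get pods -n spark
➜ kubectl logs -n spark deploy/spark-operator-sparkoperator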

Run a job that mounts a hostPath volume:

apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: {{ item.name }}
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0-gcs-prometheus"
  imagePullPolicy: Always
  mainClass: nl.debijenkorf.core.SparkApplication
  mainApplicationFile: "gs://BUCKET/APPLICATION.jar"
  sparkVersion: "2.4.0"
  arguments: 
    - {{ item.job }}
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/spark/data"
        type: Directory  
  driver:
    {..}
  executor:
    {..}
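The {{ item.name }} and {{ item.job }} fields are template placeholders (this looks like it comes from an Ansible loop) and have to be rendered with concrete values before the manifest is applied. Once rendered, submitting and checking the job is just the following, where spark-application.yaml is a placeholder filename:

➜ kubectl apply -f spark-application.yaml
➜ kubectl get sparkapplications -n spark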

Actual results

I0702 08:24:10.298358      11 controller.go:522] Trying to update SparkApplication spark/test-volume-claim, from: [{  2019-07-02 07:26:25 +0000 UTC 0001-01-01 00:00:00 +0000 UTC { 0    } {FAILED failed to run spark-submit for SparkApplication spark/test-volume-claim: Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create]  for kind: [Pod]  with name: [null]  in namespace: [spark]  failed.
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:62)
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:71)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:363)
	at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$2.apply(KubernetesClientApplication.scala:141)
	at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$2.apply(KubernetesClientApplication.scala:140)
	at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
	at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:140)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:250)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:241)
	at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.SocketTimeoutException: timeout
	at okio.Okio$4.newTimeoutException(Okio.java:230)
	at okio.AsyncTimeout.exit(AsyncTimeout.java:285)
	at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)
	at okio.RealBufferedSource.indexOf(RealBufferedSource.java:345)
	at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:217)
	at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:211)
	at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
	at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at io.fabric8.kubernetes.client.utils.HttpClientUtils$2.intercept(HttpClientUtils.java:93)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
	at okhttp3.RealCall.execute(RealCall.java:69)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:377)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:343)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:226)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:769)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:356)
	... 16 more
Caused by: java.net.SocketException: Socket closed
	at java.net.SocketInputStream.read(SocketInputStream.java:204)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
	at sun.security.ssl.InputRecord.read(InputRecord.java:503)
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
	at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940)
	at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
	at okio.Okio$2.read(Okio.java:139)
	at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
	... 43 more

The timeout occurs while spark-submit waits for the API server to admit the driver pod, which points at the mutating webhook. The webhook configuration itself seems to be there:

➜ kubectl get MutatingWebhookConfiguration
NAME                                          CREATED AT
spark-operator-sparkoperator-webhook-config   2019-07-02T07:24:40Z
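Registration alone does not prove the API server can actually reach the webhook, though. Dumping the configuration shows which service the admission call targets and what is supposed to happen when the call fails (the object name is taken from the output above):

➜ kubectl get mutatingwebhookconfiguration spark-operator-sparkoperator-webhook-config -o yaml
# inspect webhooks[].clientConfig.service and webhooks[].failurePolicy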

The same applies to the webhook service:

➜ kubectl get svc -n spark
NAME                     TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)   AGE
spark-operator-webhook   ClusterIP   10.249.0.99   <none>        443/TCP   67m
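It may also be worth comparing the service's target port with the port the operator's webhook server actually listens on; a mismatch, or a port the GKE master cannot reach, would stall the admission call and produce exactly this kind of timeout:

➜ kubectl get svc spark-operator-webhook -n spark -o yaml
# compare spec.ports[].port and spec.ports[].targetPort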

Does anyone have an idea what the issue could be?

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 28 (16 by maintainers)

Most upvoted comments

@zwennesm @doctapp I think you both ran into the same issue with webhooks on a private cluster. The default port for the webhook service is 8080, which does not appear to be allowed on private clusters according to https://github.com/kubernetes/autoscaler/issues/1820. We do allow changing the port through the webhookPort parameter. Can you try the Helm installation with the following flag?

--set webhookPort=443
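For reference, a full install command with that flag, reusing the release and namespace names from the report, would look like this:

helm install incubator/sparkoperator --namespace spark --name spark-operator --set sparkJobNamespace=spark --set enableMetrics=true --set enableWebhook=true --set webhookPort=443

Alternatively, since GKE private cluster masters can only reach nodes on ports 443 and 10250 by default, the original webhook port can be opened with a firewall rule from the master CIDR to the node pool. This is a sketch with placeholder network, CIDR, and node-tag values that you would substitute for your cluster:

gcloud compute firewall-rules create allow-spark-webhook \
    --network=YOUR_NETWORK \
    --source-ranges=MASTER_IPV4_CIDR \
    --target-tags=YOUR_NODE_TAG \
    --allow=tcp:8080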