spark-on-k8s-operator: What am I doing wrong?
I am trying to run a PySpark application using the operator. It runs perfectly if I bake the Python application into the Spark image, but when I try to fetch the files from S3 I run into all sorts of issues. Please advise what I am doing wrong.
Here is my YAML file:
apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: generic-pyspark2.4.4
  namespace: random
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "pyspark-2.4.4-hadoop-2.7:v0.1"
  imagePullPolicy: Always
  class: org.apache.spark.deploy.PythonRunner
  mainApplicationFile: "s3a://buckets/pyspark/model.py"
  sparkConf:
    "spark.hadoop.fs.s3a.aws.credentials.provider": com.amazonaws.auth.InstanceProfileCredentialsProvider
    "spark.hadoop.fs.s3a.impl": org.apache.hadoop.fs.s3a.S3AFileSystem
    "spark.shuffle.service.enabled": "false"
    "spark.speculation": "false"
  deps:
    pyFiles:
      - "s3a://buckets/pyspark/aws_utils.py"
      - "s3a://buckets/pyspark/dataset.py"
  sparkVersion: "2.4.4"
  driver:
    cores: 2
    # coreLimit: "1200m"
    memory: "1024m"
    labels:
      version: 2.4.4
    serviceAccount: sparkoperator
  executor:
    cores: 4
    instances: 5
    memory: "10240m"
    labels:
      version: 2.4.4
I am using:
- Spark 2.4.4
- aws-java-sdk-1.7.3.jar
- hadoop-aws-2.7.3.jar
- Scala 2.11
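For reference, a minimal sketch of how these jars might be baked into the Spark image (the base image name is an assumption taken from the manifest above; note that hadoop-aws 2.7.x was compiled against aws-java-sdk 1.7.4, so a 1.7.3 SDK jar is worth double-checking, since version mismatches commonly surface as NoSuchMethodError from the S3A filesystem):

# Sketch only: the base image name is an assumption, substitute your own.
FROM pyspark-2.4.4-hadoop-2.7:v0.1

# Put the S3A dependencies on Spark's classpath (/opt/spark/jars is the
# usual location in Spark images). hadoop-aws 2.7.x was built against
# aws-java-sdk 1.7.4.
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar /opt/spark/jars/
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar /opt/spark/jars/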
Ahh… that's exactly what it was. When I updated my Helm installation command to provide a pre-existing service account with an IAM role attached to it, it worked fine. Thanks a ton for the guidance @bbenzikry. Very grateful 😃 Cheers 😃
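For anyone landing here later, on EKS the "pre-existing service account with an IAM role attached" piece looks roughly like the sketch below (the name, namespace, account ID, and role name are placeholders); the driver spec then points at it via serviceAccount:

# Sketch: a service account bound to an IAM role via IRSA on EKS.
# The account ID, role name, name, and namespace are placeholders.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: random
  annotations:
    # EKS injects web-identity credentials into pods that use this service account
    eks.amazonaws.com/role-arn: arn:aws:iam::<account-id>:role/<spark-s3-role>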
@JunaidChaudry You can take a look at https://github.com/bbenzikry/spark-eks/blob/main/docker/spark3.Dockerfile for reference
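With an IRSA-backed service account in place, the credentials provider in sparkConf also has to change, because InstanceProfileCredentialsProvider only reads the node's instance profile. A sketch, assuming an AWS SDK recent enough to ship WebIdentityTokenCredentialsProvider (1.7.x is not):

sparkConf:
  "spark.hadoop.fs.s3a.impl": org.apache.hadoop.fs.s3a.S3AFileSystem
  # With IRSA, pods authenticate through the projected web-identity token
  # rather than the node's instance profile:
  "spark.hadoop.fs.s3a.aws.credentials.provider": com.amazonaws.auth.WebIdentityTokenCredentialsProvider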