emr-serverless-samples: The suggested way of using Python libraries with EMR Serverless does not work

As detailed here and here we should be able to install and use a python venv:

--conf spark.archives=s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/pyspark_venv.tar.gz#environment 
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python 
--conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python

but that doesn’t seem to work. The application fails with this error:

Unpacking an archive s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/pyspark_venv.tar.gz#environment from /tmp/spark-02908b0e-9b64-469d-xxx-xxxxxxxx/pyspark_venv.tar.gz to /home/hadoop/./environment
Exception in thread "main" java.io.IOException: Cannot run program "./environment/bin/python": error=2, No such file or directory
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
	at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:105)
	at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1003)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1092)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1101)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=2, No such file or directory
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
	at java.lang.ProcessImpl.start(ProcessImpl.java:134)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
	... 14 more
22/07/20 05:36:18 INFO ShutdownHookManager: Shutdown hook called
22/07/20 05:36:18 INFO ShutdownHookManager: Deleting directory /tmp/spark-02908b0e-9b64-469d-b094-edee291a2426

About this issue

Original URL
State: closed
Created 2 years ago
Comments: 24 (10 by maintainers)

Most upvoted comments

No module named 'boto3', so have to included it in the tar.gz. Should’nt that be already available to import?

If you’re using a custom python version, you likely need to install boto3. I believe boto3 is available in the base image, but with a custom python version, I’m guessing it also uses the libraries installed only for that version.

FYI, you can now also use custom images with EMR Serverless, which might make installing dependencies a bit easier. 😃 See the build a data science image for details and the docs on how to use custom images.

Note that the image url in the first link is slightly incorrect - it should be spark not spark3 - we’re updating the docs on that shortly

dacort on Jan 9, 2023

Thanks for following up. I do think we need to make it more clear in the docs - I’ve had other folks run into similar issues.

dacort on Jul 21, 2022