TensorFlowOnSpark: Job failing on CDH 5.8.2 - Executor Heartbeat timing out

I’m running the command below:

spark-submit --master yarn --deploy-mode client \
  --queue cpu \
  --num-executors 2 \
  --executor-memory 4G \
  --py-files TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.yarn.maxAppAttempts=1 \
  --conf spark.executor.heartbeatInterval=1200s \
  --conf spark.executorEnv.LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib:/usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server" \
  TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
  --images hdfs:///user/mayub/mnist/csv/train/images \
  --labels hdfs:///user/mayub/mnist/csv/train/labels \
  --mode train \
  --model hdfs:///user/mayub//mnist/mnist_model2

After waiting for a while, the job fails with the following error: “Removing executor 2 with no recent heartbeats: 172657 ms exceeds timeout 120000 ms”
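One detail worth noting (my observation, not stated in the thread): the 120000 ms in the message is Spark’s default spark.network.timeout (120s), and the Spark docs say spark.executor.heartbeatInterval should be significantly lower than spark.network.timeout, so setting the heartbeat interval to 1200s makes this timeout almost inevitable. A minimal sketch of the safer arrangement keeps the default heartbeat and, if executors are legitimately busy for long stretches, raises the network timeout instead (all paths and other settings are taken from the command above):

# Sketch: drop the oversized heartbeat interval; raise the overall network
# timeout instead so busy executors are not declared dead.
spark-submit --master yarn --deploy-mode client \
  --queue cpu --num-executors 2 --executor-memory 4G \
  --py-files TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.yarn.maxAppAttempts=1 \
  --conf spark.network.timeout=600s \
  --conf spark.executorEnv.LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib:/usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server" \
  TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
  --images hdfs:///user/mayub/mnist/csv/train/images \
  --labels hdfs:///user/mayub/mnist/csv/train/labels \
  --mode train --model hdfs:///user/mayub//mnist/mnist_model2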

I have tried running this in a couple of ways:

Option 1: client mode

Variables:
  LIB_HDFS=/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/
  LIB_JVM=/usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server

Command: same as the initial command above.

Error: “Executor Heartbeat timing out” (see log file spark_client_mode.txt).

Option 2: cluster mode

Variables:
  LIB_HDFS=/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/
  LIB_JVM=/usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server

Command:

spark-submit --master yarn --deploy-mode cluster \
  --queue cpu \
  --num-executors 1 \
  --executor-memory 2G \
  --py-files TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.yarn.maxAppAttempts=1 \
  --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_JVM:$LIB_HDFS \
  TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
  --images hdfs:///user/mayub/mnist/csv/train/images \
  --labels hdfs:///user/mayub/mnist/csv/train/labels \
  --mode train \
  --model hdfs:///user/mayub//mnist/mnist_model2

Error: the job just hangs and then fails after running for a while (see log file spark_hanging_job.txt).
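(Not from the original report, but a useful sketch for a hang like this: the executor-side failure reason usually shows up in the aggregated YARN container logs, which can be pulled once the application has died. The application ID below is a placeholder.)

# List recent applications to find the ID, then pull its aggregated logs:
yarn application -list -appStates ALL
yarn logs -applicationId application_XXXXXXXXXXXXX_XXXX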

Option 3: with additional Cloudera configs

Variables:
  LIB_HDFS=/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/
  LIB_JVM=/usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server

Command: same as Option 1. Error: same as Option 1.

To verify that other spark-submit jobs run fine on this cluster, I ran a Spark word-count example, and it completed successfully.
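For reference, a sanity check of that sort looks roughly like the following; the examples-jar path is an assumption for a CDH parcel install and the input file is a placeholder, so adjust both to your cluster:

# Run the stock WordCount example against any existing HDFS text file:
spark-submit --master yarn --deploy-mode client \
  --class org.apache.spark.examples.JavaWordCount \
  /opt/cloudera/parcels/CDH/lib/spark/lib/spark-examples.jar \
  hdfs:///user/mayub/some_input.txt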

Appreciate any help.


Most upvoted comments

Just to summarize the issue and resolution for future reference. I removed the following configurations from the command, since --num-executors 4 implicitly sets dynamic allocation to false:

  --conf spark.dynamicAllocation.enabled=false
  --conf spark.dynamicAllocation.maxExecutors=4
  --conf spark.dynamicAllocation.minExecutors=4
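(An aside, not from the original comment: spark-submit’s --verbose flag prints the parsed arguments and effective Spark properties at launch, which is a quick way to confirm that the remaining settings take effect as expected. A lightweight sketch using SparkPi from the bundled examples jar; the jar path is an assumption for a CDH parcel install:)

# --verbose makes spark-submit echo its parsed arguments and Spark properties:
spark-submit --verbose --master yarn --deploy-mode cluster \
  --num-executors 4 \
  --class org.apache.spark.examples.SparkPi \
  /opt/cloudera/parcels/CDH/lib/spark/lib/spark-examples.jar 10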

Also, I removed the absolute hdfs:/// NameNode prefix from the train, test, model, and output HDFS directories, since paths without the scheme prefix resolve successfully.
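(Another aside: unprefixed absolute paths like /user/mayub/... resolve against fs.defaultFS, so they should point at the same files as the hdfs:/// form; a quick way to confirm both sides agree:)

# Both listings should show identical contents if the paths resolve the same way:
hdfs dfs -ls /user/mayub/mnist/csv/train/images
hdfs dfs -ls hdfs:///user/mayub/mnist/csv/train/images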

I had to tune the following parameters to match the settings of my Cloudera cluster (please don’t blindly use the ones from the example):

  --executor-memory 8G
  --driver-memory 4G
  --conf spark.yarn.executor.memoryOverhead=1600
  --conf spark.yarn.driver.memoryOverhead=720
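(For context, my arithmetic rather than anything from the thread: YARN sizes each container as the JVM heap plus the memory overhead, and every request has to fit under the scheduler’s maximum allocation.)

# Container requests implied by the settings above:
#   executor container: 8192 MB heap + 1600 MB overhead = 9792 MB
#   driver container:   4096 MB heap +  720 MB overhead = 4816 MB
# The client config dir below is the usual CDH location (an assumption);
# check the cluster's per-container ceiling there:
grep -A1 yarn.scheduler.maximum-allocation-mb /etc/hadoop/conf/yarn-site.xml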

I also added permissions to the model output folder:

  hdfs dfs -chmod 777 /user/mayub/mnist/mnist_model

Here is the final command that worked in both ‘client’ and ‘cluster’ mode.

Training:

spark-submit --master yarn --deploy-mode cluster \
  --queue cpu \
  --num-executors 4 \
  --executor-memory 8G \
  --driver-memory 4G \
  --py-files TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
  --conf spark.yarn.maxAppAttempts=1 \
  --conf spark.yarn.executor.memoryOverhead=1600 \
  --conf spark.yarn.driver.memoryOverhead=720 \
  --conf spark.executorEnv.LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib:/usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server" \
  TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
  --images /user/mayub/mnist/csv/test/images \
  --labels /user/mayub/mnist/csv/test/labels \
  --mode train \
  --model /user/mayub/mnist/mnist_model

Inference:

spark-submit --master yarn --deploy-mode cluster \
  --queue cpu \
  --num-executors 4 \
  --executor-memory 8G \
  --driver-memory 4G \
  --py-files TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
  --conf spark.yarn.executor.memoryOverhead=1600 \
  --conf spark.yarn.driver.memoryOverhead=720 \
  --conf spark.executorEnv.LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib:/usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server" \
  TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
  --images /user/mayub/mnist/csv/test/images \
  --labels /user/mayub/mnist/csv/test/labels \
  --mode inference \
  --model /user/mayub/mnist/mnist_model \
  --output /user/mayub/mnist/predictions

@leewyang Thanks for your help. I’ll go ahead and close this, but I’d like your thoughts on my earlier comment about the ‘yarn’ user.