TensorFlowOnSpark: Getting stuck in TFoS on a Spark Standalone cluster

Hi all, I am new to TFoS and have been stuck on a standalone cluster for a week. After running the conversion step, I have the CSV files on my HDFS. However, when I run the MNIST training step like this:

${SPARK_HOME}/bin/spark-submit \
--master spark://zeka-virtual-machine:7077 \
--py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=1 \
--conf spark.task.cpus=1 \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size 1 \
--images examples/mnist/csv/train/images \
--labels examples/mnist/csv/train/labels \
--format csv \
--mode train \
--model mnist_model

I get the following error message, and I am wondering how I can fix it:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/11/13 17:58:37 WARN util.Utils: Your hostname, zeka-virtual-machine resolves to a loopback address: 127.0.1.1; using 172.16.99.129 instead (on interface ens33)
17/11/13 17:58:37 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/11/13 17:58:41 INFO spark.SparkContext: Running Spark version 2.1.2
17/11/13 17:58:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/13 17:58:42 INFO spark.SecurityManager: Changing view acls to: hadoop
17/11/13 17:58:42 INFO spark.SecurityManager: Changing modify acls to: hadoop
17/11/13 17:58:42 INFO spark.SecurityManager: Changing view acls groups to: 
17/11/13 17:58:42 INFO spark.SecurityManager: Changing modify acls groups to: 
17/11/13 17:58:42 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
17/11/13 17:58:42 INFO util.Utils: Successfully started service 'sparkDriver' on port 46877.
17/11/13 17:58:42 INFO spark.SparkEnv: Registering MapOutputTracker
17/11/13 17:58:42 INFO spark.SparkEnv: Registering BlockManagerMaster
17/11/13 17:58:42 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/11/13 17:58:42 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/11/13 17:58:42 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-6bc14546-4631-404e-b5bb-bdbfc64ec142
17/11/13 17:58:42 INFO memory.MemoryStore: MemoryStore started with capacity 413.9 MB
17/11/13 17:58:42 INFO spark.SparkEnv: Registering OutputCommitCoordinator
17/11/13 17:58:43 INFO util.log: Logging initialized @6308ms
17/11/13 17:58:43 INFO server.Server: jetty-9.2.z-SNAPSHOT
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3ffe09eb{/jobs,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1e2b72d4{/jobs/json,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@655c03{/jobs/job,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@59f910ed{/jobs/job/json,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1607eb68{/stages,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@53ff96ae{/stages/json,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@79504d96{/stages/stage,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@683f5429{/stages/stage/json,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@41cabeed{/stages/pool,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6eaf1a9c{/stages/pool/json,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@15e8d622{/storage,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@76456aa{/storage/json,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@134dbd7d{/storage/rdd,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5203e72{/storage/rdd/json,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@562aa132{/environment,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@73de7c5b{/environment/json,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@d068108{/executors,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@9d1fd74{/executors/json,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@af1ee4d{/executors/threadDump,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@a4d77ec{/executors/threadDump/json,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@337aa96c{/static,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@35c6d82a{/,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@30c0db59{/api,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@51d328ba{/jobs/job/kill,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3784da25{/stages/stage/kill,null,AVAILABLE,@Spark}
17/11/13 17:58:43 INFO server.ServerConnector: Started Spark@12f01f6{HTTP/1.1}{0.0.0.0:4040}
17/11/13 17:58:43 INFO server.Server: Started @6561ms
17/11/13 17:58:43 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
17/11/13 17:58:43 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://172.16.99.129:4040
17/11/13 17:58:43 INFO spark.SparkContext: Added file file:/home/hadoop/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py at spark://172.16.99.129:46877/files/mnist_spark.py with timestamp 1510567123776
17/11/13 17:58:43 INFO util.Utils: Copying /home/hadoop/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py to /tmp/spark-11fb52a9-864e-4553-ae30-ca785ac0a9bd/userFiles-61ff32dc-8a77-42ed-9a3c-5187b038b9cb/mnist_spark.py
17/11/13 17:58:43 INFO spark.SparkContext: Added file file:/home/hadoop/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py at spark://172.16.99.129:46877/files/mnist_dist.py with timestamp 1510567123815
17/11/13 17:58:43 INFO util.Utils: Copying /home/hadoop/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py to /tmp/spark-11fb52a9-864e-4553-ae30-ca785ac0a9bd/userFiles-61ff32dc-8a77-42ed-9a3c-5187b038b9cb/mnist_dist.py
17/11/13 17:58:44 INFO client.StandaloneAppClient$ClientEndpoint: Connecting to master spark://zeka-virtual-machine:7077...
17/11/13 17:58:44 INFO client.TransportClientFactory: Successfully created connection to zeka-virtual-machine/127.0.1.1:7077 after 56 ms (0 ms spent in bootstraps)
17/11/13 17:58:44 INFO cluster.StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20171113175844-0000
17/11/13 17:58:44 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41447.
17/11/13 17:58:44 INFO netty.NettyBlockTransferService: Server created on 172.16.99.129:41447
17/11/13 17:58:44 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/11/13 17:58:44 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 172.16.99.129, 41447, None)
17/11/13 17:58:44 INFO storage.BlockManagerMasterEndpoint: Registering block manager 172.16.99.129:41447 with 413.9 MB RAM, BlockManagerId(driver, 172.16.99.129, 41447, None)
17/11/13 17:58:44 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 172.16.99.129, 41447, None)
17/11/13 17:58:44 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, 172.16.99.129, 41447, None)
17/11/13 17:58:44 INFO client.StandaloneAppClient$ClientEndpoint: Executor added: app-20171113175844-0000/0 on worker-20171113174403-172.16.99.129-39407 (172.16.99.129:39407) with 1 cores
17/11/13 17:58:44 INFO cluster.StandaloneSchedulerBackend: Granted executor ID app-20171113175844-0000/0 on hostPort 172.16.99.129:39407 with 1 cores, 1024.0 MB RAM
17/11/13 17:58:44 INFO client.StandaloneAppClient$ClientEndpoint: Executor updated: app-20171113175844-0000/0 is now RUNNING
17/11/13 17:58:45 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@c9966de{/metrics/json,null,AVAILABLE,@Spark}
17/11/13 17:58:45 INFO cluster.StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
args: Namespace(batch_size=100, cluster_size=1, epochs=1, format='csv', images='examples/mnist/csv/train/images', labels='examples/mnist/csv/train/labels', mode='train', model='mnist_model', output='predictions', rdma=False, readers=1, steps=1000, tensorboard=False)
2017-11-13T17:58:45.508449 ===== Start
17/11/13 17:58:46 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 237.1 KB, free 413.7 MB)
17/11/13 17:58:47 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.0 KB, free 413.7 MB)
17/11/13 17:58:47 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.16.99.129:41447 (size: 23.0 KB, free: 413.9 MB)
17/11/13 17:58:47 INFO spark.SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:0
17/11/13 17:58:47 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 237.2 KB, free 413.4 MB)
17/11/13 17:58:47 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 23.0 KB, free 413.4 MB)
17/11/13 17:58:47 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.16.99.129:41447 (size: 23.0 KB, free: 413.9 MB)
17/11/13 17:58:47 INFO spark.SparkContext: Created broadcast 1 from textFile at NativeMethodAccessorImpl.java:0
zipping images and labels
17/11/13 17:58:49 INFO mapred.FileInputFormat: Total input paths to process : 10
17/11/13 17:58:49 INFO mapred.FileInputFormat: Total input paths to process : 10
2017-11-13 17:58:49,558 INFO (MainThread-4011) Reserving TFSparkNodes 
Traceback (most recent call last):
  File "/home/hadoop/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py", line 70, in <module>
    cluster = TFCluster.run(sc, mnist_dist.map_fun, args, args.cluster_size, num_ps, args.tensorboard, TFCluster.InputMode.SPARK)
  File "/usr/local/lib/python2.7/dist-packages/tensorflowonspark/TFCluster.py", line 214, in run
    assert num_ps < num_executors
AssertionError
17/11/13 17:58:49 INFO spark.SparkContext: Invoking stop() from shutdown hook
17/11/13 17:58:49 INFO server.ServerConnector: Stopped Spark@12f01f6{HTTP/1.1}{0.0.0.0:4040}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@3784da25{/stages/stage/kill,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@51d328ba{/jobs/job/kill,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@30c0db59{/api,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@35c6d82a{/,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@337aa96c{/static,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@a4d77ec{/executors/threadDump/json,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@af1ee4d{/executors/threadDump,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@9d1fd74{/executors/json,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@d068108{/executors,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@73de7c5b{/environment/json,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@562aa132{/environment,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@5203e72{/storage/rdd/json,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@134dbd7d{/storage/rdd,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@76456aa{/storage/json,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@15e8d622{/storage,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@6eaf1a9c{/stages/pool/json,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@41cabeed{/stages/pool,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@683f5429{/stages/stage/json,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@79504d96{/stages/stage,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@53ff96ae{/stages/json,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@1607eb68{/stages,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@59f910ed{/jobs/job/json,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@655c03{/jobs/job,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@1e2b72d4{/jobs/json,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@3ffe09eb{/jobs,null,UNAVAILABLE,@Spark}
17/11/13 17:58:49 INFO ui.SparkUI: Stopped Spark web UI at http://172.16.99.129:4040
17/11/13 17:58:49 INFO cluster.StandaloneSchedulerBackend: Shutting down all executors
17/11/13 17:58:49 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
17/11/13 17:58:49 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/11/13 17:58:49 INFO memory.MemoryStore: MemoryStore cleared
17/11/13 17:58:49 INFO storage.BlockManager: BlockManager stopped
17/11/13 17:58:49 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
17/11/13 17:58:49 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/11/13 17:58:49 INFO spark.SparkContext: Successfully stopped SparkContext
17/11/13 17:58:49 INFO util.ShutdownHookManager: Shutdown hook called
17/11/13 17:58:49 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-11fb52a9-864e-4553-ae30-ca785ac0a9bd
17/11/13 17:58:49 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-11fb52a9-864e-4553-ae30-ca785ac0a9bd/pyspark-0e677390-0522-489d-baa6-fe490584e5ad
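
Reading the traceback, the failure seems to be the "assert num_ps < num_executors" check in TFCluster.run: the example passes --cluster_size through as num_executors, and num_ps is 1 in the stock mnist_spark.py as far as I can tell, so a cluster of size 1 leaves no executor for a worker. Is the right fix then something like the sketch below, where only spark.cores.max and --cluster_size change from my command above?

# sketch only: 2 executors = 1 PS + 1 worker; assumes the worker offers 2 cores
${SPARK_HOME}/bin/spark-submit \
--master spark://zeka-virtual-machine:7077 \
--py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=2 \
--conf spark.task.cpus=1 \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size 2 \
--images examples/mnist/csv/train/images \
--labels examples/mnist/csv/train/labels \
--format csv \
--mode train \
--model mnist_model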

For reference, my Hadoop and Spark configuration is as follows:

Hadoop

  • hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_151
  • core-site.xml
<configuration>
        <property>
             <name>hadoop.tmp.dir</name>
             <value>file:/usr/local/hadoop/tmp</value>
             <description>Abase for other temporary directories.</description>
        </property>
        <property>
             <name>fs.defaultFS</name>
             <value>hdfs://localhost:9000</value>
        </property>
</configuration>
  • hdfs-site.xml
<configuration>
        <property>
             <name>dfs.replication</name>
             <value>1</value>
        </property>
        <property>
             <name>dfs.namenode.name.dir</name>
             <value>file:/usr/local/hadoop/tmp/dfs/name</value>
        </property>
        <property>
             <name>dfs.datanode.data.dir</name>
             <value>file:/usr/local/hadoop/tmp/dfs/data</value>
        </property>
</configuration>
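
The converted CSV does appear on HDFS. Since the --images/--labels flags above are relative paths, I checked that they resolve under my user directory (which should be /user/hadoop given the fs.defaultFS above):

/usr/local/hadoop/bin/hdfs dfs -ls examples/mnist/csv/train/images
/usr/local/hadoop/bin/hdfs dfs -ls examples/mnist/csv/train/labels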

Spark

  • spark-env.sh
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

export HADOOP_HDFS_HOME=/usr/local/hadoop

export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)

export JAVA_HOME=/usr/java/jdk1.8.0_151

export SPARK_MASTER_IP=172.16.99.129
export SPARK_WORKER_MEMORY=2G

export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=1
export SPARK_WORKER_INSTANCES=1

export SPARK_EXECUTOR_INSTANCES=1
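
If a cluster size of 2 is indeed the fix, my single worker also needs to offer two cores (or I need two workers). Is the following tweak to the settings above, plus a restart, the right direction?

# sketch, assuming the defaults above: either raise cores per worker...
export SPARK_WORKER_CORES=2
# ...or run two one-core workers instead:
# export SPARK_WORKER_INSTANCES=2
${SPARK_HOME}/sbin/stop-all.sh
${SPARK_HOME}/sbin/start-all.sh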

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 21 (6 by maintainers)

Most upvoted comments

@djygithub Referring to issue #162: as you said, libhdfs.so was the culprit; after fixing it, my sample finished successfully. BTW, I used TensorFlow 1.3.0 + TensorFlowOnSpark 1.1.0.
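
(For anyone else hitting the libhdfs.so problem: a common way to expose it to the executors is via LD_LIBRARY_PATH. The paths below are assumptions for the /usr/local/hadoop layout in this issue, so verify them first:)

# find the libraries first, e.g.:
#   find /usr/local/hadoop -name 'libhdfs.so*'
#   find "$JAVA_HOME" -name 'libjvm.so*'
export LIB_HDFS=/usr/local/hadoop/lib/native
export LIB_JVM=${JAVA_HOME}/jre/lib/amd64/server
# then add this flag to spark-submit:
#   --conf spark.executorEnv.LD_LIBRARY_PATH="${LIB_HDFS}:${LIB_JVM}"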

Sorry, I meant the mnist_model path, e.g. try /tmp/mnist_model.
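
(A sketch of that change: the spark-submit from the question stays the same except for the final flag:)

# unchanged flags elided; only the checkpoint path moves to an absolute location
--model /tmp/mnist_model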