TensorFlowOnSpark: MNIST demo blocks when running on Spark on YARN

When I run the TensorFlowOnSpark MNIST demo on YARN, the job blocks (hangs) with no error message. How can I fix this? Environment: CentOS 7.0, Spark 2.1.0, TensorFlow 1.0.0, Python 2.7.5, JDK 1.7.

submit script:

${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 4 \
--queue default \
--executor-memory 20G \
--py-files tfspark.zip,TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
--images mnist/csv/train/images \
--labels mnist/csv/train/labels \
--mode train \
--model mnist_model

log output from the blocked tasks:

logfile: container_1486634114148_0042_01_000004/hadp/stderr/?start=0
-----------------------------------------
17/02/21 03:25:15 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 9
17/02/21 03:25:15 INFO executor.Executor: Running task 1.0 in stage 2.0 (TID 9)
17/02/21 03:25:15 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 4
17/02/21 03:25:15 INFO memory.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 6.8 KB, free 10.5 GB)
17/02/21 03:25:15 INFO broadcast.TorrentBroadcast: Reading broadcast variable 4 took 16 ms
17/02/21 03:25:15 INFO memory.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 13.8 KB, free 10.5 GB)
17/02/21 03:25:16 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/images/part-00001:0+11231804
17/02/21 03:25:16 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
17/02/21 03:25:16 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.9 KB, free 10.5 GB)
17/02/21 03:25:16 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 16 ms
17/02/21 03:25:16 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 297.3 KB, free 10.5 GB)
17/02/21 03:25:16 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
17/02/21 03:25:16 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 40b332285be4d4a9ec95a02694b20211810cf4ec]
17/02/21 03:25:16 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/02/21 03:25:16 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/02/21 03:25:16 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/02/21 03:25:16 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/02/21 03:25:16 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/02/21 03:25:17 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/labels/part-00001:0+245760
17/02/21 03:25:17 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1
17/02/21 03:25:17 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 23.9 KB, free 10.5 GB)
17/02/21 03:25:17 INFO broadcast.TorrentBroadcast: Reading broadcast variable 1 took 17 ms
17/02/21 03:25:17 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 297.3 KB, free 10.5 GB)
2017-02-21 03:25:17,378 INFO (MainThread-13113) Connected to TFSparkNode.mgr on slave2, ppid=12956, state='running'
2017-02-21 03:25:17,384 INFO (MainThread-13113) mgr.state='running'
2017-02-21 03:25:17,384 INFO (MainThread-13113) Feeding partition <itertools.chain object at 0x2ae71d0> into input queue <multiprocessing.queues.JoinableQueue object at 0x2824b90>
17/02/21 03:25:21 INFO python.PythonRunner: Times: total = 4363, boot = -9675, init = 9762, finish = 4276
17/02/21 03:25:21 INFO python.PythonRunner: Times: total = 109, boot = 4, init = 28, finish = 77
logfile: container_1486634114148_0042_01_000002/hadp/stderr/?start=0
-----------------------------------------
17/02/21 03:18:43 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/02/21 03:18:43 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(1, slave1, 33036, None)
17/02/21 03:18:43 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(1, slave1, 33036, None)
17/02/21 03:18:43 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(1, slave1, 33036, None)
17/02/21 03:18:44 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 0
17/02/21 03:18:44 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)slave1
17/02/21 03:18:44 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 2
17/02/21 03:18:44 INFO client.TransportClientFactory: Successfully created connection to /172.16.165.28:45922 after 2 ms (0 ms spent in bootstraps)
17/02/21 03:18:44 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 4.3 KB, free 10.5 GB)
17/02/21 03:18:44 INFO broadcast.TorrentBroadcast: Reading broadcast variable 2 took 161 ms
17/02/21 03:18:44 INFO memory.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 6.0 KB, free 10.5 GB)
2017-02-21 03:18:55,776 INFO (MainThread-11159) TFSparkNode.reserve: {'authkey': UUID('28cb10c4-7498-4ca8-89ac-b8a604f6048c'), 'worker_num': 0, 'host': 'slave1', 'tb_port': 0, 'addr': ('slave1', 41678), 'ppid': 11123, 'task_index': 0, 'job_name': 'ps', 'tb_pid': 0, 'port': 44860}
17/02/21 03:18:55 INFO python.PythonRunner: Times: total = 11122, boot = 689, init = 76, finish = 10357
17/02/21 03:18:55 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 2473 bytes result sent to driver
17/02/21 03:18:55 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 7
17/02/21 03:18:55 INFO executor.Executor: Running task 3.0 in stage 1.0 (TID 7)
17/02/21 03:18:55 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
17/02/21 03:18:56 INFO memory.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 6.0 KB, free 10.5 GB)
17/02/21 03:18:56 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 17 ms
17/02/21 03:18:56 INFO memory.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 9.4 KB, free 10.5 GB)
2017-02-21 03:18:56,066 INFO (MainThread-11159) node: {'addr': ('slave1', 41678), 'task_index': 0, 'job_name': 'ps', 'authkey': UUID('28cb10c4-7498-4ca8-89ac-b8a604f6048c'), 'worker_num': 0, 'host': 'slave1', 'ppid': 11123, 'port': 44860, 'tb_pid': 0, 'tb_port': 0}
2017-02-21 03:18:56,066 INFO (MainThread-11159) node: {'addr': '/tmp/pymp-uwrScZ/listener-T6UoZM', 'task_index': 0, 'job_name': 'worker', 'authkey': UUID('e0d78818-5d92-491a-ab25-22c90992b165'), 'worker_num': 1, 'host': 'slave2', 'ppid': 12956, 'port': 48957, 'tb_pid': 0, 'tb_port': 0}
2017-02-21 03:18:56,067 INFO (MainThread-11159) node: {'addr': '/tmp/pymp-IKhTmR/listener-oOgQs9', 'task_index': 1, 'job_name': 'worker', 'authkey': UUID('c8c444ab-5c1d-420d-aea6-f12a7addeafa'), 'worker_num': 2, 'host': 'slave3', 'ppid': 7641, 'port': 43524, 'tb_pid': 0, 'tb_port': 0}
2017-02-21 03:18:56,067 INFO (MainThread-11159) node: {'addr': '/tmp/pymp-yW3koh/listener-gbHUUU', 'task_index': 2, 'job_name': 'worker', 'authkey': UUID('27b2c855-fe34-47b2-9638-84d5844cb47d'), 'worker_num': 3, 'host': 'slave4', 'ppid': 141447, 'port': 41709, 'tb_pid': 0, 'tb_port': 0}
2017-02-21 03:18:56,317 INFO (MainThread-11159) Connected to TFSparkNode.mgr on slave1, ppid=11123, state='running'
2017-02-21 03:18:56,325 INFO (MainThread-11159) Starting TensorFlow ps:0 on cluster node 0 on background thread
2017-02-21 03:19:02,712 INFO (Thread-1-11159) 0: ======== ps:0 ========
2017-02-21 03:19:02,712 INFO (Thread-1-11159) 0: Cluster spec: {'ps': ['slave1:44860'], 'worker': ['slave2:48957', 'slave3:43524', 'slave4:41709']}
2017-02-21 03:19:02,712 INFO (Thread-1-11159) 0: Using CPU
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job ps -> {0 -> localhost:44860}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> slave2:48957, 1 -> slave3:43524, 2 -> slave4:41709}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:221] Started server with target: grpc://localhost:44860
logfile: container_1486634114148_0042_01_000005/hadp/stderr/?start=0
-----------------------------------------
17/02/21 11:31:43 INFO executor.Executor: Finished task 2.0 in stage 1.0 (TID 6). 2189 bytes result sent to driver
2017-02-21 11:31:44,057 INFO (Thread-1-141481) 3: ======== worker:2 ========
2017-02-21 11:31:44,057 INFO (Thread-1-141481) 3: Cluster spec: {'ps': ['slave1:44860'], 'worker': ['slave2:48957', 'slave3:43524', 'slave4:41709']}
2017-02-21 11:31:44,057 INFO (Thread-1-141481) 3: Using CPU
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job ps -> {0 -> slave1:44860}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> slave2:48957, 1 -> slave3:43524, 2 -> localhost:41709}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:221] Started server with target: grpc://localhost:41709
tensorflow model path: hdfs://172.16.141.90:8020/user/hadp/mnist_model
17/02/21 11:31:48 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 8
17/02/21 11:31:48 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 8)
17/02/21 11:31:48 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 4
17/02/21 11:31:48 INFO memory.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 6.8 KB, free 10.5 GB)
17/02/21 11:31:48 INFO broadcast.TorrentBroadcast: Reading broadcast variable 4 took 16 ms
17/02/21 11:31:48 INFO memory.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 13.8 KB, free 10.5 GB)
17/02/21 11:31:48 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/images/part-00000:0+9338236
17/02/21 11:31:48 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
17/02/21 11:31:48 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.9 KB, free 10.5 GB)
17/02/21 11:31:48 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 16 ms
17/02/21 11:31:48 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 297.3 KB, free 10.5 GB)
17/02/21 11:31:48 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
17/02/21 11:31:48 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 40b332285be4d4a9ec95a02694b20211810cf4ec]
17/02/21 11:31:48 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/02/21 11:31:48 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/02/21 11:31:48 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/02/21 11:31:48 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/02/21 11:31:48 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/02/21 11:31:49 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/labels/part-00000:0+204800
17/02/21 11:31:49 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1
17/02/21 11:31:49 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 23.9 KB, free 10.5 GB)
17/02/21 11:31:49 INFO broadcast.TorrentBroadcast: Reading broadcast variable 1 took 18 ms
17/02/21 11:31:49 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 297.3 KB, free 10.5 GB)
2017-02-21 11:31:49,476 INFO (MainThread-141591) Connected to TFSparkNode.mgr on slave4, ppid=141447, state='running'
2017-02-21 11:31:49,481 INFO (MainThread-141591) mgr.state='running'
2017-02-21 11:31:49,482 INFO (MainThread-141591) Feeding partition <itertools.chain object at 0x35c8250> into input queue <multiprocessing.queues.JoinableQueue object at 0x3343c50>
I tensorflow/core/distributed_runtime/master_session.cc:1012] Start master session 2752662a91d29de2 with config: 

INFO:tensorflow:Waiting for model to be ready.  Ready_for_local_init_op:  None, ready: Variables not initialized: hid_w, hid_b, sm_w, sm_b, Variable, hid_w/Adagrad, hid_b/Adagrad, sm_w/Adagrad, sm_b/Adagrad
2017-02-21 11:31:50,141 INFO (Thread-1-141481) Waiting for model to be ready.  Ready_for_local_init_op:  None, ready: Variables not initialized: hid_w, hid_b, sm_w, sm_b, Variable, hid_w/Adagrad, hid_b/Adagrad, sm_w/Adagrad, sm_b/Adagrad
17/02/21 11:31:51 INFO python.PythonRunner: Times: total = 1967, boot = -5694, init = 5778, finish = 1883
17/02/21 11:31:51 INFO python.PythonRunner: Times: total = 115, boot = 3, init = 40, finish = 72
I tensorflow/core/distributed_runtime/master_session.cc:1012] Start master session 09ff1aeacc016eab with config: 

2017-02-21T11:32:20.294690 session ready
2017-02-21T11:32:20.364734 step: 0 accuracy: 0.589999973774
2017-02-21 11:32:22,888 INFO (MainThread-141591) Processed 5120 items in partition
17/02/21 11:32:22 INFO python.PythonRunner: Times: total = 33461, boot = 3, init = 38, finish = 33420
17/02/21 11:32:22 INFO executor.Executor: Finished task 0.0 in stage 2.0 (TID 8). 2397 bytes result sent to driver
17/02/21 11:32:22 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 11
17/02/21 11:32:22 INFO executor.Executor: Running task 3.0 in stage 2.0 (TID 11)
17/02/21 11:32:22 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/images/part-00003:0+11226100
17/02/21 11:32:23 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/labels/part-00003:0+245760
2017-02-21 11:32:23,155 INFO (MainThread-141591) Connected to TFSparkNode.mgr on slave4, ppid=141447, state='running'
2017-02-21 11:32:23,161 INFO (MainThread-141591) mgr.state='running'
2017-02-21 11:32:23,162 INFO (MainThread-141591) Feeding partition <itertools.chain object at 0x35d1450> into input queue <multiprocessing.queues.JoinableQueue object at 0x3343c50>
17/02/21 11:32:25 INFO python.PythonRunner: Times: total = 2574, boot = -31926, init = 31933, finish = 2567
17/02/21 11:32:25 INFO python.PythonRunner: Times: total = 85, boot = -33619, init = 33625, finish = 79
2017-02-21 11:32:28,572 INFO (MainThread-141591) Processed 6144 items in partition
17/02/21 11:32:28 INFO python.PythonRunner: Times: total = 5443, boot = -227, init = 245, finish = 5425
17/02/21 11:32:28 INFO executor.Executor: Finished task 3.0 in stage 2.0 (TID 11). 1652 bytes result sent to driver
17/02/21 11:32:28 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 13
17/02/21 11:32:28 INFO executor.Executor: Running task 5.0 in stage 2.0 (TID 13)
17/02/21 11:32:28 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/images/part-00005:0+11173834
17/02/21 11:32:28 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/labels/part-00005:0+245760
2017-02-21 11:32:28,803 INFO (MainThread-141591) Connected to TFSparkNode.mgr on slave4, ppid=141447, state='running'
2017-02-21 11:32:28,808 INFO (MainThread-141591) mgr.state='running'
2017-02-21 11:32:28,808 INFO (MainThread-141591) Feeding partition <itertools.chain object at 0x35d1250> into input queue <multiprocessing.queues.JoinableQueue object at 0x3343c50>
17/02/21 11:32:31 INFO python.PythonRunner: Times: total = 2647, boot = -3039, init = 3045, finish = 2641
17/02/21 11:32:31 INFO python.PythonRunner: Times: total = 88, boot = -5557, init = 5563, finish = 82
2017-02-21 11:32:34,287 INFO (MainThread-141591) Processed 6144 items in partition
17/02/21 11:32:34 INFO python.PythonRunner: Times: total = 5508, boot = -193, init = 210, finish = 5491
17/02/21 11:32:34 INFO executor.Executor: Finished task 5.0 in stage 2.0 (TID 13). 1725 bytes result sent to driver
17/02/21 11:32:34 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 15
17/02/21 11:32:34 INFO executor.Executor: Running task 7.0 in stage 2.0 (TID 15)
17/02/21 11:32:34 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/images/part-00007:0+11201024
17/02/21 11:32:34 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/labels/part-00007:0+245760
2017-02-21 11:32:34,506 INFO (MainThread-141591) Connected to TFSparkNode.mgr on slave4, ppid=141447, state='running'
2017-02-21 11:32:34,512 INFO (MainThread-141591) mgr.state='running'
2017-02-21 11:32:34,513 INFO (MainThread-141591) Feeding partition <itertools.chain object at 0x35d1250> into input queue <multiprocessing.queues.JoinableQueue object at 0x3343c50>
17/02/21 11:32:36 INFO python.PythonRunner: Times: total = 2587, boot = -3012, init = 3017, finish = 2582
17/02/21 11:32:36 INFO python.PythonRunner: Times: total = 87, boot = -5609, init = 5614, finish = 82
2017-02-21T11:32:38.988017 step: 400 accuracy: 0.930000007153
2017-02-21 11:32:39,896 INFO (MainThread-141591) Processed 6144 items in partition
17/02/21 11:32:39 INFO python.PythonRunner: Times: total = 5417, boot = -176, init = 193, finish = 5400
17/02/21 11:32:39 INFO executor.Executor: Finished task 7.0 in stage 2.0 (TID 15). 1652 bytes result sent to driver
17/02/21 11:32:39 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 17
17/02/21 11:32:39 INFO executor.Executor: Running task 9.0 in stage 2.0 (TID 17)
17/02/21 11:32:39 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/images/part-00009:0+10449019
17/02/21 11:32:40 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/labels/part-00009:0+229120
2017-02-21 11:32:40,126 INFO (MainThread-141591) Connected to TFSparkNode.mgr on slave4, ppid=141447, state='running'
2017-02-21 11:32:40,132 INFO (MainThread-141591) mgr.state='running'
2017-02-21 11:32:40,132 INFO (MainThread-141591) Feeding partition <itertools.chain object at 0x35d1250> into input queue <multiprocessing.queues.JoinableQueue object at 0x3343c50>
17/02/21 11:32:42 INFO python.PythonRunner: Times: total = 2501, boot = -2971, init = 2977, finish = 2495
17/02/21 11:32:42 INFO python.PythonRunner: Times: total = 85, boot = -5524, init = 5530, finish = 79
2017-02-21T11:32:43.592701 step: 500 accuracy: 0.939999997616
2017-02-21 11:32:45,197 INFO (MainThread-141591) Processed 5728 items in partition
17/02/21 11:32:45 INFO python.PythonRunner: Times: total = 5097, boot = -189, init = 209, finish = 5077
17/02/21 11:32:45 INFO executor.Executor: Finished task 9.0 in stage 2.0 (TID 17). 1739 bytes result sent to driver
logfile: container_1486634114148_0042_01_000003/hadp/stderr/?start=0
-----------------------------------------
2017-02-21 03:17:50,090 INFO (Thread-1-7675) 2: ======== worker:1 ========
2017-02-21 03:17:50,090 INFO (Thread-1-7675) 2: Cluster spec: {'ps': ['slave1:44860'], 'worker': ['slave2:48957', 'slave3:43524', 'slave4:41709']}
2017-02-21 03:17:50,090 INFO (Thread-1-7675) 2: Using CPU
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job ps -> {0 -> slave1:44860}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> slave2:48957, 1 -> localhost:43524, 2 -> slave4:41709}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:221] Started server with target: grpc://localhost:43524
tensorflow model path: hdfs://172.16.141.90:8020/user/hadp/mnist_model
I tensorflow/core/distributed_runtime/master_session.cc:1012] Start master session 3f363c8a0a437ec4 with config: 

INFO:tensorflow:Waiting for model to be ready.  Ready_for_local_init_op:  None, ready: Variables not initialized: hid_w, hid_b, sm_w, sm_b, Variable, hid_w/Adagrad, hid_b/Adagrad, sm_w/Adagrad, sm_b/Adagrad
2017-02-21 03:17:56,292 INFO (Thread-1-7675) Waiting for model to be ready.  Ready_for_local_init_op:  None, ready: Variables not initialized: hid_w, hid_b, sm_w, sm_b, Variable, hid_w/Adagrad, hid_b/Adagrad, sm_w/Adagrad, sm_b/Adagrad
17/02/21 03:17:57 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 10
17/02/21 03:17:57 INFO executor.Executor: Running task 2.0 in stage 2.0 (TID 10)
17/02/21 03:17:57 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 4
17/02/21 03:17:57 INFO memory.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 6.8 KB, free 10.5 GB)
17/02/21 03:17:57 INFO broadcast.TorrentBroadcast: Reading broadcast variable 4 took 17 ms
17/02/21 03:17:57 INFO memory.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 13.8 KB, free 10.5 GB)
17/02/21 03:17:57 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/images/part-00002:0+11214784
17/02/21 03:17:57 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
17/02/21 03:17:57 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.9 KB, free 10.5 GB)
17/02/21 03:17:57 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 16 ms
17/02/21 03:17:57 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 297.3 KB, free 10.5 GB)
17/02/21 03:17:58 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
17/02/21 03:17:58 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 40b332285be4d4a9ec95a02694b20211810cf4ec]
17/02/21 03:17:58 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/02/21 03:17:58 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/02/21 03:17:58 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/02/21 03:17:58 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/02/21 03:17:58 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/02/21 03:17:58 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/labels/part-00002:0+245760
17/02/21 03:17:58 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1
17/02/21 03:17:58 INFO client.TransportClientFactory: Successfully created connection to slave4/172.16.141.91:7812 after 2 ms (0 ms spent in bootstraps)
17/02/21 03:17:58 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 23.9 KB, free 10.5 GB)
17/02/21 03:17:58 INFO broadcast.TorrentBroadcast: Reading broadcast variable 1 took 50 ms
17/02/21 03:17:58 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 297.3 KB, free 10.5 GB)
2017-02-21 03:17:59,006 INFO (MainThread-7786) Connected to TFSparkNode.mgr on slave3, ppid=7641, state='running'
2017-02-21 03:17:59,010 INFO (MainThread-7786) mgr.state='running'
2017-02-21 03:17:59,010 INFO (MainThread-7786) Feeding partition <itertools.chain object at 0x178f390> into input queue <multiprocessing.queues.JoinableQueue object at 0x1e8ed50>
17/02/21 03:18:00 INFO python.PythonRunner: Times: total = 2443, boot = -9670, init = 9736, finish = 2377
17/02/21 03:18:00 INFO python.PythonRunner: Times: total = 143, boot = 3, init = 48, finish = 92
I tensorflow/core/distributed_runtime/master_session.cc:1012] Start master session 6fe3cf85083c4718 with config: 

2017-02-21T03:18:26.451959 session ready
2017-02-21T03:18:29.097871 step: 100 accuracy: 0.949999988079
2017-02-21 03:18:29,681 INFO (MainThread-7786) Processed 6144 items in partition
17/02/21 03:18:29 INFO python.PythonRunner: Times: total = 30754, boot = 4, init = 67, finish = 30683
17/02/21 03:18:29 INFO executor.Executor: Finished task 2.0 in stage 2.0 (TID 10). 2397 bytes result sent to driver
17/02/21 03:18:31 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 12
17/02/21 03:18:31 INFO executor.Executor: Running task 4.0 in stage 2.0 (TID 12)
17/02/21 03:18:31 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/images/part-00004:0+11212767
17/02/21 03:18:31 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/labels/part-00004:0+245760
2017-02-21 03:18:31,804 INFO (MainThread-7786) Connected to TFSparkNode.mgr on slave3, ppid=7641, state='running'
2017-02-21 03:18:31,809 INFO (MainThread-7786) mgr.state='running'
2017-02-21 03:18:31,809 INFO (MainThread-7786) Feeding partition <itertools.chain object at 0x1796590> into input queue <multiprocessing.queues.JoinableQueue object at 0x1e8ed50>
17/02/21 03:18:34 INFO python.PythonRunner: Times: total = 2601, boot = -30602, init = 30610, finish = 2593
17/02/21 03:18:34 INFO python.PythonRunner: Times: total = 88, boot = -32745, init = 32752, finish = 81
2017-02-21T03:18:35.628594 step: 200 accuracy: 0.949999988079
2017-02-21 03:18:37,445 INFO (MainThread-7786) Processed 6144 items in partition
17/02/21 03:18:37 INFO python.PythonRunner: Times: total = 5666, boot = -2082, init = 2100, finish = 5648
17/02/21 03:18:37 INFO executor.Executor: Finished task 4.0 in stage 2.0 (TID 12). 1652 bytes result sent to driver
17/02/21 03:18:37 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 14
17/02/21 03:18:37 INFO executor.Executor: Running task 6.0 in stage 2.0 (TID 14)
17/02/21 03:18:37 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/images/part-00006:0+11214285
17/02/21 03:18:37 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/labels/part-00006:0+245760
2017-02-21 03:18:37,686 INFO (MainThread-7786) Connected to TFSparkNode.mgr on slave3, ppid=7641, state='running'
2017-02-21 03:18:37,691 INFO (MainThread-7786) mgr.state='running'
2017-02-21 03:18:37,692 INFO (MainThread-7786) Feeding partition <itertools.chain object at 0x1796390> into input queue <multiprocessing.queues.JoinableQueue object at 0x1e8ed50>
17/02/21 03:18:40 INFO python.PythonRunner: Times: total = 2705, boot = -3210, init = 3218, finish = 2697
17/02/21 03:18:40 INFO python.PythonRunner: Times: total = 87, boot = -5775, init = 5783, finish = 79
2017-02-21T03:18:40.367697 step: 300 accuracy: 0.949999988079
2017-02-21 03:18:43,355 INFO (MainThread-7786) Processed 6144 items in partition
17/02/21 03:18:43 INFO python.PythonRunner: Times: total = 5685, boot = -212, init = 220, finish = 5677
17/02/21 03:18:43 INFO executor.Executor: Finished task 6.0 in stage 2.0 (TID 14). 1725 bytes result sent to driver
17/02/21 03:18:43 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 16
17/02/21 03:18:43 INFO executor.Executor: Running task 8.0 in stage 2.0 (TID 16)
17/02/21 03:18:43 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/images/part-00008:0+11194141
17/02/21 03:18:43 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/labels/part-00008:0+245760
2017-02-21 03:18:43,580 INFO (MainThread-7786) Connected to TFSparkNode.mgr on slave3, ppid=7641, state='running'
2017-02-21 03:18:43,584 INFO (MainThread-7786) mgr.state='running'
2017-02-21 03:18:43,585 INFO (MainThread-7786) Feeding partition <itertools.chain object at 0x1796390> into input queue <multiprocessing.queues.JoinableQueue object at 0x1e8ed50>
17/02/21 03:18:45 INFO python.PythonRunner: Times: total = 2595, boot = -3124, init = 3131, finish = 2588
17/02/21 03:18:45 INFO python.PythonRunner: Times: total = 88, boot = -5811, init = 5819, finish = 80
2017-02-21T03:18:48.998325 step: 500 accuracy: 0.939999997616
2017-02-21 03:18:49,194 INFO (MainThread-7786) Processed 6144 items in partition
17/02/21 03:18:49 INFO python.PythonRunner: Times: total = 5639, boot = -186, init = 204, finish = 5621
17/02/21 03:18:49 INFO executor.Executor: Finished task 8.0 in stage 2.0 (TID 16). 1652 bytes result sent to driver


About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 60 (32 by maintainers)

Most upvoted comments

I do see a similar issue with Spark 2.x on YARN (not sure if it is the same since I see slightly different symptoms). It appears to be related to the “worker:0” node attempting to save the model checkpoint into HDFS. Note that Spark 1.6.x on YARN and Spark 2.x Standalone Mode both do not have this issue, so it appears to be due to the combination of Spark 2.x and YARN/HDFS.

To see if you have a similar problem, you can try changing this line of the example to logdir=None, as sketched below. This disables checkpoint saving, but the worker should then complete without issue. If you see anything different, you may be hitting a different problem.
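A minimal sketch of the kind of change being suggested, assuming the example builds its training session with tf.train.Supervisor (the surrounding variable names here are illustrative placeholders, not the literal contents of mnist_dist.py):

import tensorflow as tf

task_index = 0  # placeholder; in the real job this comes from the TF cluster spec

global_step = tf.Variable(0, name="global_step", trainable=False)

# With logdir=None the Supervisor skips writing checkpoints and summaries,
# which sidesteps the hang seen on worker:0 when saving to HDFS under
# Spark 2.x + YARN. The training loop itself is unaffected.
sv = tf.train.Supervisor(is_chief=(task_index == 0),
                         logdir=None,
                         global_step=global_step)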

@Yiran-wu As @leewyang said, change this line of the example to be logdir=None.

Hi all, I believe I have a fix for this Spark 2.0 + HDFS issue in PR #35.

Please try it out.

FWIW, there were two possible causes of a hang in the MNIST training example:

  1. Apparently, there was some change in Spark 2.x that now requires expanding the Hadoop classpath for JNI (TensorFlow's libhdfs reaches the Hadoop Java classes through JNI); see the sketch after this list.
  2. There was a slight chance of a race condition when a node invoked terminate() while Spark was still feeding an RDD partition (i.e. if the terminate logic detected Queue.Empty before the partition had been fully fed into the queue).
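For cause 1, a commonly used workaround (a sketch assuming a typical Hadoop layout; the native-library paths below are assumptions and vary by cluster) is to expand the Hadoop classpath and expose libhdfs/libjvm to the executors via extra flags on the submit script from the question:

# Assumed locations of the native libraries -- adjust for your cluster:
export LIB_HDFS=/opt/cloudera/parcels/CDH/lib64        # directory containing libhdfs.so (assumption)
export LIB_JVM=${JAVA_HOME}/jre/lib/amd64/server       # directory containing libjvm.so

# Added to the spark-submit command shown in the question:
--conf spark.executorEnv.CLASSPATH="$(hadoop classpath --glob)"
--conf spark.executorEnv.LD_LIBRARY_PATH="${LIB_HDFS}:${LIB_JVM}"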

Keep in mind that this still has the problem of not being able to save the checkpoint/model to HDFS (which may impact other examples as well)… but at least it confirms that we’re all seeing the same issue w.r.t. Spark 2.0 + HDFS.

@zxsimple I resolved it by reverting the changes made in https://github.com/yahoo/TensorFlowOnSpark/pull/49/files.

It is an internal cluster setup issue on our side that we need to work on. The communication was going back to the container running the ApplicationMaster, which printed “waiting on reservations” in a loop. Once I get that resolved, I will pull in the latest changes from here again.

@leewyang @dogVgod @Yiran-wu It works for me after adding the following to the submit command:

--conf spark.executorEnv.LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH/lib64:$JAVA_HOME/jre/lib/amd64/server"
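For context, here is one way that flag slots into the submit script from the question (everything else unchanged; the /opt/cloudera/parcels/CDH/lib64 path comes from the commenter's CDH cluster and will differ elsewhere):

${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 4 \
--queue default \
--executor-memory 20G \
--py-files tfspark.zip,TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.executorEnv.LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH/lib64:$JAVA_HOME/jre/lib/amd64/server" \
TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
--images mnist/csv/train/images \
--labels mnist/csv/train/labels \
--mode train \
--model mnist_model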