tensorflow: why is tensorflow.map_fn slow, what is wrong with following code?
I am trying to use tensorflow map_fn to do parallel computation. However it seems to me that the performance gain is not significant.
Here is example code, run with Python 3.6.5 and TensorFlow 1.12.0 on Ubuntu 14.04 LTS with 28 duo cores (Intel® Xeon® CPU E5-2697 v3 @ 2.60GHz) = 56 processors.
The same code took even longer on Amazon AWS SageMaker ml-p3-xlarge: 227 seconds.
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
b'v1.12.0-0-ga6d8ffa' 1.12.0
import numpy as np
import tensorflow as tf
import time
# version 1
tic = time.time()
elems = np.array(range(1,1000000), dtype=np.float64)
output = tf.map_fn(lambda x: x**6 , elems, dtype=tf.float64, parallel_iterations=56)
sess = tf.Session()
res = sess.run(output)
toc = time.time() - tic
print("elapsed=", toc) # 29.47 (seconds)
# version 2
tic = time.time()
elems = np.array(range(1,1000000), dtype=np.float64)
output = tf.map_fn(lambda x: x**6 , elems, dtype=tf.float64, parallel_iterations=56)
n_cpus=28
with tf.Session(
    config=tf.ConfigProto(
        log_device_placement=True,
        device_count={"CPU": n_cpus},
        inter_op_parallelism_threads=n_cpus,
        intra_op_parallelism_threads=1,
    )
) as sess:
    res = sess.run(output)
toc = time.time() - tic
print("elapsed=", toc) # 29.26 (seconds)
# version 3
tic = time.time()
elems = np.array(range(1,1000000), dtype=np.float64)
x6 = [ x**6 for x in elems]
toc = time.time() - tic
print("elapsed time=", toc) # 0.5 seconds
What is the problem with the above code? Without map_fn, the sequential version 3 takes only 0.5 seconds.
About this issue
- State: closed
- Created 5 years ago
- Reactions: 1
- Comments: 40 (5 by maintainers)
Yes, I am experiencing almost the same issue on my side. I don’t want to be rude to the developers who have contributed so much to this tool, but I don’t understand why on earth the only way to parallelize user code is so slow. Does this happen because tf.map_fn or tf.while_loop is not designed for parallelization?
Hello @minhhg, we think that in your conversation with @vidursatija the original query has been clarified. Hence we will close this now. Thanks.
Hi msymp, the issue is not solved. It takes 28 seconds with TensorFlow vs 0.5 seconds with Python. Thanks.
@igormorgado A common pitfall with vectorized_map usage is that the call includes the cost of code vectorization in addition to code execution, so it should typically be placed within a tf.function.
Example:
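The example from this comment is not preserved here. A minimal sketch of the pattern it describes, assuming TF 2.x, might look like the following; the names sum_of_squares and batched_sum_of_squares are hypothetical and chosen only for illustration.

```python
import tensorflow as tf


def sum_of_squares(row):
    # Written as if it handles a single row of the batch.
    return tf.reduce_sum(row * row)


@tf.function
def batched_sum_of_squares(rows):
    # Inside tf.function the vectorization happens once, at trace time,
    # so later calls with the same input signature reuse the traced graph.
    return tf.vectorized_map(sum_of_squares, rows)


rows = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], dtype=tf.float64)
result = batched_sum_of_squares(rows)
print(result.numpy())
```

When timing such a function, the first call includes tracing and vectorization, so it is usually more informative to time a second call.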
For my problem I have to use the tf.boolean_mask function, which can produce vectors of different lengths. This output is further reduced to a single vector. That's the reason I have to use tf.map_fn, since it is possible to compute the tf.boolean_mask for a single element, then reduce it to a scalar, and when the computation is done the scalars get combined into a vector.
vectorized_map does require registering a “converter” for each op that defines how that op is vectorized. Please see the comments for RegisterPFor, which describe how these can be defined, in case you want to contribute these. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/parallel_for/pfor.py#L844
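For a case like the tf.boolean_mask comment above, one possible way to avoid the per-element loop is to zero out the masked-off entries with tf.where and reduce along the row axis. This is only a sketch under a strong assumption: the reduction is taken to be a sum, which the comment does not state, and the trick only carries over to reductions that have a neutral element. The names values, mask and masked_sum_row are hypothetical, and fn_output_signature assumes TF 2.3 or newer.

```python
import tensorflow as tf

# Hypothetical data: two rows, each with its own boolean mask.
values = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
mask = tf.constant([[True, False, True], [False, True, False]])


def masked_sum_row(args):
    # Per-row version: boolean_mask gives a variable-length vector,
    # which is then reduced to a scalar.
    row, row_mask = args
    return tf.reduce_sum(tf.boolean_mask(row, row_mask))


# Loop-based version, one iteration per row.
looped = tf.map_fn(masked_sum_row, (values, mask), fn_output_signature=tf.float32)

# Loop-free version (assumes the reduction is a sum): masked-off
# entries become 0, the neutral element of addition.
vectorized = tf.reduce_sum(tf.where(mask, values, tf.zeros_like(values)), axis=1)

print(looped.numpy(), vectorized.numpy())
```

Both versions return one scalar per row; the second never materializes the variable-length intermediate vectors, so it does not need map_fn at all.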
tf.map_fn was slow for me as well, which is why I was switching to tf.vectorized_map. However, some TensorFlow ops seem not to be supported / vectorized by tf.vectorized_map.
For example:
Which results in execution times of vectorized_map being similar to map_fn.
Any ideas to solve this?
@minhhg @vidursatija
@nikhil1008 It looks like your code is also measuring the “graph construction cost” which is likely unintentional.
Here is my measurement, using TF 2.x:
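The measurement from this comment is not preserved here. As a rough sketch of how execution time might be separated from graph construction in TF 2.x, the helper below makes one warm-up call before starting the clock. The names pow6_map_fn, pow6_elementwise and time_execution are hypothetical, the input is smaller than in the original question to keep the run short, and no timings are claimed.

```python
import time

import numpy as np
import tensorflow as tf

elems = tf.constant(np.arange(1, 10001, dtype=np.float64))


@tf.function
def pow6_map_fn(x):
    # One loop iteration per element, as in the original question.
    return tf.map_fn(lambda v: v ** 6, x, parallel_iterations=56)


@tf.function
def pow6_elementwise(x):
    # A single element-wise op over the whole tensor, for comparison.
    return x ** 6


def time_execution(fn, x):
    """Return (result, seconds) for one call, excluding tracing."""
    fn(x)  # warm-up call: traces the function and builds the graph
    tic = time.perf_counter()
    result = fn(x).numpy()  # .numpy() waits for the computation to finish
    return result, time.perf_counter() - tic


map_result, map_seconds = time_execution(pow6_map_fn, elems)
elementwise_result, elementwise_seconds = time_execution(pow6_elementwise, elems)
print("map_fn elapsed=", map_seconds)
print("element-wise elapsed=", elementwise_seconds)
```

In the TF 1.x code from the question, the equivalent change would be to start the timer after the graph and session have been created, immediately before sess.run.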
I had the same issues. Vectorized map works faster but is still subpar compared with NumPy alternatives. Using datasets of the same size, numpy.apply_along_axis takes 50 microseconds (over a NumPy array) while tf.vectorized_map takes 25 ms over a tensor (NumPy is 500x faster). Compared with tf.map_fn over a tensor, things get even worse, since it takes 300 ms (NumPy is 6000x faster).
Whoever is coming here, here are some of the pitfalls to be aware of while using vectorized_map: https://www.tensorflow.org/api_docs/python/tf/vectorized_map