tensorflow: why is tensorflow.map_fn slow, what is wrong with the following code?

I am trying to use TensorFlow's map_fn for parallel computation. However, the performance gain does not seem significant.

Here is example code, run with Python 3.6.5 and TensorFlow 1.12.0 on Ubuntu 14.04 LTS, 28 duo cores (Intel® Xeon® CPU E5-2697 v3 @ 2.60GHz) = 56 processors.

The same code took even longer on Amazon AWS SageMaker ml-p3-xlarge: 227 seconds.

python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
b'v1.12.0-0-ga6d8ffa' 1.12.0

import numpy as np
import tensorflow as tf
import time
# version 1
tic = time.time()
elems = np.array(range(1,1000000), dtype=np.float64)
output = tf.map_fn(lambda x: x**6 , elems, dtype=tf.float64,  parallel_iterations=56)
sess = tf.Session()

res = sess.run(output)
toc = time.time() - tic
print("elapsed=", toc)  # 29.47 (seconds)

# version 2
tic = time.time()
elems = np.array(range(1,1000000), dtype=np.float64)
output = tf.map_fn(lambda x: x**6 , elems, dtype=tf.float64,   parallel_iterations=56)
n_cpus=28


with tf.Session(
        config=tf.ConfigProto(
            log_device_placement=True,
            device_count={"CPU": n_cpus},
            inter_op_parallelism_threads=n_cpus,
            intra_op_parallelism_threads=1,
        )) as sess:
    res = sess.run(output)

toc = time.time() - tic
print("elapsed=", toc)  # 29.26 (seconds)

# version 3
tic = time.time()
elems = np.array(range(1,1000000), dtype=np.float64)
x6 = [ x**6 for x in elems]
toc = time.time() - tic
print("elapsed time=", toc) # 0.5 seconds

What is the problem with the code above? Without map_fn, the sequential version 3 takes only 0.5 seconds.

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 40 (5 by maintainers)

Most upvoted comments

Yes. I am experiencing almost the same issue on my side. I don't want to be rude to the developers who have contributed so much to this tool, but I don't understand why the only way to parallelize user code is so slow. Does this happen because tf.map_fn or tf.while_loop is not designed for parallelization?

Hello @minhhg , We think in your conversation with @vidursatija , the original query has been clarified. Hence we will close this now. Thanks.

Hi msymp, The issue is not solved. It is 28 (seconds) with tensorflow vs 0.5 (seconds) with python. Thanks.

@igormorgado A common pitfall with vectorized_map is that each call includes the cost of vectorizing the code in addition to executing it, so it should typically be placed within a tf.function.

Example:

import numpy as np
import tensorflow as tf

def f(x):
  return tf.vectorized_map(lambda z: tf.math.log(z), x)

inp = tf.random.uniform([200, 200])
f(inp)  # warmup

%timeit f(inp) # Slow call since it vectorizes code in each call!
-> 10 loops, best of 3: 28.9 ms per loop

compiled_f = tf.function(f)
compiled_f(inp)  # warmup
%timeit compiled_f(inp)  # Vectorization process is done once

-> 1000 loops, best of 3: 328 µs per loop

NumPy:
def np_f(x): 
  return np.apply_along_axis(lambda z: np.log(z), 0, arr=x)

np_inp = inp.numpy()
np_f(np_inp)  # warmup
%timeit np_f(np_inp)
-> 1000 loops, best of 3: 961 µs per loop


np.log(np_inp)  # warmup
%timeit(np.log(np_inp))  # Manually vectorized NumPy
-> 1000 loops, best of 3: 249 µs per loop

I experience the same problem with tf.map_fn being slow. Is there any way to accelerate this function?

For me, the best solution is to avoid using it. Use vectorization as much as possible.
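
As a minimal sketch of what that means for the x**6 example from the original question: the operation is elementwise, so it can be applied to the whole array in one call instead of once per element. The sketch uses NumPy only; in TensorFlow the analogous expression would be elems ** 6 on a tensor, or tf.pow(elems, 6). Timings are printed rather than asserted because they depend on the machine.

```python
import time
import numpy as np

elems = np.arange(1, 1000000, dtype=np.float64)

# Python-level loop: one interpreter round trip per element.
tic = time.perf_counter()
looped = np.array([x**6 for x in elems])
loop_seconds = time.perf_counter() - tic

# Vectorized: a single call over the whole array.
tic = time.perf_counter()
vectorized = elems ** 6
vectorized_seconds = time.perf_counter() - tic

print("loop:", loop_seconds, "vectorized:", vectorized_seconds)
```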

For my problem I have to use the tf.boolean_mask function, which can produce vectors of different lengths. This output is further reduced to a single vector. That's the reason I have to use tf.map_fn: it lets me compute tf.boolean_mask for a single element and reduce it to a scalar, and when the computation is done the scalars are combined into a vector.
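
One way this pattern can sometimes be vectorized, assuming the reduction is a sum (or another reduction with a neutral element), is to replace the masked-out entries with that neutral element instead of removing them. The shape then stays rectangular and a single reduction along the row axis suffices. The sketch below uses NumPy with made-up data; the TensorFlow analog is noted in a comment and has not been checked against the commenter's actual code.

```python
import numpy as np

values = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
mask = np.array([[True, False, True],
                 [False, True, False]])

# Per-row masking yields rows of different lengths, hence the explicit loop.
per_row = np.array([values[i][mask[i]].sum() for i in range(len(values))])

# Vectorized: zero out masked-off entries, then reduce along the row axis.
# TensorFlow analog: tf.reduce_sum(tf.where(mask, values, tf.zeros_like(values)), axis=1)
masked_sum = np.where(mask, values, 0.0).sum(axis=1)

print(per_row, masked_sum)
```

This does not apply when the reduction has no neutral element or when the masked values themselves are needed afterwards.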

vectorized_map does require registering a “converter” for each op that defines how that op is vectorized. Please see comments for RegisterPFor that describes how these can be defined, in case you want to contribute these. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/parallel_for/pfor.py#L844

tf.map_fn was slow for me as well, which is why I was switching to tf.vectorized_map. However, some TensorFlow ops do not seem to be supported / vectorized by tf.vectorized_map.

For example:

dataset = tf.data.Dataset.from_tensor_slices([b'\x00\x00\x00\x00']*512)

def f(x):
    x1 = tf.strings.substr(x, 0, 2)
    x2 = tf.strings.substr(x, 2, 2)

    return tf.io.decode_raw(x1, tf.uint8), tf.io.decode_raw(x2, tf.uint8)

dataset = dataset.batch(32).map(lambda x: tf.vectorized_map(f, x))
WARNING:tensorflow:Using a while_loop for converting Substr
WARNING:tensorflow:Using a while_loop for converting DecodeRaw
WARNING:tensorflow:Using a while_loop for converting Substr
WARNING:tensorflow:Using a while_loop for converting DecodeRaw

Which results in execution times of vectorized_map being similar to map_fn.

Any ideas to solve this?
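
A possible workaround for this specific case, assuming every record has the same byte length: tf.io.decode_raw accepts a whole batch of equal-length strings, so the batch can be decoded once and then sliced, with no vectorized_map or map_fn involved. The function name split_batch and the sample bytes are made up for this sketch, which assumes TF 2.x eager execution:

```python
import tensorflow as tf

# Hypothetical sample data: 512 records of 4 bytes each.
dataset = tf.data.Dataset.from_tensor_slices([b"\x00\x01\x02\x03"] * 512)

def split_batch(batch):
    # Decode the whole batch at once: shape [batch_size, 4], dtype uint8.
    decoded = tf.io.decode_raw(batch, tf.uint8)
    # Slice the decoded bytes instead of taking substrings first.
    return decoded[:, :2], decoded[:, 2:]

dataset = dataset.batch(32).map(split_batch)
first, second = next(iter(dataset))
print(first.shape, second.shape)
```

This only covers ops that already accept batched input; ops without a registered converter would still fall back to a while_loop inside vectorized_map.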

@minhhg @vidursatija

@tf.function(experimental_compile=True)
def f(elems):
  return tf.map_fn(lambda x: x**6 , elems, dtype=tf.float64, parallel_iterations=56)

elems = tf.range(1, 1000000, dtype=np.float64)

f(elems) # warmup
%timeit f(elems)
-> 10 loops, best of 3: 68.3 ms per loop

@tf.function(experimental_compile=True)
def vectorized_f(elems):
  return tf.vectorized_map(lambda x: x**6 , elems)

vectorized_f(elems) # warmup
%timeit vectorized_f(elems)
->1000 loops, best of 3: 1.26 ms per loop

def np_f(elems):
  return [x**6 for x in elems]

np_elems = np.array(range(1, 1000000), dtype=np.float64)
np_f(np_elems) # warmup
%timeit np_f(np_elems)
-> 1 loop, best of 3: 498 ms per loop

@nikhil1008 It looks like your code is also measuring the “graph construction cost” which is likely unintentional.

Here is my measurement, using TF 2.x:

def diff_floor(x):
  xp = 10
  a = tf.range(-xp, xp + 1, 1, 'float32')
  b = tf.tile(x, (2 * xp + 1,))
  b = b - a
  b = 5000 * b
  b = 1 + tf.math.exp(-b)
  b = 1.0 / b
  b = tf.reduce_sum(b) - xp -1
  return b

@tf.function(experimental_compile=True)
def f(a):
  return tf.map_fn(diff_floor, tf.reshape(a, (-1, 1)), parallel_iterations=8)

f(tf.constant(np.random.rand(224, 224, 3), dtype=tf.float32))  # warmup
%timeit f(tf.constant(np.random.rand(224, 224, 3), dtype=tf.float32))
 -> 10 loops, best of 3: 21.8 ms per loop

# Replacing the map_fn with a vectorized_map gives:
->100 loops, best of 3: 5.02 ms per loop

I had the same issue. vectorized_map is faster but still subpar compared with the NumPy alternatives. On datasets of the same size, numpy.apply_along_axis takes 50 microseconds over a NumPy array, while tf.vectorized_map takes 25 ms over a tensor (NumPy is 500x faster). Compared with tf.map_fn over a tensor, things get even worse, since it takes 300 ms (NumPy is 6000x faster).

Whoever is coming here, here are some of the pitfalls to be aware of while using vectorized_map: https://www.tensorflow.org/api_docs/python/tf/vectorized_map