tensorflow: Unable to use multiple CPU cores in TensorFlow

This issue was previously asked here on StackOverflow (with no answers at the time of this issue): https://stackoverflow.com/q/52507748/188046


System information

  • Have I written custom code: custom Python code, but no custom ops (Python is linked below)
  • OS Platform and Distribution: Ubuntu 16.04.5
  • Mobile device: N/A
  • TensorFlow installed from: binary, from tensorflow PyPI package via pip (also tried from conda with same result)
  • TensorFlow version: v1.11.0-0-gc19e29306c 1.11.0
  • Python version: 3.6.6 from conda
  • Bazel version: N/A
  • GCC/Compiler version: N/A
  • CUDA/cuDNN version: N/A, problem occurs on CPU
  • GPU model and memory: N/A, problem occurs on CPU
  • CPU model: Intel® Xeon® CPU E5-2630 v4 @ 2.20GHz (2 sockets, 10 cores per socket)
  • Exact command to reproduce: See below

Describe the problem

I am unable to configure TensorFlow to use multiple CPU cores for inter-op parallelism on my machine. As described in my StackOverflow question, I have read other answers extensively, scoured the first page of Google search results for several keywords, and tried everything I’ve seen suggested, but I just can’t get this to work.

I have included a program below that demonstrates the problem. The program calls matmul once per core (i.e. weak scaling), so I would expect the running time to stay roughly constant as the number of cores increases. Instead, the running time increases roughly linearly with the core count, indicating that the matmul ops are running sequentially rather than in parallel.

I have also confirmed via htop that only one core is in use while the program is running; the system is otherwise idle. htop can show individual threads within a process, but I do not see any extra threads either (or they are not using enough CPU to appear on the first page of results when sorted by CPU usage).

How can I get TensorFlow to execute different operations on different cores in parallel?

Note:

  • I am creating a session with multiple CPU devices. I have also tried only creating a single CPU device, and relying entirely on inter_op_parallelism_threads. Nothing I have tried has been able to use multiple cores.
  • I can comment out the line with tf.device(d):, and it makes no difference.
  • I have tried tracing (see the commented out lines), and the trace seems to reflect what I’d expect. Ops are being assigned to the CPUs like I want them to be. However, they still don’t run in parallel.
  • I have also tried generating a Chrome trace (commented out lines at the very bottom). The Chrome trace doesn’t seem to be working properly, or at least the reported running times are far from what they should be, so I’m not sure how much this information can be relied upon. Perhaps I’m doing something wrong.

Source code / logs

Source for test_cores.py: https://gist.github.com/elliottslaughter/750a27c832782f4daec8686281027de8

Sample output:

$ python test_cores.py 
2018-09-29 09:40:34.489657: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Running on 1 cores
  Assigning matmul to /cpu:0
  Duration (via time.perf_counter()): 3.209691 (693823.014392 - 693819.804701)
  Clock (via time.clock()): 3.205479 (8.912035 - 5.706556)
Running on 2 cores
  Assigning matmul to /cpu:0
  Assigning matmul to /cpu:1
  Duration (via time.perf_counter()): 6.452124 (693829.493906 - 693823.041782)
  Clock (via time.clock()): 6.449224 (15.388567 - 8.939343)

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 4
  • Comments: 37

Most upvoted comments

I can confirm that the constant folding pass was the issue. Using tf.placeholder as suggested does fix the problem. For anyone who comes here later, I’ve updated my gist, and you can see the difference with the new --no-const-fold option:

$ time python test_cores.py 2 --use-inter
2018-10-18 10:07:53.755221: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Running on 1 CPU devices with 2 inter- and 1 intra-parallelism
  Assigning matmul to /cpu:0
  Assigning matmul to /cpu:0
  Duration (via time.perf_counter()): 46.047578 (1190217.583074 - 1190171.535495)
  Clock (via time.clock()): 46.045166 (51.374167 - 5.329001)

real	0m47.237s
user	0m46.413s
sys	0m5.146s
$ time python test_cores.py 2 --use-inter --no-const-fold
2018-10-18 10:08:49.194386: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Running on 1 CPU devices with 2 inter- and 1 intra-parallelism
  Assigning matmul to /cpu:0
  Assigning matmul to /cpu:0
  Duration (via time.perf_counter()): 25.088182 (1190255.414810 - 1190230.326628)
  Clock (via time.clock()): 48.275543 (57.226300 - 8.950757)

real	0m29.652s
user	0m49.870s
sys	0m7.621s

The updated gist shows how to use tf.placeholder to achieve this:

https://gist.github.com/elliottslaughter/750a27c832782f4daec8686281027de8

Thanks @azaks2 for your help!

@harshini-gadige Thanks for the suggestion. Though in my real use case, I’m not especially interested in intra-op parallelism, you’re right that this is an interesting data point for debugging purposes.

Unfortunately, changing the value of intra_op_parallelism_threads doesn’t seem to make a difference. E.g. when I set intra_op_parallelism_threads=n_cpus I get exactly the same running times as when I set intra_op_parallelism_threads=1. I also get the same running times if I hard-code the value to e.g. 4.

@elliottslaughter Hi, thanks for trying it out. When I tried at my end, I got below results. Please check.

When intra_op_parallelism_threads=1, Result :

Running on 1 cores
  Assigning matmul to /cpu:0
  Duration (via time.perf_counter()): 2.494752 (638917.741260 - 638915.246508)
  Clock (via time.clock()): 2.491805 (5.819835 - 3.328030)

When intra_op_parallelism_threads=2, Result :

Running on 2 cores
  Assigning matmul to /cpu:0
  Assigning matmul to /cpu:1
  Duration (via time.perf_counter()): 2.647655 (638947.747454 - 638945.099799)
  Clock (via time.clock()): 5.101251 (7.604016 - 2.502765)