tensorflow: Unable to use multiple CPU cores in TensorFlow
This issue was previously asked here on StackOverflow (with no answers at the time of this issue): https://stackoverflow.com/q/52507748/188046
System information
- Have I written custom code: custom Python code, but no custom ops (Python is linked below)
- OS Platform and Distribution: Ubuntu 16.04.5
- Mobile device: N/A
- TensorFlow installed from: binary, from the `tensorflow` PyPI package via pip (also tried from conda with the same result)
- TensorFlow version: v1.11.0-0-gc19e29306c 1.11.0
- Python version: 3.6.6 from conda
- Bazel version: N/A
- GCC/Compiler version: N/A
- CUDA/cuDNN version: N/A, problem occurs on CPU
- GPU model and memory: N/A, problem occurs on CPU
- CPU model: Intel® Xeon® CPU E5-2630 v4 @ 2.20GHz (2 sockets, 10 cores per socket)
- Exact command to reproduce: See below
Describe the problem
I am unable to configure TensorFlow to use multiple CPU cores for inter-op parallelism on my machine. As described in my StackOverflow question, I have read other answers extensively, scrubbed the first page of Google search results for several keywords, and tried everything I've seen suggested, but I just can't get this to work.
I have included a program below that demonstrates the problem. The program calls matmul once per core (i.e. weak scaling). I would expect that as the number of cores increases, the running time would stay roughly constant. Instead, the running time seems to increase linearly with the core count, indicating that the matmul ops are running sequentially, not in parallel.
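In outline, the program looks roughly like this (a minimal sketch, not the gist verbatim; the matrix size and variable names here are placeholders):

```python
import time
import tensorflow as tf

n_cpus = 2  # placeholder; the real script takes the core count as an argument

# One virtual CPU device per core, plus an inter-op thread pool of the same size.
config = tf.ConfigProto(
    device_count={"CPU": n_cpus},
    inter_op_parallelism_threads=n_cpus,
    intra_op_parallelism_threads=1)

ops = []
for i in range(n_cpus):
    with tf.device('/cpu:%d' % i):
        print('Assigning matmul to /cpu:%d' % i)
        a = tf.constant(1.0, shape=[4096, 4096])
        ops.append(tf.matmul(a, a))  # one matmul per core (weak scaling)

with tf.Session(config=config) as sess:
    start = time.perf_counter()
    sess.run(ops)  # expectation: the matmuls run concurrently
    print('Duration (via time.perf_counter()): %f' % (time.perf_counter() - start))
```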
I have also confirmed via htop that there is only one core on my CPU that is in use when the program is running. The system is otherwise idle. htop has the capability to show multiple threads within a process, but I do not even see these (or they are not using enough CPU to show up on the first page of results when sorted by CPU usage).
How can I get TensorFlow to execute different operations on different cores in parallel?
Note:
- I am creating a session with multiple CPU devices. I have also tried only creating a single CPU device, and relying entirely on `inter_op_parallelism_threads`. Nothing I have tried has been able to use multiple cores.
- I can comment out the line `with tf.device(d):`, and it makes no difference.
- I have tried tracing (see the commented-out lines), and the trace seems to reflect what I'd expect. Ops are being assigned to the CPUs like I want them to be. However, they still don't run in parallel.
- I have also tried generating a Chrome trace (commented-out lines at the very bottom; the standard approach is sketched after this list). The Chrome trace doesn't seem to be working properly, or at least the reported running times are far off from what they should be, so I'm not sure how much this information can be relied upon. Perhaps I'm doing something wrong.
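For reference, the trace generation follows the standard TF 1.x timeline pattern, roughly like this (my gist's actual code may differ slightly; `sess` and `ops` here are the ones from the sketch above):

```python
from tensorflow.python.client import timeline
import tensorflow as tf

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(ops, options=run_options, run_metadata=run_metadata)

# Write a Chrome trace; open the file in chrome://tracing to inspect it.
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())
```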
Source code / logs
Source for test_cores.py: https://gist.github.com/elliottslaughter/750a27c832782f4daec8686281027de8
Sample output:
```
$ python test_cores.py
2018-09-29 09:40:34.489657: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Running on 1 cores
Assigning matmul to /cpu:0
Duration (via time.perf_counter()): 3.209691 (693823.014392 - 693819.804701)
Clock (via time.clock()): 3.205479 (8.912035 - 5.706556)
Running on 2 cores
Assigning matmul to /cpu:0
Assigning matmul to /cpu:1
Duration (via time.perf_counter()): 6.452124 (693829.493906 - 693823.041782)
Clock (via time.clock()): 6.449224 (15.388567 - 8.939343)
```
I can confirm that the constant folding pass was the issue. Using `tf.placeholder` as suggested does fix the problem. For anyone who comes here later, I've updated my gist, and you can see the difference with the new `--no-const-fold` option. The updated gist shows how to use `tf.placeholder` to achieve this: https://gist.github.com/elliottslaughter/750a27c832782f4daec8686281027de8
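The essence of the fix, sketched roughly (my own reconstruction, not the gist verbatim): feeding the matrix through a `tf.placeholder` keeps Grappler's constant-folding pass from evaluating the matmuls during graph optimization, so they actually execute, in parallel, at `Session.run` time.

```python
import numpy as np
import tensorflow as tf

n_cpus = 2  # placeholder value for illustration

config = tf.ConfigProto(
    device_count={"CPU": n_cpus},
    inter_op_parallelism_threads=n_cpus)

# A fed input is not a compile-time constant, so constant folding
# cannot collapse the matmuls ahead of time.
x = tf.placeholder(tf.float32, shape=[4096, 4096])
ops = []
for i in range(n_cpus):
    with tf.device('/cpu:%d' % i):
        ops.append(tf.matmul(x, x))

with tf.Session(config=config) as sess:
    sess.run(ops, feed_dict={x: np.random.rand(4096, 4096).astype(np.float32)})
```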
Thanks @azaks2 for your help!
@harshini-gadige Thanks for the suggestion. Though I'm not especially interested in intra-op parallelism in my real use case, you're right that this is an interesting data point for debugging purposes.
Unfortunately, changing the value of `intra_op_parallelism_threads` doesn't seem to make a difference. E.g. when I set `intra_op_parallelism_threads=n_cpus` I still get the exact same running times as when I set `intra_op_parallelism_threads=1`. I also get the same running times if I hard-code the value to e.g. `4`.

@elliottslaughter Hi, thanks for trying it out. When I tried at my end, I got the results below. Please check.
When `intra_op_parallelism_threads=1`:

```
Running on 1 cores
Assigning matmul to /cpu:0
Duration (via time.perf_counter()): 2.494752 (638917.741260 - 638915.246508)
Clock (via time.clock()): 2.491805 (5.819835 - 3.328030)
```

When `intra_op_parallelism_threads=2`:

```
Running on 2 cores
Assigning matmul to /cpu:0
Assigning matmul to /cpu:1
Duration (via time.perf_counter()): 2.647655 (638947.747454 - 638945.099799)
Clock (via time.clock()): 5.101251 (7.604016 - 2.502765)
```
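For anyone comparing these two knobs later: both are plain `tf.ConfigProto` fields in TF 1.x. A minimal sketch:

```python
import tensorflow as tf

config = tf.ConfigProto(
    inter_op_parallelism_threads=2,  # how many independent ops may run at once
    intra_op_parallelism_threads=2)  # how many threads a single op (e.g. one matmul) may use
sess = tf.Session(config=config)
```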