tensorflow: tf.signal CPU FFT implementation is slower than NumPy, PyTorch, etc.
What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?
Environment info
Operating System: Ubuntu 16.04 LTS 64bit
Installed version of CUDA and cuDNN (output of `ls -l /path/to/cuda/lib/libcud*`):

```
-rw-r--r-- 1 root root   558720 Sep 15 07:02 libcudadevrt.a
lrwxrwxrwx 1 root root       16 Sep 15 07:05 libcudart.so -> libcudart.so.8.0
lrwxrwxrwx 1 root root       19 Sep 15 07:05 libcudart.so.8.0 -> libcudart.so.8.0.44
-rw-r--r-- 1 root root   415432 Sep 15 07:02 libcudart.so.8.0.44
-rw-r--r-- 1 root root   775162 Sep 15 07:02 libcudart_static.a
-rwxr-xr-x 1 root root 79337624 Oct 27 23:13 libcudnn.so
-rwxr-xr-x 1 root root 79337624 Oct 27 23:13 libcudnn.so.5
-rwxr-xr-x 1 root root 79337624 Oct 27 23:13 libcudnn.so.5.1.5
-rw-r--r-- 1 root root 69756172 Oct 27 23:13 libcudnn_static.a
```
If installed from binary pip package, provide:
- A link to the pip package you installed:
- The output from `python -c "import tensorflow; print(tensorflow.__version__)"`: `0.12.head`

If installed from source, provide:
- The commit hash (`git rev-parse HEAD`)
- The output of `bazel version`
If possible, provide a minimal reproducible example (We usually don’t have time to read hundreds of lines of your code)
```python
import numpy as np
import tensorflow as tf
import time

wav = np.random.random_sample((1024,))
spec = np.fft.fft(wav)[:513]

x = tf.placeholder(dtype=tf.complex64, shape=[513])
result = tf.ifft(x)
sess = tf.Session()

start = time.time()
for i in range(10000):
    something = sess.run(result, feed_dict={x: spec})
print('tensorflow:{}s'.format(time.time() - start))

start = time.time()
for i in range(10000):
    something = np.fft.ifft(spec)
print('numpy:{}s'.format(time.time() - start))
```

Output:

```
tensorflow:25.7219519615s
numpy:0.391902923584s
```
What other attempted solutions have you tried?
Logs or other output that would be helpful
(If logs are large, please upload as attachment or provide link).
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Comments: 49 (33 by maintainers)
Could we expect any update on this? CPU is a popular target for inference, and closing a 10x+ speed gap would be great.
I think this is a very relevant issue, since almost all speech applications need this. I developed a model for speech recognition, and on branches 1.13.1 or 1.14 with the contrib spectrogram implementation it runs 10x or 20x faster than on TF 2.x with the signal.stft implementation.
The situation has not improved with TF 2.6:
@rmothukuru the gist you linked does not seem to be a proper benchmark since it only executes variable assignment in a loop, no computation.
Here is an example demonstrating that TF 2.5 is still 10x slower than numpy: https://colab.research.google.com/gist/Andreas5739738/fc603468829ee0fc7e40a2e27d8a6661/fft.ipynb
Here is a simple benchmark comparing tensorflow and numpy’s real-valued 1D FFT on CPU, showing that TF (version 2.1) is slower than numpy by a factor of 10: https://colab.research.google.com/gist/Andreas5739738/fc603468829ee0fc7e40a2e27d8a6661/fft.ipynb
It’s still an issue – bad bot, @tensorflowbutler!
Looks like things have improved significantly with the 2.9 release, although there still is a large gap to numpy and Jax:
@diggerdu, Sorry for the delayed response. I have executed your code in the latest version of Tensorflow (2.5) and observed that Tensorflow is much faster compared to Numpy. Please find the Gist of the working code. Thanks!

Hi @JPery,
I seem to notice from your issue here that you are interested in the stft for audio processing, so we are talking about the 1D stft. For 1D applications, unfortunately, the only way to currently benefit from `map_fn` is to compute the transform in batches and have each element in the batch be transformed separately in parallel (the more cores you have, the more acceleration you will get, basically up to your batch size). The idea behind the acceleration for 2D transforms is that you can write the application of the 2D transform as successive applications of the 1D transform on each dimension.
After reading the code of the stft, there actually is a way to do it for the stft, but it's a bit involved. My basic understanding of the stft is that you compute an fft (or rfft in this case) over multiple segments (or frames) of your input signal. You could see these segments as multiple elements in a batch, and you can definitely try to transform those in parallel. Schematically, this would mean replacing this line by something like:
So you could just copy-paste the stft code and replace the final line; you should see a speed-up related to the number of frames you consider and the number of CPUs you have at hand.
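The idea above can be sketched as follows. This is a minimal illustration, not the actual TensorFlow implementation: `parallel_stft` is a hypothetical name, and the framing/windowing mirrors what `tf.signal.stft` does by default (Hann window, no padding), with the per-frame rfft dispatched through `tf.map_fn`:

```python
import numpy as np
import tensorflow as tf

def parallel_stft(signal, frame_length=256, frame_step=128, fft_length=256):
    # Split the signal into overlapping frames, as tf.signal.stft does.
    framed = tf.signal.frame(signal, frame_length, frame_step, pad_end=False)
    # Default stft window: periodic Hann.
    window = tf.signal.hann_window(frame_length, periodic=True, dtype=signal.dtype)
    framed = framed * window
    # Transform each frame independently; map_fn can run frames in parallel
    # (speed-up bounded by the number of frames and available cores).
    return tf.map_fn(
        lambda frame: tf.signal.rfft(frame, [fft_length]),
        framed,
        fn_output_signature=tf.complex64,
        parallel_iterations=8,
    )

wav = tf.constant(np.random.random_sample(1024), dtype=tf.float32)
spec = parallel_stft(wav)
reference = tf.signal.stft(wav, frame_length=256, frame_step=128, fft_length=256)
print(np.allclose(spec.numpy(), reference.numpy(), atol=1e-3))
```

The result matches `tf.signal.stft`; only the dispatch of the per-frame transforms changes.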
Yes, we now use DUCC FFT in all of TF, JAX, XLA. This is resolved.
I tried to execute the mentioned code on tf-nightly (2.17.0-dev20240403) on CPU and observed that the execution time with TensorFlow is lower than with NumPy.
Kindly find the gist of it here. Thank you!
Are there any plans to address this issue any further? It’s a huge inconvenience for anyone wanting to perform FFT transformations in a deployed model.
Looks like the Jax team found the same issue with XLA FFT slowness and integrated PocketFFT as a workaround: https://github.com/google/jax/issues/2952 It would be great if the PocketFFT op could also be integrated into XLA itself, so that both TF and Jax benefit from the speedup.
Actually, using this idea, you can build a much faster FFT for dimension 2 or higher. Here is an example for 2D:
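The original example code is not preserved in this archive; the following is a minimal sketch of the idea under my own assumptions (the function name `fft2d_via_1d` is illustrative): a 2D FFT expressed as successive 1D FFTs along each axis, with `tf.map_fn` running the row transforms in parallel.

```python
import numpy as np
import tensorflow as tf

def fft2d_via_1d(x):
    # 1D FFT of every row (last axis); map_fn can dispatch rows in parallel.
    rows = tf.map_fn(tf.signal.fft, x, parallel_iterations=8)
    # Transpose, FFT the former columns the same way, transpose back.
    cols = tf.map_fn(tf.signal.fft, tf.transpose(rows), parallel_iterations=8)
    return tf.transpose(cols)

# Non-power-of-2 shape, where the reported speed-up is visible.
x_np = (np.random.random_sample((12, 15))
        + 1j * np.random.random_sample((12, 15))).astype(np.complex64)
out = fft2d_via_1d(tf.constant(x_np)).numpy()
print(np.allclose(out, np.fft.fft2(x_np), atol=1e-2))
```

The row/column decomposition is exact, so the result agrees with a direct 2D FFT; only the scheduling of the 1D transforms differs.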
The speed-up is only visible on non-power-of-2 shapes, and in this case, on my machine with 8 cores, you can go from 150 ms to 30 ms. My guess is the speed-up will be even more significant if you have more cores and higher dimensions.

The main problem with FFT ops in TensorFlow that makes them slow is that we compute the FFT plan on every execution instead of caching it for a given size. Due to the multi-threaded nature of op execution, nobody has done the work of implementing a plan cache that would be thread safe. Beyond this, Eigen's "TensorFFT" itself is not particularly fast when compared to other libraries like FFTW (which we can't use in TensorFlow due to lack of legal approval).