tensorflow: tf.signal CPU FFT implementation is slower than NumPy, PyTorch, etc.
What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?
Environment info
Operating System: Ubuntu 16.04 LTS 64bit
Installed version of CUDA and cuDNN (output of `ls -l /path/to/cuda/lib/libcud*`):

```
-rw-r--r-- 1 root root   558720 Sep 15 07:02 libcudadevrt.a
lrwxrwxrwx 1 root root       16 Sep 15 07:05 libcudart.so -> libcudart.so.8.0
lrwxrwxrwx 1 root root       19 Sep 15 07:05 libcudart.so.8.0 -> libcudart.so.8.0.44
-rw-r--r-- 1 root root   415432 Sep 15 07:02 libcudart.so.8.0.44
-rw-r--r-- 1 root root   775162 Sep 15 07:02 libcudart_static.a
-rwxr-xr-x 1 root root 79337624 Oct 27 23:13 libcudnn.so
-rwxr-xr-x 1 root root 79337624 Oct 27 23:13 libcudnn.so.5
-rwxr-xr-x 1 root root 79337624 Oct 27 23:13 libcudnn.so.5.1.5
-rw-r--r-- 1 root root 69756172 Oct 27 23:13 libcudnn_static.a
```
If installed from binary pip package, provide:
- A link to the pip package you installed:
- The output from `python -c "import tensorflow; print(tensorflow.__version__)"`: `0.12.head`

If installed from source, provide:
- The commit hash (`git rev-parse HEAD`)
- The output of `bazel version`
If possible, provide a minimal reproducible example (We usually don’t have time to read hundreds of lines of your code)
```python
import numpy as np
import tensorflow as tf
import time

wav = np.random.random_sample((1024,))
spec = np.fft.fft(wav)[:513]

x = tf.placeholder(dtype=tf.complex64, shape=[513])
result = tf.ifft(x)
sess = tf.Session()

start = time.time()
for i in range(10000):
    something = sess.run(result, feed_dict={x: spec})
print('tensorflow:{}s'.format(time.time() - start))

start = time.time()
for i in range(10000):
    something = np.fft.ifft(spec)
print('numpy:{}s'.format(time.time() - start))
```

Output:

```
tensorflow:25.7219519615s
numpy:0.391902923584s
```
What other attempted solutions have you tried?
Logs or other output that would be helpful
(If logs are large, please upload as attachment or provide link).
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Comments: 49 (33 by maintainers)
Could we expect any update on this? CPU is a popular target for inference, and closing a 10x+ speed gap would be great.
I think this is a very relevant issue, since almost all speech applications need this. I developed a model for speech recognition, and on branches 1.13.1 or 1.14 with the contrib spectrogram implementation it runs 10x or 20x faster than on TF 2.x with the signal.stft implementation.
The situation has not improved with TF 2.6:
@rmothukuru the gist you linked does not seem to be a proper benchmark since it only executes variable assignment in a loop, no computation.
Here is an example demonstrating that TF 2.5 is still 10x slower than numpy: https://colab.research.google.com/gist/Andreas5739738/fc603468829ee0fc7e40a2e27d8a6661/fft.ipynb
Here is a simple benchmark comparing tensorflow and numpy’s real-valued 1D FFT on CPU, showing that TF (version 2.1) is slower than numpy by a factor of 10: https://colab.research.google.com/gist/Andreas5739738/fc603468829ee0fc7e40a2e27d8a6661/fft.ipynb
It’s still an issue – bad bot, @tensorflowbutler!
Looks like things have improved significantly with the 2.9 release, although there still is a large gap to numpy and Jax:
@diggerdu, Sorry for the delayed response. I have executed your code in the latest version of Tensorflow (2.5) and observed that Tensorflow is much faster compared to Numpy. Please find the Gist of the working code. Thanks!

Hi @JPery,
I seem to notice from your issue here that you are interested in the stft for audio processing, so we are talking about the 1D stft. For 1D applications, unfortunately, the only way to currently benefit from `map_fn` is to compute the transform in batches and have each element in the batch be transformed separately in parallel (the more cores you have, the more acceleration you will get, basically up to your batch size). The idea behind the acceleration for 2D transforms is that you can write the application of the 2D transform as successive applications of the 1D transform on each dimension.
After reading the code of the stft, there actually is a way to do it for the stft, but it's a bit involved. My basic understanding of the stft is that you compute an fft (or rfft in this case) over multiple segments (or frames) of your input signal. You could see these segments as multiple elements in a batch, and you can definitely try to transform those in parallel. Schematically, this would mean replacing this line by something like:
So you could just copy-paste the stft code and replace the final line; you should see a speed-up related to the number of frames you consider and the number of CPUs you have at hand.
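The idea above can be sketched as follows. This is a minimal illustration, not the actual TensorFlow implementation: `parallel_stft` is a hypothetical name, and the framing/windowing mirrors what `tf.signal.stft` does by default (Hann window, no padding), with the per-frame rfft dispatched through `tf.map_fn`:

```python
import numpy as np
import tensorflow as tf

def parallel_stft(signal, frame_length=256, frame_step=128, fft_length=256):
    # Split the signal into overlapping frames, as tf.signal.stft does.
    framed = tf.signal.frame(signal, frame_length, frame_step, pad_end=False)
    # Default stft window: periodic Hann.
    window = tf.signal.hann_window(frame_length, periodic=True, dtype=signal.dtype)
    framed = framed * window
    # Transform each frame independently; map_fn can run frames in parallel
    # (speed-up bounded by the number of frames and available cores).
    return tf.map_fn(
        lambda frame: tf.signal.rfft(frame, [fft_length]),
        framed,
        fn_output_signature=tf.complex64,
        parallel_iterations=8,
    )

wav = tf.constant(np.random.random_sample(1024), dtype=tf.float32)
spec = parallel_stft(wav)
reference = tf.signal.stft(wav, frame_length=256, frame_step=128, fft_length=256)
print(np.allclose(spec.numpy(), reference.numpy(), atol=1e-3))
```

The result matches `tf.signal.stft`; only the dispatch of the per-frame transforms changes.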
Yes, we now use DUCC FFT in all of TF, JAX, XLA. This is resolved.
I tried to execute the mentioned code on tf-nightly (2.17.0-dev20240403) on CPU and observed that the execution time with TensorFlow is lower than with NumPy.
Kindly find the gist of it here. Thank you!
Are there any plans to address this issue any further? It’s a huge inconvenience for anyone wanting to perform FFT transformations in a deployed model.
Looks like the Jax team found the same issue with XLA FFT slowness and integrated PocketFFT as a workaround: https://github.com/google/jax/issues/2952 It would be great if the PocketFFT op could also be integrated into XLA itself, so that both TF and Jax benefit from the speedup.
Actually, using this idea, you can build a much faster FFT for dimension 2 or higher. Here is an example for 2D:
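The original example code is not preserved in this archive; the following is a minimal sketch of the idea under my own assumptions (the function name `fft2d_via_1d` is illustrative): a 2D FFT expressed as successive 1D FFTs along each axis, with `tf.map_fn` running the row transforms in parallel.

```python
import numpy as np
import tensorflow as tf

def fft2d_via_1d(x):
    # 1D FFT of every row (last axis); map_fn can dispatch rows in parallel.
    rows = tf.map_fn(tf.signal.fft, x, parallel_iterations=8)
    # Transpose, FFT the former columns the same way, transpose back.
    cols = tf.map_fn(tf.signal.fft, tf.transpose(rows), parallel_iterations=8)
    return tf.transpose(cols)

# Non-power-of-2 shape, where the reported speed-up is visible.
x_np = (np.random.random_sample((12, 15))
        + 1j * np.random.random_sample((12, 15))).astype(np.complex64)
out = fft2d_via_1d(tf.constant(x_np)).numpy()
print(np.allclose(out, np.fft.fft2(x_np), atol=1e-2))
```

The row/column decomposition is exact, so the result agrees with a direct 2D FFT; only the scheduling of the 1D transforms differs.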
The speed-up is only visible on non-power-of-2 shapes, and in this case, on my machine with 8 cores, you can go from 150 ms to 30 ms. My guess is the speed-up will be even more significant if you have more cores and higher dimensions.

The main problem with FFT ops in TensorFlow that makes them slow is that we compute the FFT plan on every execution instead of caching it for a given size. Due to the multi-threaded nature of op execution, nobody has done the work of implementing a plan cache that would be thread safe. Beyond this, Eigen's "TensorFFT" itself is not particularly fast when compared to other libraries like FFTW (which we can't use in TensorFlow due to lack of legal approval).