tensorflow: Extremely slow eigendecomposition compared to numpy/scipy.
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS 10.15.6
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): v2.3.0-rc2-23-gb36436b087 2.3.0
- Python version: 3.7.7
- Bazel version (if compiling from source): N/A
- GCC/Compiler version (if compiling from source): N/A
- CUDA/cuDNN version: N/A
- GPU model and memory: N/A
I am using eigendecomposition in TensorFlow and find that it is extremely slow. This is on a Mac, so these are CPU computations, but I’ve also run this on a Linux box with a GPU and found the same thing. Here’s code comparing TensorFlow’s speed with numpy and scipy:
import numpy as np
import scipy as sp
import scipy.linalg  # make sp.linalg available; `import scipy` alone may not expose the submodule
import tensorflow as tf
from time import time
A = np.random.randn(400, 400)  # 400 x 400 real matrix
A_tf = tf.constant(A)
cur = time()
d, v = sp.linalg.eig(A)
print(f'sp: {time() - cur:4.2f} s')
cur = time()
d, v = np.linalg.eig(A)
print(f'np: {time() - cur:4.2f} s')
cur = time()
d, v = tf.linalg.eig(A_tf)
print(f'tf: {time() - cur:4.2f} s')
This gives the following output:
sp: 0.09 s
np: 0.08 s
tf: 5.04 s
Any ideas of what’s up here?
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 23 (12 by maintainers)
@refraction-ray This solution solved the problem. I achieved about a 20x speed-up on my MacBook for N=500. With tf.linalg.eig, 25 iterations of ADAM took ~300 s. With py_function wrapping np.linalg.eig, 25 iterations took ~14 s. I thought there might be a solution like this, but I couldn’t find any help online about how to use custom gradients with py_function. Thank you tremendously for your help here!!
This is ultimately due to the single-threaded Eigen implementation of the eig op, which could in principle be linked against multithreaded MKL but is not in the TensorFlow Bazel setup. There are many similar issues complaining about the speed, such as https://github.com/tensorflow/tensorflow/issues/7128 and https://github.com/tensorflow/tensorflow/issues/13222, and the problem can only be fully addressed by https://github.com/tensorflow/tensorflow/issues/34924, namely by supporting MKL linkage for Eigen when compiling TensorFlow. But as described in that issue, I am not fully clear how to make such a setup work.
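One way to convince yourself this is not just a thread-configuration problem is the minimal sketch below (an illustration, not from the original thread; it assumes a fresh Python process, since the threading options must be set before TensorFlow runs any op). If the op really is a single-threaded Eigen implementation, raising the intra-op thread count should not make tf.linalg.eig noticeably faster:
import numpy as np
import tensorflow as tf
from time import time
# Must be called before TensorFlow executes any op.
tf.config.threading.set_intra_op_parallelism_threads(8)
A_tf = tf.constant(np.random.randn(400, 400))
cur = time()
d, v = tf.linalg.eig(A_tf)
print(f'tf with 8 intra-op threads: {time() - cur:4.2f} s')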
For now there is a workaround using tf.py_function, which can use eig from numpy or scipy in the forward pass (ultimately provided by multithreaded MKL or OpenBLAS) while still enjoying automatic differentiation. See the demo below this comment (the gradient code part is directly copied from the tf codebase). Such an approach provides speed similar to numpy’s eig and is compatible with TF’s AD infrastructure: the 400x400 case with gradient calculation takes around 0.5 s, while a pure tf.linalg.eig forward pass with backpropagation needs 8.5 s. Of course, this is not a perfect solution, since py_function has many limitations (for example, it cannot be serialized), but I guess it is OK for most research cases where one just plays with small things in Python.
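A minimal sketch of the wrapping pattern (a reconstruction, not the exact demo from the thread): the forward pass calls numpy through tf.py_function, and tf.custom_gradient attaches a backward pass so the result works under a GradientTape. For brevity it assumes a real symmetric input and uses np.linalg.eigh together with the standard eigh gradient formula, rather than reproducing TF’s general eig gradient; the name np_eigh is an illustrative choice.
import numpy as np
import tensorflow as tf

@tf.custom_gradient
def np_eigh(a):
    # Forward pass in numpy, which calls multithreaded LAPACK (MKL/OpenBLAS).
    e, v = tf.py_function(
        func=lambda x: np.linalg.eigh(x.numpy()),
        inp=[a],
        Tout=[a.dtype, a.dtype],
    )
    # py_function loses static shape information; restore it from the input.
    e.set_shape(a.shape[:-1])
    v.set_shape(a.shape)

    def grad(grad_e, grad_v):
        # Outputs that do not contribute to the loss may arrive as None.
        if grad_e is None:
            grad_e = tf.zeros_like(e)
        if grad_v is None:
            grad_v = tf.zeros_like(v)
        # Standard eigh gradient: dA = V (diag(dE) + F * (V^T dV)) V^T,
        # with F_ij = 1 / (e_j - e_i) off the diagonal and 0 on it.
        f = e[..., tf.newaxis, :] - e[..., :, tf.newaxis]
        f = tf.math.divide_no_nan(tf.ones_like(f), f)  # exact zeros on the diagonal map to 0
        vt = tf.linalg.adjoint(v)
        mid = tf.linalg.diag(grad_e) + f * tf.matmul(vt, grad_v)
        grad_a = tf.matmul(v, tf.matmul(mid, vt))
        # Symmetrize, since the input is assumed symmetric.
        return 0.5 * (grad_a + tf.linalg.adjoint(grad_a))

    return (e, v), grad
Usage sketch (hypothetical loss, just to show that gradients flow back to the variable; the variable is symmetrized before the decomposition so the eigh assumption holds):
A_var = tf.Variable(np.random.randn(400, 400))
with tf.GradientTape() as tape:
    S = 0.5 * (A_var + tf.transpose(A_var))
    e, v = np_eigh(S)
    loss = tf.reduce_sum(e ** 2) + tf.reduce_sum(v ** 2)
grad = tape.gradient(loss, A_var)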