tensorflow: Error message for running tf.nn.max_pool_with_argmax() on CPU
Running tf.nn.max_pool_with_argmax() on CPU gives a very obscure error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'MaxPoolWithArgmax' with these attrs. Registered devices: [CPU], Registered kernels: <no registered kernels>
From this line:
I think it would be useful for the error message to mention that tf.nn.max_pool_with_argmax() is only implemented for GPU.
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Reactions: 2
- Comments: 56 (7 by maintainers)
Is there an alternative to tf.nn.max_pool_with_argmax? I only have CPUs.
Is there an alternative to tf.nn.max_pool_with_argmax? I only want to test my trained model in a CPU environment.
Yes, it would be great to have tf.nn.max_pool_with_argmax working on CPU.
This op seems very useful for fast encoder-decoder architectures (I see it in two segmentation networks, ENet and SegNet) that were designed for real-time image segmentation. So presumably people care about performance with a view to running these networks on CPUs or mobile devices.
@tfboyd any chance to have it available on CPU as well? I'd like to contribute if I could, but I have never worked on the TensorFlow source code. Should this issue be reopened, or should we file a new one?
I was able to make a workaround for CPU. Warning: it is slow, bloated, and basically a last resort. But I can get ENet to run on my CPU at 5 s/image… Right now it only handles 2x2 max pooling, but you could change the numbers and build on this example to generalize it.
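The original code blocks from that workaround are not preserved in this thread. As a rough illustration of the idea, here is a minimal NumPy sketch (not the poster's actual code) of 2x2 max pooling that also records the flat index of each winning element, which is the pair of outputs the missing GPU-only op would produce:

```python
import numpy as np

def max_pool_2x2_with_argmax(x):
    """2x2 max pooling over a 2D array that also returns flat indices
    into x, mimicking the (value, argmax) pair that
    tf.nn.max_pool_with_argmax computes on GPU. Illustrative sketch only."""
    h, w = x.shape
    out = np.empty((h // 2, w // 2), dtype=x.dtype)
    argmax = np.empty((h // 2, w // 2), dtype=np.int64)
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            window = x[i:i + 2, j:j + 2]
            k = int(np.argmax(window))            # 0..3 within the 2x2 window
            out[i // 2, j // 2] = window.flat[k]
            # convert the in-window position back to a flat index into x
            argmax[i // 2, j // 2] = (i + k // 2) * w + (j + k % 2)
    return out, argmax

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [9., 1., 2., 3.],
              [4., 5., 6., 7.]])
pooled, idx = max_pool_2x2_with_argmax(x)
# pooled -> [[4. 8.], [9. 7.]], idx -> [[5 7], [8 15]]
```

A real TF-graph workaround would express the same computation with ops that have CPU kernels (reshapes, comparisons, tf.where, etc.), which is why it ends up slow and bloated.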
Also, that part of the GPU kernel is on this line. Simply copying that code from the GPU kernel to the CPU kernel may not be so bad for this operation. It is a nice operation to have, and even without optimization it would probably still be much faster than what I ended up doing 😛
@mmpinso No, if you look at the code, the LaunchMaxPoolingWithArgmax function calls the CUDA code to compute max pooling with argmax on GPU. The problem is the unpooling operation that uses the scatter function, and it looks like everybody needs the unpooling operation. Could somebody implement this operation for CPU and share the code?
Here is a sample in Python; we need to implement it in C++:
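(The original sample was not preserved in this thread. As a hedged sketch of what such an unpooling looks like: it scatters each pooled value back to the position recorded in the argmax tensor. The helper name and NumPy formulation below are mine, not the missing code; the TF versions people share do the same thing with tf.scatter_nd.)

```python
import numpy as np

def unpool_with_argmax(pooled, argmax, out_shape):
    """Inverse of max pooling with argmax: scatter each pooled value
    back to the flat position recorded for it, zeros elsewhere.
    NumPy analogue of the tf.scatter_nd-based unpool. Sketch only."""
    out = np.zeros(out_shape, dtype=pooled.dtype)
    out.flat[argmax.ravel()] = pooled.ravel()   # fancy indexing on the flat view
    return out

pooled = np.array([[4., 8.], [9., 7.]])
argmax = np.array([[5, 7], [8, 15]])            # flat indices into a 4x4 input
restored = unpool_with_argmax(pooled, argmax, (4, 4))
# restored has 4, 8, 9, 7 at positions (1,1), (1,3), (2,0), (3,3), zeros elsewhere
```

A C++ kernel would follow the same pattern: one pass over the pooled tensor, writing each value to the index its argmax entry names.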
@DenisN03, hi. Since ENet seems to be much faster than SegNet on desktop, and I am trying to run it on Android in real time, latency is the most important thing for me. @mmpinso You are ahead of me; I haven't succeeded in running ENet on CPU even on my desktop. I just looked at the full maxpooling_op.cc shared by @saeed68gm, and there are some differences from mine. I will try more in the next few days, and once there's some progress I'll sync up 😃
@tfboyd it seems this is still not working on CPUs. Can you reopen this?
@DenisN03 We applied the solution I mentioned before and we have it running. Inference time is around 3 seconds on a Huawei P10 Lite.
@liangxiao05 did you get the following error? No OpKernel was registered to support Op ‘ScatterNd’ with these attrs.
[[Node: ENet/unpool/ScatterNd = ScatterNd[T=DT_FLOAT, Tindices=DT_INT32](ENet/unpool/transpose, ENet/unpool/Reshape_2, ENet/unpool/ScatterNd/shape)]]
@saeed68gm, could this be because of the following, which is commented out inside Compute() for MaxPoolingGradWithArgmaxOp (lines 1069-1070)?
// SpatialMaxPoolWithArgMaxHelper<CPUDevice, T>(
//     context, grad_out, &argmax, grad_in, params, padding_);
The intuition is that ScatterNd is used by unpool, which relies on the pooling indices computed by max_pool_with_argmax during the downsampling phase. If those indices were not computed, or were computed incorrectly, the unpool operation that consumes them will produce wrong results.
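To make that dependency concrete, here is a tiny NumPy illustration (hypothetical values, a single 2x2 window): unpooling only restores the max to its original place if the recorded index is right, and a corrupted index silently puts the value elsewhere.

```python
import numpy as np

# One 2x2 window: max pooling records the winning flat position,
# and unpooling scatters the pooled value back to exactly that position.
x = np.array([[1., 4.],
              [2., 3.]])
k = int(np.argmax(x))        # flat index of the max: 1
pooled = x.flat[k]           # 4.0

unpooled = np.zeros_like(x)
unpooled.flat[k] = pooled    # correct index -> max restored in place

wrong = np.zeros_like(x)
wrong.flat[0] = pooled       # a wrong index puts the value elsewhere
```

So if the argmax output of the pooling op is missing or wrong on CPU, the ScatterNd-based unpool downstream is corrupted even when ScatterNd itself works.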
@saeed68gm Ok, I think I understood the context: I should use this custom TF version only to build the Android inference library and run the model on the phone. So I'll install it in a separate env and keep my usual TF installation in another one for training and everything else. I revised the config accordingly and now it compiles. ( @liangxiao05 I haven't forgotten about the inference time )
@mmpinso, @liangxiao05 why do you run ENet instead of SegNet? Did you run ENet on CPU?
@saeed68gm, thanks, I will follow your instructions next. @mmpinso, same as you, I'm working with ENet on Android. So what latency do you get when running it on your mobile?