tensorflow: [Java API] Tensor.create() slow for large arrays

The current Java API’s Tensor.create(Object) is very slow: for a batch of 128 images of size 224x224x3 it takes around 1.5 seconds. To put this into perspective, runner.run() with that data and an InceptionV3 graph took under 1 second, so data preparation costs about 1.5x the actual runtime here (for a batch of 32 images it’s around 0.35-0.45 s).

Is this working as intended? When I run the Python code with which the graph meta file was generated (TF 1.0.1, a simple sess.run(fetches, feed_dict=feed_dict)) and feed a Python array, I don’t see such hiccups; the speed matches the Java runner.run().

Might it be because of the build flags used? Maybe I’m missing some optimizations?

For now this small step is killing the overall performance, bringing it down from about 130 obs/sec (runner.run() time alone) to about 45 obs/sec (Tensor.create() + run()).

A bit of a sidenote, the performance page states:

This will result in poor performance.
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

But currently there’s no other way to feed data from the Java API, right? A queue (able to read from a file and from memory, i.e. from a Java structure) would be amazing.

Jar build command

export CC="/usr/bin/gcc"
export CXX="/usr/bin/g++"
export TF_NEED_CUDA=1
export GCC_HOST_COMPILER_PATH=$CC
export BUILDFLAGS="--config=cuda --copt=-m64 --linkopt=-m64 --copt=-march=native"

bazel build -c opt \
  //tensorflow/java:tensorflow \
  //tensorflow/java:libtensorflow_jni \
  $BUILDFLAGS --spawn_strategy=standalone --genrule_strategy=standalone

Environment info

  • OS: Ubuntu 16.04
  • GPU: TITAN X (Pascal) 12 GB
  • CPU: Intel Xeon E5-2630 v4, 10 cores
  • GPU driver: NVIDIA 375.39
  • cuDNN: 5.1.5
  • CUDA: 8
  • TensorFlow version: JAR file built from current master (c25ecb53)

Example

public void test() {
  Random r = new Random();
  int imageSize = 224 * 224 * 3;
  int batch = 128;
  float[][] input = new float[batch][imageSize];
  for(int i = 0; i < batch; i++) {
    for(int j = 0; j < imageSize; j++) {
      input[i][j] = r.nextFloat();
    }
  }

  long start = System.nanoTime();
  Tensor.create(input);
  long end = System.nanoTime();
  // Around 1.5sec
  System.out.println("Took: " + (end - start));
}
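As an aside on methodology: a single System.nanoTime() measurement like the one above also captures JNI library loading and JIT warm-up on the first call, so a steadier way to time Tensor.create is to repeat the call and take the median. A minimal, library-free harness sketch (in the issue’s scenario the Runnable would wrap the Tensor.create(input) call; the stand-in workload in main is just an illustration):

```java
import java.util.Arrays;

public class MedianTimer {
  // Runs the task `iters` times after `warmup` untimed runs and
  // returns the median elapsed time in nanoseconds.
  public static long medianNanos(Runnable task, int warmup, int iters) {
    for (int i = 0; i < warmup; ++i) {
      task.run();
    }
    long[] samples = new long[iters];
    for (int i = 0; i < iters; ++i) {
      long start = System.nanoTime();
      task.run();
      samples[i] = System.nanoTime() - start;
    }
    Arrays.sort(samples);
    return samples[iters / 2];
  }

  public static void main(String[] args) {
    // Stand-in workload; in the issue's case this would be
    // () -> Tensor.create(input)
    long median = medianNanos(() -> {
      float[] scratch = new float[1 << 16];
      Arrays.fill(scratch, 1.0f);
    }, 5, 21);
    System.out.println("Median ns: " + median);
  }
}
```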

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 7
  • Comments: 28 (15 by maintainers)

Most upvoted comments

Ok, yeah, it looks like the way to encode a TF string in a buffer is way more complex than I first thought 😕. Basically, for a string array [A, B, C], it follows a structure like this:

[offset A][offset B][offset C][length A][data A][length B][data B][length C][data C]

After a few tests, I think the following code should do the job. I’d be curious for you to give it a try in your projects; if 1) it works and 2) it is faster, I can push it as a utility in the repo. Also note that there is no padding in the following algorithm; it could be added if needed:

  private static ByteBuffer stringArrayToBuffer(String[] values) {
    long[] offsets = new long[values.length];
    byte[][] data = new byte[values.length][];
    int dataSize = 0;

    // Convert strings to encoded bytes and compute the required data size,
    // including a varint length prefix for each of them
    for (int i = 0; i < values.length; ++i) {
      byte[] byteValue = values[i].getBytes(StandardCharsets.UTF_8);
      data[i] = byteValue;
      int length = byteValue.length + varintLength(byteValue.length);
      dataSize += length;
      if (i < values.length - 1) {
        offsets[i + 1] = offsets[i] + length;
      }
    }

    // Important: the buffer must follow native byte order
    ByteBuffer buffer = ByteBuffer.allocate(dataSize + (offsets.length * 8)).order(ByteOrder.nativeOrder());

    // First, write the offset of each element into the buffer
    for (int i = 0; i < offsets.length; ++i) {
      buffer.putLong(offsets[i]);
    }

    // Second, write the string bytes, each preceded by its length encoded as a varint
    for (int i = 0; i < data.length; ++i) {
      encodeVarint(buffer, data[i].length);
      buffer.put(data[i]);
    }

    return (ByteBuffer) buffer.rewind();
  }

  private static void encodeVarint(ByteBuffer buffer, int value) {
    int v = value;
    while (v >= 0x80) {
      buffer.put((byte) ((v & 0x7F) | 0x80));
      v >>= 7;
    }
    buffer.put((byte) v);
  }

  private static int varintLength(int length) {
    int len = 1;
    while (length >= 0x80) {
      length >>= 7;
      ++len;
    }
    return len;
  }
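To make the layout concrete, here is a standalone sanity check of what stringArrayToBuffer should produce for the array ["A", "BB"]. It re-implements the encoding inline under the simplifying assumption that every string is shorter than 128 bytes, so each varint is a single byte:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class StringTensorLayoutDemo {
  // Encodes values using the layout described above:
  // [offset 0]...[offset n-1][varint len 0][data 0]...[varint len n-1][data n-1]
  // Assumes every string is shorter than 128 bytes, so its varint is one byte.
  static ByteBuffer encode(String[] values) {
    long[] offsets = new long[values.length];
    byte[][] data = new byte[values.length][];
    int dataSize = 0;
    for (int i = 0; i < values.length; ++i) {
      data[i] = values[i].getBytes(StandardCharsets.UTF_8);
      int length = data[i].length + 1; // 1-byte varint for lengths < 128
      dataSize += length;
      if (i < values.length - 1) {
        offsets[i + 1] = offsets[i] + length;
      }
    }
    ByteBuffer buffer = ByteBuffer.allocate(dataSize + 8 * offsets.length)
        .order(ByteOrder.nativeOrder());
    for (long offset : offsets) {
      buffer.putLong(offset);
    }
    for (byte[] d : data) {
      buffer.put((byte) d.length); // a varint below 128 is the value itself
      buffer.put(d);
    }
    buffer.rewind();
    return buffer;
  }

  public static void main(String[] args) {
    ByteBuffer buf = encode(new String[] {"A", "BB"});
    // 2 offsets * 8 bytes + (1 varint + 1 data) + (1 varint + 2 data) = 21 bytes
    System.out.println(buf.remaining()); // 21
    System.out.println(buf.getLong());   // offset of "A": 0
    System.out.println(buf.getLong());   // offset of "BB": 2
    System.out.println(buf.get());       // varint length of "A": 1
  }
}
```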

Thanks for the detailed description and the sample code, it is very much appreciated!

The create(Object) method call involves use of reflection to determine the shape and copy things over one array at a time, so it is pretty slow, especially as you add dimensions. The create(shape, FloatBuffer) method would be an order-of-magnitude faster. For example:

public void test() {
    Random r = new Random();
    int imageSize = 224 * 224 * 3;
    int batch = 128;
    long[] shape = new long[] {batch, imageSize};
    FloatBuffer buf = FloatBuffer.allocate(imageSize * batch);
    for (int i = 0; i < imageSize * batch; ++i) {
      buf.put(r.nextFloat());
    }
    buf.flip();

    long start = System.nanoTime();
    Tensor.create(shape, buf);
    long end = System.nanoTime();
    System.out.println("Took: " + (end - start));
}
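One possible further tweak (untested here, and whether it actually helps depends on how the JNI binding handles the buffer) is to fill a direct, native-order buffer, which native code may be able to read without an extra heap-to-native copy. Only the buffer allocation changes; the Tensor.create(shape, buf) call stays the same as above:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.util.Random;

public class DirectBufferDemo {
  public static FloatBuffer fillDirect(int batch, int imageSize) {
    // A direct buffer lives outside the Java heap, so native code can
    // address it in place. Native byte order ensures the floats are laid
    // out the way the C++ runtime expects them.
    FloatBuffer buf = ByteBuffer.allocateDirect(batch * imageSize * Float.BYTES)
        .order(ByteOrder.nativeOrder())
        .asFloatBuffer();
    Random r = new Random();
    for (int i = 0; i < batch * imageSize; ++i) {
      buf.put(r.nextFloat());
    }
    buf.flip();
    return buf;
  }

  public static void main(String[] args) {
    int batch = 128, imageSize = 224 * 224 * 3;
    FloatBuffer buf = fillDirect(batch, imageSize);
    // With the TensorFlow jar on the classpath, the tensor would then be
    // created exactly as in the example above:
    // Tensor t = Tensor.create(new long[] {batch, imageSize}, buf);
    System.out.println(buf.isDirect());  // true
    System.out.println(buf.remaining()); // batch * imageSize
  }
}
```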

This is still slower than I’d want it to be (I’ll have to dig into that), but hopefully it is enough to satisfy your current needs (and session execution should be faster than Python).

Regarding your other question: Yes, using feeds is slower than getting input from queues. While I believe we do have the primitives to enable the use of queues from any language, it is admittedly not too easy (as you have to figure out what is being done in Python, e.g., start threads that run the enqueue op, and duplicate that). Note that, as per the proposal in #7951, investing in queues for other languages might not be worthwhile at this stage.

Do let me know if using the FloatBuffer suffices for now (and I can close this issue, while we look into general performance improvements for the Java API).

By the way, all of this is currently being addressed in the new official Java repository for TensorFlow; there is an important project called “Tensor NIO” that I’m working on which will allow users to directly access tensor memory from Java and to read/write its data in an N-dimensional space. It will include what is found in the Buffers class of the previous example.

If anyone is interested in knowing more about it, just let me know.

@asimshankar this is really great stuff; the timings improved by exactly an order of magnitude, as you said. A batch of 128 now takes around 0.2 s and a batch of 32 around 0.04 s. I think that’s still a bit slower than Python, but I can definitely work with that, thanks!

Maybe you could add a note about this in the Javadoc? I suspected the reflection was taking a long time in Tensor.create(Object) but wasn’t sure whether it was the main cause.