tensorflow: Converting numpy array to TFRecord is slow

tf.train.FloatList and tf.train.Feature are slow for numpy arrays.

Saving numpy arrays with np.save and np.load is much faster than converting to TFRecord and reading it back. While profiling the code, I found that half of the time is spent in _floats_feature, with tf.train.FloatList alone taking about a third of the total time. How can this be sped up?

System information

  • Converting a numpy array with the snippet of code below is much slower than using np.save / np.load:
  • OS Platform and Distribution: Linux Ubuntu 16.04
  • TensorFlow version (use command below): 1.4.0
  • Python version: 2.7.12

Source code / logs

import tensorflow as tf
import numpy as np



def floatme(value):
    return tf.train.FloatList(value=value)

def _floats_feature(value):
    return tf.train.Feature(float_list=floatme(value))

tfr_filename = "deleteme.tfr"
# Synthetic input: 10,000 lines, each with 4,005 space-separated integers
# (4,004 data values followed by one label).
data = [" ".join(np.random.randint(0, 1000, size=4005).astype(str)) for i in range(10000)]
with tf.python_io.TFRecordWriter(tfr_filename) as writer:
    print('Converting to vectors')
    vectors = [np.fromstring(line, dtype=int, sep=' ', count=4004+1) for line in data]
    print('Converting to examples')
    for i, vec in enumerate(vectors):
        # Create an example protocol buffer
        example = tf.train.Example(features=tf.train.Features(feature={
            'label': _floats_feature([vec[4004], vec[4004]<1.0]),
            'data' : _floats_feature(vec[:4004]),
            }))
        writer.write(example.SerializeToString())


ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
232810  49.887   0        49.887   0        convert_train_dataset_tfrecord.py:76(floatme)
116405  20.095   0        20.095   0        {numpy.core.multiarray.fromstring}
232810  13.328   0        63.216   0        convert_train_dataset_tfrecord.py:79(_floats_feature)
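
For context (this workaround is not from the original report): the per-element cost of FloatList can be avoided entirely by storing the whole vector as a single raw-bytes feature, so only one protobuf field is built per example. A minimal sketch, assuming float32 data and the TF 1.x API used above:

import numpy as np
import tensorflow as tf

def _bytes_feature(value):
    # Wrap a single bytes object in a BytesList feature.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

vec = np.random.rand(4004).astype(np.float32)
example = tf.train.Example(features=tf.train.Features(feature={
    'data': _bytes_feature(vec.tobytes()),  # one field instead of 4,004 floats
}))

# At read time the bytes can be turned back into a float32 tensor with
# tf.decode_raw (tf.io.decode_raw in newer versions).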

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 8
  • Comments: 24 (19 by maintainers)

Most upvoted comments

The current cache implementation uses TFRecords TensorBundles, which are not great (performance-wise) for data reading and writing (and also don't support other things like indexing into a specific record, etc.). We are still thinking through a better file format internally and will provide updates when we think we have a better solution.

@harahu The cache(filename=...) transformation could indeed be used as a stopgap solution for serializing and deserializing data.
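
A minimal sketch of that stopgap, assuming TF 2.x eager execution (the dataset and cache path here are made up for illustration):

import tensorflow as tf

# Toy pipeline standing in for expensive preprocessing.
dataset = tf.data.Dataset.from_tensor_slices(tf.random.uniform([1000, 4004]))
dataset = dataset.map(lambda x: x * 2.0)

# The first full pass writes cache files next to this path; later epochs
# (or later runs) read from them instead of recomputing the pipeline.
dataset = dataset.cache(filename='/tmp/pipeline_cache')

for _ in dataset:  # populate the on-disk cache
    pass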

@areeh the prospective “save” and “load” functionality would work similarly to the rest of the tf.data transformations (in that it would support streaming of elements). In other words, it would not require that all of the data fit into memory.

@rohan100jain and @frankchn are working on a mechanism for persisting the outputs of (a prefix of) an input pipeline which needs to solve the same problem (efficiently serializing elements of a tf.data.Dataset).

I believe that their solution could be extended to provide “save” and “load” functionality, but I also expect that it might take some time to settle on a format for which backwards compatibility is provided (i.e. it might initially only be possible to “load” data that was “saved” using the same version of TensorFlow).
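
Functionality along these lines later shipped as tf.data.experimental.save and tf.data.experimental.load (from roughly TF 2.3), which the thread predates. A rough sketch, assuming that later API:

import tensorflow as tf

dataset = tf.data.Dataset.range(100).map(lambda x: tf.cast(x, tf.float32))
path = '/tmp/saved_dataset'  # hypothetical path

# Saving streams elements to disk shard by shard...
tf.data.experimental.save(dataset, path)

# ...and loading streams them back, so the data never has to fit in memory.
restored = tf.data.experimental.load(path, element_spec=dataset.element_spec)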

I’d like to add to this: it seems as though the instantiation of a tf.train.Features object takes a tremendous amount of time. A very simple example of timings on my machine:

import time
import tensorflow as tf

start = time.perf_counter()

for i in range(2000000):
    example = tf.train.Example()

time.perf_counter() - start

The instantiation of 2,000,000 examples with no features takes 0.76 seconds.

start = time.perf_counter()

for i in range(2000000):
    example = tf.train.Example()
    feature_1 = tf.train.Int64List(value=[10])
    feature_2 = tf.train.Int64List(value=[10])
        
time.perf_counter() - start

The instantiation of 2,000,000 examples and two tf.train.Int64List objects takes 5 seconds.

start = time.perf_counter()

for i in range(2000000):
    example = tf.train.Example()
    feature = tf.train.Feature(int64_list=tf.train.Int64List(value=[10]))
    label = tf.train.Feature(int64_list=tf.train.Int64List(value=[10]))
        
time.perf_counter() - start

The instantiation of 2,000,000 examples and two tf.train.Int64List features takes 11 seconds.

start = time.perf_counter()

for i in range(2000000):
    example = tf.train.Example(features = tf.train.Features(
            feature={
                'src': tf.train.Feature(int64_list=tf.train.Int64List(value=[10])),
                'dst': tf.train.Feature(int64_list=tf.train.Int64List(value=[10])),
            }
        )
    )
        
time.perf_counter() - start

The instantiation of 2,000,000 examples with two tf.train.Int64List features takes 41 seconds.

And finally, when I put it all together with a TFRecordWriter:

start = time.perf_counter()

with tf.python_io.TFRecordWriter('/mnt/data/repository/test.tfrecord') as writer:
    for i in range(2000000):
        example = tf.train.Example(features = tf.train.Features(
                feature={
                    'src': tf.train.Feature(int64_list=tf.train.Int64List(value=[10])),
                    'dst': tf.train.Feature(int64_list=tf.train.Int64List(value=[10])),
                }
            )
        )

        serialized = example.SerializeToString()
        writer.write(serialized)
        
time.perf_counter() - start

The final timing is 64 seconds for 2,000,000 records.

Unfortunately, when you have a dataset with 1 billion rows, that means it takes 8.8 hours just to convert it to TFRecord files.
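
One common mitigation, not proposed in this thread: since the cost measured above is Python-side proto construction, the conversion can be sharded across worker processes. A rough sketch, with hypothetical output paths and shard count:

import multiprocessing as mp
import tensorflow as tf

NUM_SHARDS = 16
TOTAL_RECORDS = 2000000

def write_shard(shard_id):
    # Each worker writes every NUM_SHARDS-th record into its own TFRecord file.
    path = '/mnt/data/repository/test-%05d.tfrecord' % shard_id
    with tf.python_io.TFRecordWriter(path) as writer:
        for i in range(shard_id, TOTAL_RECORDS, NUM_SHARDS):
            example = tf.train.Example(features=tf.train.Features(feature={
                'src': tf.train.Feature(int64_list=tf.train.Int64List(value=[10])),
                'dst': tf.train.Feature(int64_list=tf.train.Int64List(value=[10])),
            }))
            writer.write(example.SerializeToString())

if __name__ == '__main__':
    pool = mp.Pool(NUM_SHARDS)
    pool.map(write_shard, range(NUM_SHARDS))
    pool.close()
    pool.join()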

@jsimsa This sounds very promising, but I just want to mention that one use case for efficient serialization of tf.data is handling a dataset that is too large to fit in memory. If there is only a simple “save” and “load” command that attempts to load everything into memory, that use case is not covered. For me, the ideal solution would be something that can save/load batch by batch or similar, so that I can avoid memory issues. Not implying you didn’t already think of this, just trying to show interest in the specific part of the feature I would find most helpful.