tensorflow: Converting numpy array to TFRecord is slow
tf.train.FloatList and tf.train.Feature are slow for numpy arrays.
Saving and loading numpy arrays with np.save and np.load is much faster than converting them to TFRecord and reading them back. While profiling the code, I found that half of the time is spent in _floats_feature, and tf.train.FloatList alone accounts for about a third of it. How can this be sped up?
System information
- The snippet of code below, which converts numpy arrays, is much slower than np.save / np.load:
- **OS Platform and Distribution**: Linux Ubuntu 16.04
- TensorFlow version: 1.4.0
- Python version: 2.7.12
Source code / logs
```python
import tensorflow as tf
import numpy as np

def floatme(value):
    return tf.train.FloatList(value=value)

def _floats_feature(value):
    return tf.train.Feature(float_list=floatme(value))

tfr_filename = "deleteme.tfr"
data = [" ".join(np.random.randint(0, 1000, size=4005).astype(str)) for i in range(10000)]

with tf.python_io.TFRecordWriter(tfr_filename) as writer:
    print('Converting to vectors')
    vectors = [np.fromstring(line, dtype=int, sep=' ', count=4004 + 1) for line in data]
    print('Converting to examples')
    for i, vec in enumerate(vectors):
        # Create an example protocol buffer
        example = tf.train.Example(features=tf.train.Features(feature={
            'label': _floats_feature([vec[4004], vec[4004] < 1.0]),
            'data': _floats_feature(vec[:4004]),
        }))
        writer.write(example.SerializeToString())
```
ncalls | tottime | percall | cumtime | percall | filename:lineno(function) |
---|---|---|---|---|---|
232810 | 49.887 | 0 | 49.887 | 0 | convert_train_dataset_tfrecord.py:76(floatme) |
116405 | 20.095 | 0 | 20.095 | 0 | {numpy.core.multiarray.fromstring} |
232810 | 13.328 | 0 | 63.216 | 0 | convert_train_dataset_tfrecord.py:79(_floats_feature) |
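One common way to avoid the per-float protobuf cost (sketched here for illustration; it is not something proposed in this issue) is to store each vector as raw bytes in a single BytesList feature rather than as 4004 entries in a FloatList, and decode the bytes when parsing:

```python
import numpy as np
import tensorflow as tf

def _bytes_feature(value):
    # Wrap a raw byte string in a single-element BytesList feature.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

vec = np.random.rand(4004).astype(np.float32)

# One bytes feature per example instead of 4004 individual floats; the
# protobuf only copies one string, avoiding the per-element FloatList cost.
example = tf.train.Example(features=tf.train.Features(feature={
    'data': _bytes_feature(vec.tobytes()),
}))

# Read side (TF 1.x style): recover the float32 vector from the raw bytes.
# parsed = tf.parse_single_example(serialized, {'data': tf.FixedLenFeature([], tf.string)})
# vec_back = tf.decode_raw(parsed['data'], tf.float32)
```

The trade-off is that the dtype and shape are no longer encoded in the feature itself and must be known (or stored separately) at read time.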
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 8
- Comments: 24 (19 by maintainers)
The current `cache` implementation uses TFRecords/TensorBundles, which are not great (performance-wise) for data reading and writing (and also do not support other things like indexing into a specific record, etc.). We are still thinking through a better file format internally and will provide updates when we think we have a better solution.

@harahu The `cache(filename=...)` transformation could indeed be used as a stopgap solution for serializing and deserializing data.

@areeh the prospective "save" and "load" functionality would work similarly to the rest of the tf.data transformations (in that it would support streaming of elements). In other words, it would not require that all of the data fit into memory.
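As a concrete illustration of that stopgap (a minimal sketch, with a toy pipeline and an arbitrary cache path standing in for real preprocessing):

```python
import tensorflow as tf

# Hypothetical upstream pipeline; the expensive preprocessing would live here.
dataset = tf.data.Dataset.range(100).map(lambda x: x * 2)

# The first complete pass over the dataset writes its elements to files
# prefixed with "/tmp/my_cache"; subsequent passes (including later runs of
# the program) read the cached elements back instead of recomputing them.
# Note that the cache is only finalized after one full iteration.
dataset = dataset.cache(filename="/tmp/my_cache")
```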
@rohan100jain and @frankchn are working on a mechanism for persisting the outputs of (a prefix of) an input pipeline which needs to solve the same problem (efficiently serializing elements of a tf.data.Dataset).
I believe that their solution could be extended to provide “save” and “load” functionality, but I also expect that it might take some time to settle on a format for which backwards compatibility is provided (i.e. it might initially be only possible to “load” data which was “save” using the same version of TensorFlow).
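For reference, dataset save/load support did land in later TensorFlow releases; a rough sketch of how that API is used (names as in TF 2.x, shown only for context and not part of this thread):

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(100).map(lambda x: x * 2)

# Elements are streamed to disk; nothing requires the full dataset in memory.
tf.data.experimental.save(dataset, "/tmp/saved_dataset")

# Loading streams the elements back; element_spec describes their structure.
restored = tf.data.experimental.load(
    "/tmp/saved_dataset", element_spec=dataset.element_spec)
```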
I'd like to add to this: it seems as though the instantiation of a `tf.train.Features` object takes a tremendous amount of time. A very simple example of timings on my machine:

- The instantiation of 2,000,000 examples with no features takes 0.76 seconds.
- The instantiation of 2,000,000 examples and two `tf.train.Int64List` objects takes 5 seconds.
- The instantiation of 2,000,000 examples and two `tf.train.Int64List` features takes 11 seconds.
- The instantiation of 2,000,000 examples with two `tf.train.Int64List` features takes 41 seconds.

And finally, when I put it all together with a TFRecordWriter, the final timing is 64 seconds for 2,000,000 records.
Unfortunately, when you have a dataset with 1 billion rows, that means it takes about 8.8 hours just to convert it to TFRecord files.
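A minimal sketch of the kind of micro-benchmark being described (the actual feature contents used above are not shown in the thread, so the feature length here is an assumption and absolute numbers will differ):

```python
import time
import numpy as np
import tensorflow as tf

N = 2000000
values = np.arange(10, dtype=np.int64)  # assumed feature length; not stated above

start = time.time()
for _ in range(N):
    # Build (but do not write) one Example with two Int64List features.
    tf.train.Example(features=tf.train.Features(feature={
        'a': tf.train.Feature(int64_list=tf.train.Int64List(value=values)),
        'b': tf.train.Feature(int64_list=tf.train.Int64List(value=values)),
    }))
print('%.1f seconds for %d examples' % (time.time() - start, N))
```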
@jsimsa This sounds very promising, but I just want to mention that one use case for efficient serialization of tf.data is handling a dataset that is too large to fit in memory. If there is only a simple "save" and "load" command that attempts to load everything into memory, that use case is not covered. For me, the ideal solution would be something that can save/load batch by batch (or similar) so that I can avoid memory issues. Not implying you didn't already think of this, just trying to show interest in the specific part of the feature I would find most helpful.