tensorflow: ParseTensor (tf.io.parse_tensor) is not vectorized - Vectorizing via tf.vectorized_map uses while_loop

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution: Ubuntu 16.04
  • TensorFlow installed from: Binary
  • TensorFlow version: v2.3.0-rc2-23-gb36436b087 2.3.0
  • Python version: 3.8
  • CUDA/cuDNN version: 10.1
  • GPU model and memory: TITAN V, 12G VRAM

Describe the current behavior See the code below. I’m serializing a list of tensors and then attempting to parse them using (1) naive, single-record parsing and (2) batch-parsing via vectorized_map, which I expect to yield a significant performance increase.

But: tf.io.parse_tensor appears not to be implemented for vectorized parsing; I’m getting a WARNING:tensorflow:Using a while_loop for converting ParseTensor message and seeing little to no performance increase!

I find it very surprising that such an essential operation is not vectorized… how else would I parse non-scalar features from e.g. a TFRecord file? Meanwhile, tf.io.parse_example is vectorized.
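For contrast, here is a minimal sketch of the batched behavior tf.io.parse_example already provides: it accepts a whole batch of serialized tf.train.Example protos at once, with no per-element loop. (The feature name 'x' is just an illustration, not from the original report.)

```python
import tensorflow as tf

def make_example(i):
    # Build a serialized tf.train.Example holding a single int64 feature 'x'.
    feature = {'x': tf.train.Feature(int64_list=tf.train.Int64List(value=[i]))}
    return tf.train.Example(
        features=tf.train.Features(feature=feature)).SerializeToString()

# A batch of serialized Example protos, parsed in one vectorized call.
serialized = tf.constant([make_example(i) for i in range(4)])
parsed = tf.io.parse_example(
    serialized, {'x': tf.io.FixedLenFeature([], tf.int64)})
# parsed['x'] is a single tensor of shape (4,)
```

This is exactly the batched calling convention the issue asks for in tf.io.parse_tensor.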

Describe the expected behavior I would expect a vectorized version of tf.io.parse_tensor to yield a significant performance increase.

Standalone code to reproduce the issue

import time

import numpy as np
import tensorflow as tf

# This would normally come from some data stream, e.g. a stream of TFRecords
some_tensor_list = [np.zeros(shape=(5, 5), dtype=np.int32)] * 100000
some_tensor_list_serialized = [tf.io.serialize_tensor(x) for x in some_tensor_list]

# Feed to tf.data
dataset = tf.data.Dataset.from_tensor_slices(some_tensor_list_serialized)

# Parse a whole batch back to tensors
def parse_batch(b):
    return tf.vectorized_map(lambda x: tf.io.parse_tensor(x, out_type=tf.int32), b)

# Parse a single record back to a tensor
def parse_single(x):
    return tf.io.parse_tensor(x, out_type=tf.int32)

# Compare speed
def exhaust_iterable(it):
    t = time.time()
    for _ in it:
        pass
    print(f'{time.time() - t}s')

# naive: one record at a time
dataset_naive = dataset.map(parse_single)

exhaust_iterable(dataset_naive)

# "vectorized" over a batch
dataset_vec = dataset.batch(32)
dataset_vec = dataset_vec.map(parse_batch)

exhaust_iterable(dataset_vec)

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Reactions: 3
  • Comments: 16 (3 by maintainers)

Most upvoted comments

I am aware that I could loop over the batch and parse each tensor individually. I was hoping that I wouldn’t need to do that and could simply call tf.io.parse_tensor on a batch of tensors. It seems weird to me that you can parse a batch of examples with tf.io.parse_example but when it comes to parsing these tensors you need to loop over each tensor because it can’t handle a batch.

Is there a good reason why tf.io.parse_example is vectorized but not tf.io.parse_tensor ?
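The loop-over-the-batch fallback mentioned above can be sketched with tf.map_fn, which applies tf.io.parse_tensor to each element of the batch in graph mode. Note this is still a sequential per-element loop, not the true vectorization being requested (fn_output_signature requires TF 2.3+, the version reported in this issue).

```python
import tensorflow as tf

def parse_batch(serialized_batch):
    # Loop over the batch dimension, parsing one serialized tensor at a time.
    return tf.map_fn(
        lambda s: tf.io.parse_tensor(s, out_type=tf.int32),
        serialized_batch,
        fn_output_signature=tf.int32,  # dtype of each parsed element
    )

# A batch of 4 serialized (5, 5) int32 tensors.
batch = tf.stack([tf.io.serialize_tensor(tf.fill((5, 5), i)) for i in range(4)])
parsed = parse_batch(batch)  # shape (4, 5, 5)
```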

I agree with @sjang92 - this is still very much needed, especially for shallow models where IO is more of a bottleneck.

To summarize the request, please include a vectorized version of tf.io.parse_tensor

Just came across the same issue, I’m trying to do the following

  1. Load my tfrecord dataset
  2. Apply batch
  3. Parse the batch of tf.train.Examples
  4. Parse the batch of tensors

I’m unable to do 4. since tf.io.parse_tensor only accepts a single tensor and not a batch of tensors.

Here’s my code

import tensorflow as tf

ds = tf.data.TFRecordDataset('ds.tfrecords')
ds = ds.batch(24) # Without this the code works fine

description = {
    'c_encoded': tf.io.FixedLenFeature([], tf.string)
}

for examples in ds:
    # Parse examples in batch
    parsed_example = tf.io.parse_example(examples, description)
    # Fails here since it can only parse one tensor at a time
    c_encoded = tf.io.parse_tensor(parsed_example['c_encoded'], out_type=tf.float32)

Maybe I’m missing something, but what’s the point of being able to parse a batch of Examples if you can’t deserialize that batch of tensors? My options are either to loop through the batch and parse each tensor individually, which is less efficient, or to batch my data prior to building my tfrecords, which is less flexible if I want to quickly change the batch size.

Would be great to have a version of tf.io.parse_tensor that accepts a batch.
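Until such a batched version exists, one possible workaround (a sketch, not the requested op) is to parse each record before batching, so tf.io.parse_tensor only ever sees a single serialized tensor, and then batch the already-parsed tensors:

```python
import tensorflow as tf

# Stand-in for records read from a TFRecord file: 8 serialized (5, 5) tensors.
serialized = [tf.io.serialize_tensor(tf.fill((5, 5), i)) for i in range(8)]
ds = tf.data.Dataset.from_tensor_slices(serialized)

# Parse single records (possibly in parallel), *then* batch the results.
ds = ds.map(lambda s: tf.io.parse_tensor(s, out_type=tf.int32),
            num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds = ds.batch(4)

shapes = [tuple(b.shape) for b in ds]  # each batch is (4, 5, 5)
```

This trades the batched parse for parallel single-record parses; it keeps the batch size configurable at pipeline-build time, unlike batching before writing the TFRecords.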

+1 on the need for this

I am also interested in a vectorized version of tf.io.parse_tensor

@mhorlacher, Using tf.function has resulted in a significant performance increase. Please find the Gist. Thanks!