tensorflow: it's slow to train model when loading dataset and checkpoints from S3

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Amazon Linux, P2 Instance

TensorFlow installed from (source or binary): pip

TensorFlow version (use command below): 1.5.0.dev20171113

Python version: 2.7

CUDA/cuDNN version: CUDA 8.0 / cuDNN 6.0

GPU model and memory: Tesla K80 12GB

Describe the problem

The dataset and checkpoints are stored in S3 filesystem. It’s slowly when training model. Howevery, The structure of network is simple and the speed is fast when load dataset and checkpoints from local filesystem

Source code / logs

some code like

def input_fn(mode):
    """ Input callback for Estimator
    
    Arguments
    ----
    mode: str, tf.estimator.ModelKey
    file_pattern: tensorflow file pattern, refer to `tf.gfile.Glob`

    Return
    ---
    features: dict of Tensor, the input features for model
    label: single Tensor, the input label for model, must be integeral
    """
    is_train = tf.estimator.ModeKeys.TRAIN == mode

    vocab_dir = hparams.vocab_dir

    ds_dir = 'train' if is_train else 'test'
    file_pattern = os.path.join(hparams.dataset_dir, ds_dir, 'part-')

    tfrecords_files = [] 
    if file_pattern.startswith('s3'):
        tfrecords_files = list_s3_file(file_pattern)
    else:
        tfrecords_files = tf.gfile.Glob(file_pattern)

    example = create_example(vocab_dir)
    example_spec = tf.feature_column.make_parse_example_spec(feature_columns=example)

    batch_size = hparams.batch_size
    num_epochs = hparams.num_epochs if is_train else 1
    example_parsed = tf.contrib.learn.read_batch_record_features(file_pattern=tfrecords_files,
                            batch_size=batch_size,
                            num_epochs=num_epochs,
                            features=example_spec)

    label = example_parsed.pop('label')
    features = example_parsed

    return features, label

The log


INFO:tensorflow:Into main function
INFO:tensorflow:vocab_dir: s3://experiements/yajun/youtube-match/data/raw/vocab
INFO:tensorflow:Create Estimator
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_tf_random_seed': None, '_task_type': 'worker', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7efcfa4bff90>, '_model_dir': 's3://experiements/yajun/youtube-match/data/model', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_master': '', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_save_summary_steps': 100, '_num_ps_replicas': 0}
INFO:tensorflow:Begin to train model
INFO:tensorflow:List files in S3 according to the file pattern: s3://experiements/yajun/youtube-match/data/raw/train/part-
INFO:tensorflow:Create CheckpointSaverHook.
2017-11-15 03:03:41.798030: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-11-15 03:03:45.623759: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-11-15 03:03:45.624258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1031] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.11GiB
2017-11-15 03:03:45.624285: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
INFO:tensorflow:Restoring parameters from s3://experiements/yajun/youtube-match/data/model/model.ckpt-1400
^[[O^[[I
^[[OINFO:tensorflow:Saving checkpoints for 1401 into s3://experiements/yajun/youtube-match/data/model/model.ckpt.
INFO:tensorflow:loss = 4.90536, step = 1401
INFO:tensorflow:global_step/sec: 188.871
INFO:tensorflow:loss = 7.86743, step = 1501 (0.530 sec)
INFO:tensorflow:global_step/sec: 219.692
INFO:tensorflow:loss = 2.34025, step = 1601 (0.455 sec)
INFO:tensorflow:global_step/sec: 224.707
INFO:tensorflow:loss = 2.32145, step = 1701 (0.445 sec)
2017-11-15 03:14:24.980234: W tensorflow/core/framework/op_kernel.cc:1192] Out of range: FIFOQueue '_4_dequeue_record_examples/fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
	 [[Node: dequeue_record_examples/fifo_queue_Dequeue = QueueDequeueV2[component_types=[DT_INT64, DT_INT64, DT_STRING, DT_INT64, DT_STRING], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](dequeue_record_examples/fifo_queue)]]
INFO:tensorflow:Saving checkpoints for 1750 into s3://experiements/yajun/youtube-match/data/model/model.ckpt.
INFO:tensorflow:Loss for final step: 6.49392.
INFO:tensorflow:Begin to export model to s3://experiements/yajun/youtube-match/data/model/exports
2017-11-15 03:14:29.600289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
INFO:tensorflow:Restoring parameters from s3://experiements/yajun/youtube-match/data/model/model.ckpt-1750
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: s3://experiements/yajun/youtube-match/data/model/exports/temp-1510715669/saved_model.pb
INFO:tensorflow:Finish to run

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Comments: 17 (17 by maintainers)

Most upvoted comments

I’ve also noticed this issue when running Tensorboard 1.8.0 with an s3:// path for the logdir. It takes a long time (several minutes) to read .tfevents files that are less than 50 MB.

Here’s a way I think this can be reproduced:

  1. Put a TF events file from a training run in S3. I used one that was ~43 MB
  2. Run the following Python code:
from tensorboard.backend.event_processing.event_file_loader import EventFileLoader
loader = EventFileLoader('s3://path/to/tfevents/file')
generator = loader.Load()
for event in g:
  continue

This took around 50 minutes to complete with the official TensorFlow 1.8.0 installed via pip. The slowdown is also apparent if you point Tensoboard to a logdir in S3 - the graphs load much more slowly when reading from S3.

I was looking at the S3 filesystem code and noticed an extra data copy in S3RandomAccessFile::Read(). I removed this and rebuilt TensorFlow. Afterwards, Tensorboard seems to load much faster from S3. This change should also help when loading datasets and checkpoints from S3.

PR here: https://github.com/tensorflow/tensorflow/pull/20034