tensorflow: S3 reads eventually fail with tensorflow's dataset API

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04.3 LTS
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v1.4.0-rc1-11-g130a514
  • Python version: Python 2.7.12
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: CPU
  • GPU model and memory: N/A
  • Exact command to reproduce: https://gist.github.com/thunterdb/d5f86c79457eea0f1021117ea4bce0ba

Describe the problem

The given script reads the MNIST data from S3 repeatedly from a public bucket, on an EC2 machine in the same region as the S3 bucket.

Running this script eventually leads to python crashing after a few hours, and the following error:

Finished step 31890, time: 05:45:28 12/07/17, result: 32
Finished step 31920, time: 05:46:01 12/07/17, result: 32
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
Aborted

Source code / logs

https://gist.github.com/thunterdb/d5f86c79457eea0f1021117ea4bce0ba

It is hard to say what is happening without debugging symbols. From the very generic error, I suspect this is an issue with the S3 SDK. As a workaround, it would be nice for tensorflow’s S3 plugin to retry or be more resilient to networking issues.

It is currently an annoying issue when running large distributed training jobs, because they crash after a few hours of reading data. Is anyone having a similar experience with S3?

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Comments: 16 (10 by maintainers)

Most upvoted comments

@thunterdb @yongtang We’re running into the same issue. Is there a workaround or are we missing a config param somewhere?