webdataset: gsutil cat intermittently fails

I keep getting the following error message randomly when using webdataset with gsutil cat. I have num_workers=4 in the dataloader, so it seems unlikely to be too many requests. Any suggestions? It would be nice to at least get a more useful error message.

Exception: ("((' gsutil -o GSUtil:parallel_process_count=1 -o GSUtil:parallel_thread_count=1 cat gs://dc-erik-data/210228_213525/data/v3/train/20.tar',), {'shell': True, 'bufsize': 8192}): exit 1 (read) {'nodeinfo': ('erik-dc', 7595), 'rank': -1, 'size': -1, 'worker_id': 0, 'num_workers': 4} @ <Pipe ((' gsutil -o GSUtil:parallel_process_count=1 -o GSUtil:parallel_thread_count=1 cat gs://dc-erik-data/210228_213525/data/v3/train/20.tar',), {'shell': True, 'bufsize': 8192})>", <webdataset.gopen.Pipe object at 0x7fce1cc86e80>, {'__url__': 'pipe: gsutil -o GSUtil:parallel_process_count=1 -o GSUtil:parallel_thread_count=1 cat gs://dc-erik-data/210228_213525/data/v3/train/20.tar', '__worker__': '(0, 4)', '__rank__': 'None', '__nodeinfo__': "('erik-dc', 7595)"})

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 2
  • Comments: 32 (2 by maintainers)

Most upvoted comments

I would strongly advise against training against Google Cloud directly from outside Google’s compute clusters. Google’s egress costs are high, so you are costing either yourself or someone else a lot of money, and any dataset you train against like that may simply go away. In addition, WAN connections to Google Cloud Storage just don’t seem entirely reliable.

Unless you are training on Google Compute, just enable caching in WebDataset: you’ll save a lot of money and you won’t have the reliability problems.

I suggest trying gsutil cp your_file - instead of gsutil cat and then playing with the retry option -c mentioned at https://cloud.google.com/storage/docs/gsutil/commands/cp
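For anyone who wants to combine the two suggestions above (local caching, and cp with retries instead of streaming through cat), here is a rough sketch using only the Python standard library. It is not part of WebDataset or gsutil; `fetch_shard`, its parameters, and the `.partial` naming are hypothetical names made up for illustration. It assumes shard basenames are unique and that each shard fits on local disk.

```python
import os
import subprocess
import time


def fetch_shard(url, cache_dir, copy_cmd=("gsutil", "cp"), retries=3, backoff=1.0):
    """Copy `url` into `cache_dir`, retrying failed copies, and return the local path.

    Hypothetical helper: `copy_cmd` is run as `copy_cmd + (url, destination)`.
    """
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, os.path.basename(url))
    if os.path.exists(local_path):
        return local_path  # cache hit: no network access needed

    # Per-process temp name so concurrent workers do not clobber each other.
    tmp_path = f"{local_path}.{os.getpid()}.partial"
    stderr = ""
    for attempt in range(1, retries + 1):
        result = subprocess.run([*copy_cmd, url, tmp_path], capture_output=True, text=True)
        if result.returncode == 0:
            # Rename only after a complete copy, so readers never see a partial shard.
            os.replace(tmp_path, local_path)
            return local_path
        stderr = result.stderr.strip()
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        if attempt < retries:
            time.sleep(backoff * 2 ** (attempt - 1))  # exponential backoff
    raise RuntimeError(f"copy of {url} failed after {retries} attempts: {stderr}")
```

The returned local paths can then be handed to the dataset in place of the `pipe:` URLs, so a dropped connection costs one retried copy rather than the whole training run. The trade-off is that a shard is only readable once it has been copied in full, unlike the streaming `gsutil cat` pipe.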

I’m also having issues reading datasets from gs buckets with wds via gopen/pipe (currently on main branch 0.2.3).

I’m training on TPU VM instances, 8 train processes, 4 workers per process on ImageNet-1k/22k and similar large(ish) datasets. Within a few hours (usually within 2 or so epochs of 22k) I get a gsutil cat failing and killing the train session.

I was hoping to switch from TFDS to WDS for cloud training because of the simplicity of WDS and its far fewer dependencies (i.e. not needing all of tensorflow + tfds) when I’m using PyTorch. With TFDS, on the same machine types and with the same number of worker processes, I have a working TFDS wrapper that has hundreds of training days with maybe one or two instances of gs read failures that aborted training (and very few retries logged). So, while the failure clearly isn’t in WDS code, there seem to be reliability issues (or a lack of failure handling) when using gsutil cat that aren’t happening in TFDS.

Has anyone experiencing this issue come up with a robust solution?

The error I’m typically seeing is

File "/snap/google-cloud-sdk/229/platform/gsutil_py2/gslib/vendored/oauth2client/oauth2client/transport.py", line 282, in request
  connection_type=connection_type)
File "/snap/google-cloud-sdk/229/platform/gsutil_py2/third_party/httplib2/python2/httplib2/__init__.py", line 2192, in request
  cachekey,
File "/snap/google-cloud-sdk/229/platform/gsutil_py2/gslib/gcs_json_media.py", line 453, in OverrideRequest
  headers)
File "/snap/google-cloud-sdk/229/platform/gsutil_py2/gslib/gcs_json_media.py", line 668, in _conn_request
  response = conn.getresponse()
File "/snap/google-cloud-sdk/229/platform/gsutil_py2/gslib/gcs_json_media.py", line 369, in getresponse
  orig_response = http_client.HTTPConnection.getresponse(self)
File "/usr/lib/python2.7/httplib.py", line 1178, in getresponse
  response.begin()
File "/usr/lib/python2.7/httplib.py", line 452, in begin
  version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 416, in _read_status
  raise BadStatusLine("No status line received - the server has closed the connection")
httplib.BadStatusLine: No status line received - the server has closed the connection

and then

File "/home/xxxx/.local/lib/python3.8/site-packages/webdataset/gopen.py", line 86, in read
  self.check_status()
File "/home/xxxx/.local/lib/python3.8/site-packages/webdataset/gopen.py", line 66, in check_status
  self.wait_for_child()
File "/home/xxxx/.local/lib/python3.8/site-packages/webdataset/gopen.py", line 81, in wait_for_child
  raise Exception(f"{self.args}: exit {self.status} (read) {info}")
Exception: ('(("gsutil cat \'gs://xxxx/imagenet22k-train-0082.tar\'",), {\'shell\': True, \'bufsize\': 8192}): exit 1 (read) {} @ <Pipe (("gsutil cat \'gs://xxxx/imagenet22k-train-0082.tar\'",), {\'shell\': True, \'bufsize\': 8192})>', <webdataset.gopen.Pipe object at 0x7f15155425e0>, 'gs://xxxx/imagenet22k-train-0082.tar')