ray: [Datasets] [Bug] Access error when reading public data from S3 if no local AWS credentials are configured

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Others

What happened + What you expected to happen

I ran the example on this page: https://www.ray.io/ray-datasets

In particular, this snippet:

import ray
 
# read parquet from S3
parquet_path = "s3://ursa-labs-taxi-data/2019/06/data.parquet"
ds = ray.data.read_parquet(parquet_path)

It failed with

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-2-db174e281e4a> in <module>
      1 parquet_path = "s3://ursa-labs-taxi-data/2019/06/data.parquet"
----> 2 ds = ray.data.read_parquet(parquet_path)

~/opt/anaconda3/lib/python3.7/site-packages/ray/data/read_api.py in read_parquet(paths, filesystem, columns, parallelism, ray_remote_args, **arrow_parquet_args)
    219         columns=columns,
    220         ray_remote_args=ray_remote_args,
--> 221         **arrow_parquet_args)
    222 
    223 

~/opt/anaconda3/lib/python3.7/site-packages/ray/data/read_api.py in read_datasource(datasource, parallelism, ray_remote_args, **read_args)
    149     """
    150 
--> 151     read_tasks = datasource.prepare_read(parallelism, **read_args)
    152 
    153     def remote_read(task: ReadTask) -> Block:

~/opt/anaconda3/lib/python3.7/site-packages/ray/data/datasource/parquet_datasource.py in prepare_read(self, parallelism, paths, filesystem, columns, schema, **reader_args)
     38 
     39         paths, file_infos, filesystem = _resolve_paths_and_filesystem(
---> 40             paths, filesystem)
     41         file_sizes = [file_info.size for file_info in file_infos]
     42 

~/opt/anaconda3/lib/python3.7/site-packages/ray/data/datasource/file_based_datasource.py in _resolve_paths_and_filesystem(paths, filesystem)
    195     file_infos = []
    196     for path in resolved_paths:
--> 197         file_info = filesystem.get_file_info(path)
    198         if file_info.type == FileType.Directory:
    199             paths, file_infos_ = _expand_directory(path, filesystem)

~/opt/anaconda3/lib/python3.7/site-packages/pyarrow/_fs.pyx in pyarrow._fs.FileSystem.get_file_info()

~/opt/anaconda3/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/opt/anaconda3/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

OSError: When getting information for key '2019/06/data.parquet' in bucket 'ursa-labs-taxi-data': AWS Error [code 15]: No response body.

Versions / Dependencies

  • Ray: 1.6.0
  • PyArrow: 4.0.1
  • Python: 3.7.4
  • OS: macOS 10.15.7

Reproduction script

Included above

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 25 (24 by maintainers)

Most upvoted comments

I followed up with @dmatrix offline and currently everything is working in his environment, but he will let us know if the error happens again.

Right now, I can only repro the problem in one setting, which is the CI (https://github.com/ray-project/ray/pull/26482). I will keep digging to see if I can find out more.

It looks like read_parquet now works without credentials. I’m not 100% sure, though: I tried to remove all credentials as carefully as possible, but I can’t be certain that none are left.
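One way to rule out leftover local credentials when testing is to request anonymous access explicitly instead of relying on an empty environment. The sketch below is only an illustration: `read_public_parquet` is a hypothetical helper name, the `us-east-2` region in the usage comment is an assumption about where the taxi bucket lives, and depending on the Ray version the path may need to be passed without the `s3://` prefix when a filesystem is supplied.

```python
def read_public_parquet(path, region=None):
    """Read a public Parquet file from S3 without using any local credentials.

    Hypothetical helper for illustration; not part of Ray or PyArrow.
    """
    # Imported lazily so the helper can be defined where Ray is not installed.
    import ray
    from pyarrow import fs

    # anonymous=True makes Arrow send unsigned requests and skip credential lookup.
    s3_kwargs = {"anonymous": True}
    if region is not None:
        s3_kwargs["region"] = region
    filesystem = fs.S3FileSystem(**s3_kwargs)

    # read_parquet accepts an explicit filesystem (see its signature in the traceback).
    return ray.data.read_parquet(path, filesystem=filesystem)


# Usage (needs network access; the region value is an assumption):
# ds = read_public_parquet(
#     "s3://ursa-labs-taxi-data/2019/06/data.parquet", region="us-east-2"
# )
```

If this call succeeds, the read cannot have depended on credentials found on the machine.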

However, this does not currently seem to work for ray.data.read_binary_files.

EDIT: This was wrong. ray.data.read_binary_files works on the taxi dataset without credentials too. It doesn’t work on one of our own datasets, even though that dataset is publicly accessible (e.g. via https). Some bucket policy might be configured incorrectly.

EDIT: We figured it out. The bucket also needs to allow the action "s3:ListBucket" for the principal "*"; before, it only allowed "s3:GetObject" and "s3:GetObjectVersion". After the change, access works without credentials.
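For reference, a bucket policy along those lines could look roughly like the output of the sketch below. This is not the actual policy from our bucket: the bucket name `my-public-dataset`, the `Sid` values, and the helper `public_read_policy` are placeholders made up for illustration. Note that "s3:ListBucket" is granted on the bucket ARN, while the object actions are granted on the keys under it.

```python
import json


def public_read_policy(bucket):
    """Build an S3 bucket policy that lets anyone list and read the bucket.

    Hypothetical helper; the bucket name is supplied by the caller.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Sid": "PublicList",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Reading applies to the objects inside the bucket.
                "Sid": "PublicRead",
                "Effect": "Allow",
                "Principal": "*",
                "Action": ["s3:GetObject", "s3:GetObjectVersion"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


# Print the policy as JSON, ready to paste into the bucket's policy editor.
print(json.dumps(public_read_policy("my-public-dataset"), indent=2))
```

Keep in mind that a public "s3:ListBucket" grant lets anyone enumerate the object keys, so it only makes sense for buckets that are meant to be fully public.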