ray: [Datasets] [Bug] Access error when reading public data from S3 if no local AWS credentials are configured
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Others
What happened + What you expected to happen
I ran the example on this page https://www.ray.io/ray-datasets
In particular
import ray
# read parquet from S3
parquet_path = "s3://ursa-labs-taxi-data/2019/06/data.parquet"
ds = ray.data.read_parquet(parquet_path)
It failed with
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-2-db174e281e4a> in <module>
1 parquet_path = "s3://ursa-labs-taxi-data/2019/06/data.parquet"
----> 2 ds = ray.data.read_parquet(parquet_path)
~/opt/anaconda3/lib/python3.7/site-packages/ray/data/read_api.py in read_parquet(paths, filesystem, columns, parallelism, ray_remote_args, **arrow_parquet_args)
219 columns=columns,
220 ray_remote_args=ray_remote_args,
--> 221 **arrow_parquet_args)
222
223
~/opt/anaconda3/lib/python3.7/site-packages/ray/data/read_api.py in read_datasource(datasource, parallelism, ray_remote_args, **read_args)
149 """
150
--> 151 read_tasks = datasource.prepare_read(parallelism, **read_args)
152
153 def remote_read(task: ReadTask) -> Block:
~/opt/anaconda3/lib/python3.7/site-packages/ray/data/datasource/parquet_datasource.py in prepare_read(self, parallelism, paths, filesystem, columns, schema, **reader_args)
38
39 paths, file_infos, filesystem = _resolve_paths_and_filesystem(
---> 40 paths, filesystem)
41 file_sizes = [file_info.size for file_info in file_infos]
42
~/opt/anaconda3/lib/python3.7/site-packages/ray/data/datasource/file_based_datasource.py in _resolve_paths_and_filesystem(paths, filesystem)
195 file_infos = []
196 for path in resolved_paths:
--> 197 file_info = filesystem.get_file_info(path)
198 if file_info.type == FileType.Directory:
199 paths, file_infos_ = _expand_directory(path, filesystem)
~/opt/anaconda3/lib/python3.7/site-packages/pyarrow/_fs.pyx in pyarrow._fs.FileSystem.get_file_info()
~/opt/anaconda3/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
~/opt/anaconda3/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
OSError: When getting information for key '2019/06/data.parquet' in bucket 'ursa-labs-taxi-data': AWS Error [code 15]: No response body.
Versions / Dependencies
Ray: 1.6.0, PyArrow: 4.0.1, Python: 3.7.4, OS: macOS 10.15.7
Reproduction script
Included above
Anything else
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
About this issue
- State: closed
- Created 3 years ago
- Comments: 25 (24 by maintainers)
Commits related to this issue
- [Datasets] Add clearer actionable error message for AWS S3 credential error (#26619) In https://github.com/ray-project/ray/issues/19799, and https://github.com/ray-project/ray/issues/24184, we found ... — committed to ray-project/ray by c21 2 years ago
I followed up with @dmatrix offline and currently everything is working in his environment, but he will let us know if the error happens again.
Right now, I can only repro the problem in one setting, which is the CI (https://github.com/ray-project/ray/pull/26482). I will keep digging to see if I can find out more.
It looks like this is now working without credentials for read_parquet (possibly, I'm not 100% sure: I tried to be as careful as possible to remove all credentials, but I can't be sure there are none left).
This is, however, not currently working for ray.data.read_binary_files, it seems.
EDIT: This was wrong. ray.data.read_binary_files works on the taxi dataset without credentials too. It doesn't work on one of our own datasets even though it is publicly accessible (e.g. via https). Some bucket policy might be configured incorrectly.
EDIT: We figured it out now. The bucket also needs to allow the action "s3:ListBucket" for the principal "*"; before, it only had "s3:GetObject" and "s3:GetObjectVersion". After the change, access works without credentials.
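For anyone hitting the same thing on their own bucket, here is a rough sketch of what a policy with the extra action could look like. This is not the actual policy from our bucket: the helper name public_read_policy, the Sid values, and the bucket name "my-public-dataset" are made up for illustration. The one detail worth noting is that "s3:ListBucket" is granted on the bucket ARN itself, while "s3:GetObject" and "s3:GetObjectVersion" are granted on the objects under it.

```python
import json


def public_read_policy(bucket: str) -> dict:
    """Build a bucket policy that lets anonymous users list and read objects.

    Hypothetical helper for illustration; adapt before applying to a real bucket.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so the resource is the bucket ARN.
                "Sid": "PublicList",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Reads are object-level actions, so the resource is every key in the bucket.
                "Sid": "PublicRead",
                "Effect": "Allow",
                "Principal": "*",
                "Action": ["s3:GetObject", "s3:GetObjectVersion"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


if __name__ == "__main__":
    # "my-public-dataset" is a placeholder bucket name.
    print(json.dumps(public_read_policy("my-public-dataset"), indent=2))
```

The printed JSON is in the shape a bucket policy document takes. Only make a bucket listable by everyone if the data is really meant to be public.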