ray: [Tune] Configuring Tune with Oracle Cloud S3-like object storage fails to upload empty marker files
What happened + What you expected to happen
I’m using non-AWS, S3-compatible storage to save the training outputs, so in the SyncConfig it’s necessary to pass endpoint_override as part of the upload_dir:
run_config=air.RunConfig(
    local_dir='checkpoint',
    sync_config=tune.SyncConfig(
        upload_dir='s3://bucket-test/ray/tune/checkpoints/test?scheme=http&endpoint_override=localhost:9000',
        syncer='auto'
    ),
    stop={"training_iteration": 10}
),
The problem is that when the checkpoints are saved, the endpoint_override setting is not preserved, and the following error is raised:
(Classification pid=30834) Exception in thread Thread-3 (_try_fn):
(Classification pid=30834) Traceback (most recent call last):
(Classification pid=30834) File "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
(Classification pid=30834) self.run()
(Classification pid=30834) File "/usr/local/lib/python3.10/threading.py", line 953, in run
(Classification pid=30834) self._target(*self._args, **self._kwargs)
(Classification pid=30834) File "/home/vscode/.local/lib/python3.10/site-packages/ray/tune/utils/util.py", line 138, in _try_fn
(Classification pid=30834) fn()
(Classification pid=30834) File "/home/vscode/.local/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 589, in <lambda>
(Classification pid=30834) lambda: checkpoint.to_uri(checkpoint_uri),
(Classification pid=30834) File "/home/vscode/.local/lib/python3.10/site-packages/ray/air/checkpoint.py", line 714, in to_uri
(Classification pid=30834) upload_to_uri(local_path=local_path, uri=uri)
(Classification pid=30834) File "/home/vscode/.local/lib/python3.10/site-packages/ray/air/_internal/remote_storage.py", line 220, in upload_to_uri
(Classification pid=30834) pyarrow.fs.copy_files(local_path, bucket_path, destination_filesystem=fs)
(Classification pid=30834) File "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/fs.py", line 267, in copy_files
(Classification pid=30834) _copy_files_selector(source_fs, source_sel,
(Classification pid=30834) File "pyarrow/_fs.pyx", line 1619, in pyarrow._fs._copy_files_selector
(Classification pid=30834) File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
(Classification pid=30834) OSError: When initiating multiple part upload for key 'ray/tune/checkpoints/test/.is_checkpoint' in bucket 'bucket-test': AWS Error NO_SUCH_BUCKET during CreateMultipartUpload operation: The specified bucket does not exist
The upload problem can be worked around with the sync_up function of a custom Syncer, but even with a custom sync_down it is not possible to resume an experiment: the same endpoint_override problem occurs, and the client tries to connect to the default amazonaws URL instead.
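For reference, pyarrow itself resolves these query parameters when given the full URI, which is why the expectation is that Ray preserves them end-to-end. A quick check (assuming pyarrow 11 and valid S3 credentials in the environment) would look like this:
import pyarrow.fs

# The query string is parsed by pyarrow into S3FileSystem options
# (scheme and endpoint_override), so fs points at http://localhost:9000
# and path is 'bucket-test/ray/tune/checkpoints/test'.
fs, path = pyarrow.fs.FileSystem.from_uri(
    "s3://bucket-test/ray/tune/checkpoints/test"
    "?scheme=http&endpoint_override=localhost:9000"
)
print(type(fs), path)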
I think this issue may be related to https://github.com/ray-project/ray/issues/29845 and https://github.com/ray-project/ray/pull/30125.
Versions / Dependencies
ray==2.3.1 pyarrow==11.0.0
Reproduction script
For testing I used the MinIO container image quay.io/minio/minio (https://min.io/docs/minio/container/index.html).
tuner = tune.Tuner(
    tune.with_resources(
        tune.with_parameters(Classification),
        resources={"cpu": 2, "gpu": gpus_per_trial}
    ),
    run_config=air.RunConfig(
        local_dir='checkpoint',
        sync_config=tune.SyncConfig(
            upload_dir='s3://bucket-test/ray/tune/checkpoints/test?scheme=http&endpoint_override=localhost:9000',
            syncer=CustomSyncer()
        ),
        stop={"training_iteration": 10}
    ),
    tune_config=tune.TuneConfig(
        metric="loss",
        mode="min",
        scheduler=scheduler,
        num_samples=num_samples,
    ),
    param_space=config,
)
results = tuner.fit()
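The CustomSyncer referenced above is the boto3-based syncer mentioned later in the thread. A minimal sketch of what it could look like, assuming an S3-compatible endpoint at http://localhost:9000 and credentials supplied through the usual AWS_* environment variables (the exact implementation is not shown in the thread, so names and details here are illustrative only):
import os
from typing import List, Optional
from urllib.parse import urlparse

import boto3
from ray.tune.syncer import Syncer


class CustomSyncer(Syncer):
    def __init__(self, endpoint_url: str = "http://localhost:9000",
                 sync_period: float = 300.0):
        super().__init__(sync_period=sync_period)
        self.endpoint_url = endpoint_url
        self._s3 = None

    def _client(self):
        # Create the boto3 client lazily so the syncer stays picklable
        # when Tune ships it to remote trial actors.
        if self._s3 is None:
            self._s3 = boto3.client("s3", endpoint_url=self.endpoint_url)
        return self._s3

    @staticmethod
    def _bucket_and_prefix(remote_dir: str):
        # Drop the ?scheme=...&endpoint_override=... query string and split
        # the URI into bucket name and key prefix.
        parsed = urlparse(remote_dir)
        return parsed.netloc, parsed.path.lstrip("/")

    def sync_up(self, local_dir: str, remote_dir: str,
                exclude: Optional[List] = None) -> bool:
        bucket, prefix = self._bucket_and_prefix(remote_dir)
        for root, _, files in os.walk(local_dir):
            for name in files:
                local_path = os.path.join(root, name)
                key = f"{prefix}/{os.path.relpath(local_path, local_dir)}"
                self._client().upload_file(local_path, bucket, key)
        return True

    def sync_down(self, remote_dir: str, local_dir: str,
                  exclude: Optional[List] = None) -> bool:
        bucket, prefix = self._bucket_and_prefix(remote_dir)
        paginator = self._client().get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                rel_path = os.path.relpath(obj["Key"], prefix)
                local_path = os.path.join(local_dir, rel_path)
                os.makedirs(os.path.dirname(local_path), exist_ok=True)
                self._client().download_file(bucket, obj["Key"], local_path)
        return True

    def delete(self, remote_dir: str) -> bool:
        bucket, prefix = self._bucket_and_prefix(remote_dir)
        paginator = self._client().get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
            if keys:
                self._client().delete_objects(
                    Bucket=bucket, Delete={"Objects": keys}
                )
        return True
Stripping the query string before extracting the bucket and prefix is what avoids the amazonaws fallback, since the endpoint is configured directly on the boto3 client.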
Resume an experiment:
from ray import tune

tuner = tune.Tuner.restore(
    "s3://bucket-test/ray/tune/checkpoints/test?scheme=http&endpoint_override=localhost:9000",
    resume_errored=True
)
tuner.fit()
Issue Severity
High: It blocks me from completing my task.
About this issue
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 16 (9 by maintainers)
Great to hear! It should be packaged with Ray 2.4, which should be released very soon (within 1 week)!
I will leave this thread open for now, since the underlying issue is not fixed yet.
I made a custom Syncer using boto3 for sync_up and sync_down of the outputs, and it worked. Do you have a date for when the storage changes that are in the nightly version will be released?
Thank you for your attention and help!
Got it. This does seem to be an Oracle Cloud issue, specifically in the way pyarrow interacts with it. pyarrow creates this 0-length upload operation, which works for actual AWS/MinIO/moto, but apparently not for Oracle Cloud. The error we’re seeing is the one caught within pyarrow.
I have a couple of ideas in mind to remove the need for this empty marker file on Tune’s side, but this may not be implemented/fixed for some time.
Workaround
Here is a workaround you could try for now (I don’t have immediate access to test on Oracle Cloud) to get around using pyarrow:
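One possible shape for such a workaround (illustrative only, not necessarily the exact suggestion; the class name, endpoint, and use of the AWS CLI are assumptions) is a syncer that shells out to aws s3 sync with --endpoint-url, bypassing pyarrow entirely:
import subprocess
from typing import List, Optional

from ray.tune.syncer import Syncer


class CliSyncer(Syncer):
    """Illustrative syncer that avoids pyarrow by calling the AWS CLI."""

    def __init__(self, endpoint_url: str = "http://localhost:9000",
                 sync_period: float = 300.0):
        super().__init__(sync_period=sync_period)
        self.endpoint_url = endpoint_url

    def _clean(self, remote_dir: str) -> str:
        # Drop the ?scheme=...&endpoint_override=... suffix; the endpoint is
        # passed explicitly through --endpoint-url instead.
        return remote_dir.split("?", 1)[0]

    def sync_up(self, local_dir: str, remote_dir: str,
                exclude: Optional[List] = None) -> bool:
        subprocess.check_call(
            ["aws", "s3", "sync", local_dir, self._clean(remote_dir),
             "--endpoint-url", self.endpoint_url]
        )
        return True

    def sync_down(self, remote_dir: str, local_dir: str,
                  exclude: Optional[List] = None) -> bool:
        subprocess.check_call(
            ["aws", "s3", "sync", self._clean(remote_dir), local_dir,
             "--endpoint-url", self.endpoint_url]
        )
        return True

    def delete(self, remote_dir: str) -> bool:
        subprocess.check_call(
            ["aws", "s3", "rm", self._clean(remote_dir), "--recursive",
             "--endpoint-url", self.endpoint_url]
        )
        return True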
Let me know if you have any questions about that suggestion.
@logannas This looks like an AWS credentials error - did you authenticate with your AWS account on all nodes in the Ray cluster? Individual trials (running on worker nodes) will attempt to upload their checkpoints to the cloud directly.
You can configure the “command line access” environment variables in a Ray runtime environment as a quick fix to the issue. This will ship the environment variables to all actors/tasks running in the cluster.
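As a quick illustration (the placeholder values are not from the thread), the credentials can be shipped to every worker like this:
import ray

# Placeholder values; use the access key / secret key for your
# S3-compatible store. The runtime environment propagates these
# environment variables to every actor and task in the cluster.
ray.init(
    runtime_env={
        "env_vars": {
            "AWS_ACCESS_KEY_ID": "<access-key>",
            "AWS_SECRET_ACCESS_KEY": "<secret-key>",
            "AWS_DEFAULT_REGION": "<region>",
        }
    }
)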
Re-opening for now. Let me know how this goes!
Hi @logannas,
This was fixed by this PR: https://github.com/ray-project/ray/pull/32576 (which didn’t make it into the 2.3.1 release)!
Please try running with ray-nightly. See here for instructions. I’ll close the issue for now, but feel free to re-open if it doesn’t work for you!