ray: [Release 1.11.0] job submission error
On the releases/1.11.0 branch, there are job submission errors in rte_ray_client and train_small:
https://buildkite.com/ray-project/periodic-ci/builds/2788#6a73aaf1-80f7-40bf-9b8b-0f21c91e6e57/136-545
https://buildkite.com/ray-project/periodic-ci/builds/2788#1bdebe61-370e-4d93-a979-402732826c34/136-542
These look like a mismatch in command handling between the product and the job client. Can anyone advise on which commit to cherry-pick to fix this? E.g. would it be #22011, #22209, or something else? cc @edoakes @simon-mo @krfricke. Assigning to @architkulkarni as Platform oncall.
About this issue
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 16 (16 by maintainers)
Picking https://github.com/ray-project/ray/pull/22011 sounds good. One possibility is that rte_ray_client and train_small use Ray client (use_connect: True), and the codepath for that in e2e.py is different. I will send out a PR.

IIUC, the previous job command before the wait_cluster.py call installs awscli and copies wait_cluster.py and other local files to the Anyscale session: https://github.com/ray-project/ray/blob/8b1bbfe8e438a06bf2f9fe2cbf65f163d64227dd/release/e2e.py#L506-L512 Because the job fails, the wait_cluster.py file is missing.

If it doesn't work, you can bring in this commit, which does not use ray job submit for the runners: https://github.com/ray-project/ray/commit/ac00389cbe4c58573a6b5949d6eba3bf13387f01
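
For anyone following the failure mode discussed in this thread (a setup command fails, so the file the next step needs is never copied), here is a minimal sketch of running dependent steps in order and stopping at the first failure, so the error points at the setup step rather than at a missing wait_cluster.py. This is not the logic in e2e.py; `run_step`, `run_pipeline`, and the stand-in commands are hypothetical names used only for illustration.

```python
import subprocess
import sys


def run_step(name, cmd):
    """Run one command; raise with the step name if it exits non-zero."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"step {name!r} failed with exit code {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return result.stdout


def run_pipeline(steps):
    """Run (name, cmd) steps in order and stop at the first failure."""
    completed = []
    for name, cmd in steps:
        run_step(name, cmd)
        completed.append(name)
    return completed


# Hypothetical stand-ins for the real setup and wait commands.
setup_ok = [sys.executable, "-c", "print('files copied')"]
setup_broken = [sys.executable, "-c", "import sys; sys.exit(2)"]
wait_step = [sys.executable, "-c", "print('cluster ready')"]
```

With `setup_broken` as the first step, `run_pipeline` raises a `RuntimeError` naming the setup step and never reaches the wait step, which is roughly the signal that would have made the root cause easier to spot here.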