ray: [Release 1.11.0] job submission error

On the releases/1.11.0 branch, job submission errors occur in rte_ray_client and train_small: https://buildkite.com/ray-project/periodic-ci/builds/2788#6a73aaf1-80f7-40bf-9b8b-0f21c91e6e57/136-545 https://buildkite.com/ray-project/periodic-ci/builds/2788#1bdebe61-370e-4d93-a979-402732826c34/136-542

These look like a mismatch in command handling between the product and the job client. Can anyone advise which commit to cherry-pick to fix this? For example, would it be #22011, #22209, or something else? cc @edoakes @simon-mo @krfricke. Assigning to @architkulkarni as Platform oncall.

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 16 (16 by maintainers)

Most upvoted comments

Picking https://github.com/ray-project/ray/pull/22011 sounds good. One possibility is that rte_ray_client and train_small use Ray client (use_connect: True), and the codepath for that in e2e.py is different. I will send out a PR.

IIUC, the job command that runs before the wait_cluster.py call installs awscli and copies wait_cluster.py and other local files to the Anyscale session: https://github.com/ray-project/ray/blob/8b1bbfe8e438a06bf2f9fe2cbf65f163d64227dd/release/e2e.py#L506-L512 Because that job fails, the files are never copied, so wait_cluster.py is missing when the next command tries to run it.

If it doesn’t work, you can bring in this commit, which does not use ray job submit for the runners: https://github.com/ray-project/ray/commit/ac00389cbe4c58573a6b5949d6eba3bf13387f01
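
For anyone following the failure chain described above (the setup job fails, so wait_cluster.py never lands in the session and the next command cannot find it), here is a minimal sketch of the pattern in plain Python. It is not the code in release/e2e.py: `run_after_setup` and `SetupFailed` are hypothetical names, and the setup command is a stand-in that simply exits with an error. The idea is only that surfacing the setup failure directly gives a clearer error than a later "file not found".

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List


class SetupFailed(RuntimeError):
    """Raised when the setup step fails or leaves a required file missing."""


def run_after_setup(setup_cmd: List[str], required: Path, main_cmd: List[str]) -> int:
    """Run setup_cmd, check that `required` exists, then run main_cmd.

    Hypothetical helper for illustration; not taken from release/e2e.py.
    Returns the exit code of main_cmd.
    """
    setup = subprocess.run(setup_cmd)
    if setup.returncode != 0:
        # Stop here: the main command depends on files the setup step copies.
        raise SetupFailed(f"setup exited with code {setup.returncode}")
    if not required.exists():
        raise SetupFailed(f"setup succeeded but {required} is missing")
    return subprocess.run(main_cmd).returncode


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "wait_cluster.py"
        # Stand-in for the real setup job: it fails before copying the script.
        failing_setup = [sys.executable, "-c", "import sys; sys.exit(1)"]
        try:
            run_after_setup(failing_setup, script, [sys.executable, str(script)])
        except SetupFailed as err:
            print(f"caught: {err}")
```

With a check like this, the error points at the failed setup step rather than at the missing wait_cluster.py, which is only a symptom.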