ray: EOFError: Ran out of input on Kubernetes Cluster

What is the problem?

I deployed Ray on Kubernetes by following the documentation at https://docs.ray.io/en/master/cluster/kubernetes.html#interacting-with-a-ray-cluster. When I then submit a job with ray submit my-cluster.yaml myscript.py, it fails with EOFError: Ran out of input.

Ray Version: Latest as defined in nightly builds at https://hub.docker.com/r/rayproject/ray

Stacktrace

2021-03-13 13:06:46,093 INFO command_runner.py:171 -- NodeUpdater: example-cluster-ray-head-mtw85: Running kubectl -n ray exec -it example-cluster-ray-head-mtw85 -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python ~/cartpole2.py)'
Traceback (most recent call last):
  File "/home/ray/cartpole2.py", line 20, in <module>
    agent = ppo.PPOTrainer(config, env=SELECT_ENV)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 121, in __init__
    Trainer.__init__(self, config, env, logger_creator)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 513, in __init__
    super().__init__(config, logger_creator)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 98, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 607, in setup
    self.env_creator = _global_registry.get(ENV_CREATOR, env)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/registry.py", line 140, in get
    return pickle.loads(value)
EOFError: Ran out of input
command terminated with exit code 1
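
The last frame of the trace is pickle.loads(value) in ray/tune/registry.py. As a minimal sketch that does not involve Ray at all (the helper name try_unpickle is hypothetical and used only for illustration), the standard library raises this exact message when it is asked to unpickle an empty byte string. That would be consistent with the registry lookup returning an empty value for the environment creator, though it does not show why the value was empty.

```python
import pickle


def try_unpickle(payload: bytes) -> str:
    """Describe what pickle.loads does with the given payload."""
    try:
        value = pickle.loads(payload)
    except EOFError as exc:
        # An empty payload gives pickle nothing to read.
        return f"EOFError: {exc}"
    return f"ok: {value!r}"


if __name__ == "__main__":
    # Empty bytes reproduce the message from the stack trace.
    print(try_unpickle(b""))
    # A properly pickled value round-trips without error.
    print(try_unpickle(pickle.dumps("CartPole-v0")))
```

If this reading is right, the thing to inspect would be what the registry's key-value lookup returns on the head node before it reaches pickle.loads.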

Reproduction (REQUIRED)

1. Set up a Kubernetes cluster as documented in https://docs.ray.io/en/master/cluster/kubernetes.html#k8s-cluster-launcher
2. Save the file below and run it with ray submit <yaml-step-1> <saved-file.py>

import ray
import ray.rllib.agents.ppo as ppo
import os
import shutil

ray.util.connect("127.0.0.1:10001")

CHECKPOINT_ROOT = "tmp/ppo/cart"
shutil.rmtree(CHECKPOINT_ROOT, ignore_errors=True, onerror=None)

ray_results = os.getenv("HOME") + "/ray_results/"
shutil.rmtree(ray_results, ignore_errors=True, onerror=None)

SELECT_ENV = "CartPole-v0"

config = ppo.DEFAULT_CONFIG.copy()
config["log_level"] = "WARN"

agent = ppo.PPOTrainer(config, env=SELECT_ENV)

N_ITER = 40
s = "{:3d} reward {:6.2f}/{:6.2f}/{:6.2f} len {:6.2f} saved {}"

for n in range(N_ITER):
    result = agent.train()
    file_name = agent.save(CHECKPOINT_ROOT)

    print(s.format(
        n + 1,
        result["episode_reward_min"],
        result["episode_reward_mean"],
        result["episode_reward_max"],
        result["episode_len_mean"],
        file_name
    ))

I have verified my script runs in a clean environment and reproduces the issue.
I have verified the issue also occurs with the latest wheels.

About this issue

State: closed
Created 3 years ago
Comments: 15 (10 by maintainers)

Most upvoted comments

@DmitriGekhtman can you please follow up on this when you are back in office?

AmeerHajAli on Jun 14, 2021

@richardliaw / @sven1977 can you please answer Xavier’s question?

AmeerHajAli on Apr 26, 2021