ray: EOFError: Ran out of input on Kubernetes Cluster
What is the problem?
I deployed a Kubernetes setup with Ray by following the documentation at https://docs.ray.io/en/master/cluster/kubernetes.html#interacting-with-a-ray-cluster. When I then submit a job with ray submit my-cluster.yaml myscript.py, it fails with EOFError: Ran out of input.
- Ray Version: Latest as defined in nightly builds at https://hub.docker.com/r/rayproject/ray
Stacktrace
2021-03-13 13:06:46,093 INFO command_runner.py:171 -- NodeUpdater: example-cluster-ray-head-mtw85: Running kubectl -n ray exec -it example-cluster-ray-head-mtw85 -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python ~/cartpole2.py)'
Traceback (most recent call last):
  File "/home/ray/cartpole2.py", line 20, in <module>
    agent = ppo.PPOTrainer(config, env=SELECT_ENV)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 121, in __init__
    Trainer.__init__(self, config, env, logger_creator)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 513, in __init__
    super().__init__(config, logger_creator)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 98, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 607, in setup
    self.env_creator = _global_registry.get(ENV_CREATOR, env)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/registry.py", line 140, in get
    return pickle.loads(value)
EOFError: Ran out of input
command terminated with exit code 1
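The final frame in the trace is pickle.loads(value) in ray/tune/registry.py, and the message matches what Python's pickle module reports when it is handed no data. The following is a minimal sketch, independent of Ray, that shows this behaviour. It assumes (this is not confirmed anywhere in the report) that the registry lookup returned an empty byte string; try_unpickle is a hypothetical helper used only for illustration.

```python
import pickle


def try_unpickle(value: bytes):
    """Return the unpickled object, or a description of the EOFError if one is raised."""
    try:
        return pickle.loads(value)
    except EOFError as exc:
        # pickle raises EOFError when the input ends before a complete object is read.
        return f"EOFError: {exc}"


# Assumption: an empty payload stands in for whatever the registry returned.
empty_result = try_unpickle(b"")

# A properly pickled value round-trips without error.
valid_result = try_unpickle(pickle.dumps({"env": "CartPole-v0"}))

print(empty_result)
print(valid_result)
```

If this is what happened, the environment creator was never stored under the key the trainer looked up, or it was stored in a place the calling process could not read from.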
Reproduction (REQUIRED)
- Set up a Kubernetes cluster as documented in https://docs.ray.io/en/master/cluster/kubernetes.html#k8s-cluster-launcher
- Run the file below by saving it and executing it with
ray submit <yaml-step-1> <saved-file.py>
import ray
import ray.rllib.agents.ppo as ppo
import os
import shutil
ray.util.connect("127.0.0.1:10001")
CHECKPOINT_ROOT = "tmp/ppo/cart"
shutil.rmtree(CHECKPOINT_ROOT, ignore_errors=True, onerror=None)
ray_results = os.getenv("HOME") + "/ray_results/"
shutil.rmtree(ray_results, ignore_errors=True, onerror=None)
SELECT_ENV = "CartPole-v0"
config = ppo.DEFAULT_CONFIG.copy()
config["log_level"] = "WARN"
agent = ppo.PPOTrainer(config, env=SELECT_ENV)
N_ITER = 40
s = "{:3d} reward {:6.2f}/{:6.2f}/{:6.2f} len {:6.2f} saved {}"
for n in range(N_ITER):
    result = agent.train()
    file_name = agent.save(CHECKPOINT_ROOT)
    print(s.format(
        n + 1,
        result["episode_reward_min"],
        result["episode_reward_mean"],
        result["episode_reward_max"],
        result["episode_len_mean"],
        file_name
    ))
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
About this issue
- State: closed
- Created 3 years ago
- Comments: 15 (10 by maintainers)
@DmitriGekhtman can you please follow up on this when you are back in office?
@richardliaw / @sven1977 can you please answer Xavier’s question?