ray: [rllib] Atari broken in 0.7.5+ since RLlib chooses wrong neural net model by default
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
- Ray installed from (source or binary): binary
- Ray version: 0.7.6
- Python version: 3.6.8
- Exact command to reproduce:
python3 train.py -f pong-appo.yaml
(using the rllib train.py and the tuned APPO Pong YAML file)
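For context, this is roughly what that command does via the Python API. It is only a sketch, not the verbatim tuned config: the env and the 5M-step stop come from this report, the worker counts from the yaml's own comment quoted below, and the real tuned yaml sets many more APPO hyperparameters.
```python
# Rough Python-API equivalent of the CLI command above (a sketch, not the
# verbatim tuned config): the env and 5M-step stop are from this report,
# num_workers / num_envs_per_worker from the yaml's own comment.
import ray
from ray import tune

ray.init()
tune.run(
    "APPO",
    stop={"timesteps_total": 5000000},
    config={
        "env": "PongNoFrameskip-v4",
        "num_workers": 32,
        "num_envs_per_worker": 8,
        "num_gpus": 1,
    },
)
```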
Describe the problem
When training finishes (terminating at 5M steps, as set in the config), the mean reward is still around -20, which is roughly what a random agent scores. The comments in the tuned example say:
# This can reach 18-19 reward in ~5-7 minutes on a Titan XP GPU
# with 32 workers and 8 envs per worker. IMPALA, when ran with
# similar configurations, solved Pong in 10-12 minutes.
# APPO can also solve Pong in 2.5 million timesteps, which is
# 2x more efficient than that of IMPALA.
which I cannot reproduce.
Training itself seemed to go smoothly and I didn't see any errors, apart from "RuntimeWarning: Mean of empty slice." and "RuntimeWarning: invalid value encountered in double_scalars" at the beginning of training, as mentioned in #5520.
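Those two warnings are just numpy averaging an empty array (e.g. reward stats computed before any episode has finished), so they are presumably unrelated to the reward staying at -20. A minimal reproduction, assuming an older numpy:
```python
# The same two RuntimeWarnings can be reproduced with numpy alone: averaging
# an empty array (zero completed episodes) emits both messages on older numpy.
import numpy as np

np.mean([])  # RuntimeWarning: Mean of empty slice.
             # RuntimeWarning: invalid value encountered in double_scalars
             # returns nan
```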
Source code / logs
The final training step logs:
Result for APPO_PongNoFrameskip-v4_0:
  custom_metrics: {}
  date: 2019-10-31_18-56-51
  done: true
  episode_len_mean: 3710.01
  episode_reward_max: -18.0
  episode_reward_mean: -20.35
  episode_reward_min: -21.0
  episodes_this_iter: 88
  episodes_total: 5366
  experiment_id: e9ccd551521a44e287451f8d87dd7dbe
  hostname: test03-vgqp8
  info:
    learner:
      cur_lr: 0.0005000000237487257
      entropy: 1.7659618854522705
      mean_IS: 1.1852530241012573
      model: {}
      policy_loss: -0.003545303363353014
      var_IS: 0.21974682807922363
      var_gnorm: 23.188478469848633
      vf_explained_var: 0.0
      vf_loss: 0.01947147212922573
    learner_queue:
      size_count: 12504
      size_mean: 14.46
      size_quantiles:
      - 12.0
      - 13.0
      - 15.0
      - 16.0
      - 16.0
      size_std: 1.0432641084595982
    num_steps_replayed: 0
    num_steps_sampled: 5012800
    num_steps_trained: 9999200
    num_weight_syncs: 12532
    sample_throughput: 6554.589
    timing_breakdown:
      learner_dequeue_time_ms: 0.018
      learner_grad_time_ms: 137.841
      learner_load_time_ms: .nan
      learner_load_wait_time_ms: .nan
      optimizer_step_time_ms: 672.661
    train_throughput: 11854.045
  iterations_since_restore: 59
  node_ip: 192.168.2.40
  num_healthy_workers: 32
  off_policy_estimator: {}
  pid: 34
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_env_wait_ms: 10.214430495025196
    mean_inference_ms: 1.736408154661836
    mean_processing_ms: 0.5789328915422826
  time_since_restore: 632.1431384086609
  time_this_iter_s: 11.452256441116333
  time_total_s: 632.1431384086609
  timestamp: 1572548211
  timesteps_since_restore: 5012800
  timesteps_this_iter: 75200
  timesteps_total: 5012800
  training_iteration: 59
  trial_id: b183a16a
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/65 CPUs, 0/1 GPUs, 0.0/193.7 GiB heap, 0.0/39.6 GiB objects
Memory usage on this node: 24.5/60.0 GiB
Result logdir: /root/ray_results/pong-appo
Number of trials: 1 ({'TERMINATED': 1})
TERMINATED trials:
- APPO_PongNoFrameskip-v4_0: TERMINATED, [33 CPUs, 1 GPUs], [pid=34], 632 s, 59 iter, 5012800 ts, -20.4 rew
It seems it works in Ray 0.7.4, but not in 0.7.5+ (on Breakout the reward stays at <= 2 no matter how long training runs).
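The simplest mitigation is therefore to pin ray==0.7.4. Alternatively, given the title's diagnosis (the wrong default model being chosen for Atari), a workaround sketch, assuming the regression is in the automatic model selection, is to pin the vision net's conv stack explicitly in the model config. The filter spec below is RLlib's usual default for 84x84 Atari frames, stated here as an assumption rather than something confirmed in this issue:
```python
# Sketch of a possible workaround (assumption, not confirmed in this issue):
# pin the convolutional vision net explicitly instead of relying on RLlib's
# automatic model selection for Atari observation spaces.
config = {
    "model": {
        "dim": 84,               # resize Atari frames to 84x84
        "conv_filters": [        # [out_channels, [kernel_h, kernel_w], stride]
            [16, [8, 8], 4],
            [32, [4, 4], 2],
            [256, [11, 11], 1],  # RLlib's usual final "flattening" conv layer
        ],
    },
}
```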