ray: [rllib] Atari broken in 0.7.5+ since RLlib chooses wrong neural net model by default
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
- Ray installed from (source or binary): binary
- Ray version: 0.7.6
- Python version: 3.6.8
- Exact command to reproduce:
python3 train.py -f pong-appo.yaml
(using the rllib train.py and the tuned APPO Pong YAML file)
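For context, this is roughly what that command does via the Python API. It is only a sketch, not the verbatim tuned config: the env and the 5M-step stop come from this report, the worker counts from the yaml's own comment quoted below, and the real tuned yaml sets many more APPO hyperparameters.
```python
# Rough Python-API equivalent of the CLI command above (a sketch, not the
# verbatim tuned config): the env and 5M-step stop are from this report,
# num_workers / num_envs_per_worker from the yaml's own comment.
import ray
from ray import tune

ray.init()
tune.run(
    "APPO",
    stop={"timesteps_total": 5000000},
    config={
        "env": "PongNoFrameskip-v4",
        "num_workers": 32,
        "num_envs_per_worker": 8,
        "num_gpus": 1,
    },
)
```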
Describe the problem
When training finishes (terminating at 5M steps, as set in the config), the mean reward is still around -20, which is roughly what a random agent scores. The comments in the tuned example say:
# This can reach 18-19 reward in ~5-7 minutes on a Titan XP GPU
# with 32 workers and 8 envs per worker. IMPALA, when ran with
# similar configurations, solved Pong in 10-12 minutes.
# APPO can also solve Pong in 2.5 million timesteps, which is
# 2x more efficient than that of IMPALA.
which I cannot reproduce.
Training itself seemed to go smoothly and I didn't see any errors, apart from "RuntimeWarning: Mean of empty slice." and "RuntimeWarning: invalid value encountered in double_scalars" at the beginning of training, as mentioned in #5520.
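Those two warnings are just numpy averaging an empty array (e.g. reward stats computed before any episode has finished), so they are presumably unrelated to the reward staying at -20. A minimal reproduction, assuming an older numpy:
```python
# The same two RuntimeWarnings can be reproduced with numpy alone: averaging
# an empty array (zero completed episodes) emits both messages on older numpy.
import numpy as np

np.mean([])  # RuntimeWarning: Mean of empty slice.
             # RuntimeWarning: invalid value encountered in double_scalars
             # returns nan
```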
Source code / logs
The final training step logs:
Result for APPO_PongNoFrameskip-v4_0:
  custom_metrics: {}
  date: 2019-10-31_18-56-51
  done: true
  episode_len_mean: 3710.01
  episode_reward_max: -18.0
  episode_reward_mean: -20.35
  episode_reward_min: -21.0
  episodes_this_iter: 88
  episodes_total: 5366
  experiment_id: e9ccd551521a44e287451f8d87dd7dbe
  hostname: test03-vgqp8
  info:
    learner:
      cur_lr: 0.0005000000237487257
      entropy: 1.7659618854522705
      mean_IS: 1.1852530241012573
      model: {}
      policy_loss: -0.003545303363353014
      var_IS: 0.21974682807922363
      var_gnorm: 23.188478469848633
      vf_explained_var: 0.0
      vf_loss: 0.01947147212922573
    learner_queue:
      size_count: 12504
      size_mean: 14.46
      size_quantiles:
      - 12.0
      - 13.0
      - 15.0
      - 16.0
      - 16.0
      size_std: 1.0432641084595982
    num_steps_replayed: 0
    num_steps_sampled: 5012800
    num_steps_trained: 9999200
    num_weight_syncs: 12532
    sample_throughput: 6554.589
    timing_breakdown:
      learner_dequeue_time_ms: 0.018
      learner_grad_time_ms: 137.841
      learner_load_time_ms: .nan
      learner_load_wait_time_ms: .nan
      optimizer_step_time_ms: 672.661
    train_throughput: 11854.045
  iterations_since_restore: 59
  node_ip: 192.168.2.40
  num_healthy_workers: 32
  off_policy_estimator: {}
  pid: 34
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_env_wait_ms: 10.214430495025196
    mean_inference_ms: 1.736408154661836
    mean_processing_ms: 0.5789328915422826
  time_since_restore: 632.1431384086609
  time_this_iter_s: 11.452256441116333
  time_total_s: 632.1431384086609
  timestamp: 1572548211
  timesteps_since_restore: 5012800
  timesteps_this_iter: 75200
  timesteps_total: 5012800
  training_iteration: 59
  trial_id: b183a16a
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/65 CPUs, 0/1 GPUs, 0.0/193.7 GiB heap, 0.0/39.6 GiB objects
Memory usage on this node: 24.5/60.0 GiB
Result logdir: /root/ray_results/pong-appo
Number of trials: 1 ({'TERMINATED': 1})
TERMINATED trials:
- APPO_PongNoFrameskip-v4_0: TERMINATED, [33 CPUs, 1 GPUs], [pid=34], 632 s, 59 iter, 5012800 ts, -20.4 rew
It seems it works in Ray 0.7.4, but not in 0.7.5+ (on Breakout the reward stays at <= 2 no matter how long training runs).
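The simplest mitigation is therefore to pin ray==0.7.4. Alternatively, given the title's diagnosis (the wrong default model being chosen for Atari), a workaround sketch, assuming the regression is in the automatic model selection, is to pin the vision net's conv stack explicitly in the model config. The filter spec below is RLlib's usual default for 84x84 Atari frames, stated here as an assumption rather than something confirmed in this issue:
```python
# Sketch of a possible workaround (assumption, not confirmed in this issue):
# pin the convolutional vision net explicitly instead of relying on RLlib's
# automatic model selection for Atari observation spaces.
config = {
    "model": {
        "dim": 84,               # resize Atari frames to 84x84
        "conv_filters": [        # [out_channels, [kernel_h, kernel_w], stride]
            [16, [8, 8], 4],
            [32, [4, 4], 2],
            [256, [11, 11], 1],  # RLlib's usual final "flattening" conv layer
        ],
    },
}
```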