ray: rllib train -f compact-regression-test.yaml ERROR on APEX_BreakoutNoFrameskip and DQN_BreakoutNoFrameskip

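Is the following status dump expected when running rllib train -f compact-regression-test.yaml on ray 1.5.2 with tf=2.5.0 and tf-gpu=2.5.0? All APEX_BreakoutNoFrameskip-v4 and DQN_BreakoutNoFrameskip-v4 trials (8 of 24) end in ERROR. CC @amogkam @sven1977 @richardliaw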

== Status ==
Memory usage on this node: 4.1/58.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 62.0/128 CPUs, 7.0/8 GPUs, 0.0/323.52 GiB heap, 0.0/141.17 GiB objects (0.0/1.0 CPU_group_10_8c53b838f12138f3fadd8b17f852e4a4, 0.0/1.0 CPU_group_4_8c53b838f12138f3fadd8b17f852e4a4, 0.0/1.0 CPU_group_8_8c53b838f12138f3fadd8b17f852e4a4, 0.0/1.0 CPU_group_9_8c53b838f12138f3fadd8b17f852e4a4, 0.0/1.0 CPU_group_7_8c53b838f12138f3fadd8b17f852e4a4, 0.0/1.0 CPU_group_2_8c53b838f12138f3fadd8b17f852e4a4, 0.0/1.0 CPU_group_1_8c53b838f12138f3fadd8b17f852e4a4, 0.0/8.0 accelerator_type:T4, 0.0/1.0 GPU_group_8c53b838f12138f3fadd8b17f852e4a4, 0.0/1.0 CPU_group_5_8c53b838f12138f3fadd8b17f852e4a4, 0.0/1.0 GPU_group_0_8c53b838f12138f3fadd8b17f852e4a4, 0.0/1.0 CPU_group_0_8c53b838f12138f3fadd8b17f852e4a4, 0.0/11.0 CPU_group_8c53b838f12138f3fadd8b17f852e4a4, 0.0/1.0 CPU_group_3_8c53b838f12138f3fadd8b17f852e4a4, 0.0/1.0 CPU_group_6_8c53b838f12138f3fadd8b17f852e4a4, 0.0/1.0 CPU_group_0_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 CPU_group_9_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 CPU_group_4_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 GPU_group_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 CPU_group_3_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 CPU_group_7_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 CPU_group_2_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 CPU_group_5_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 CPU_group_6_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 CPU_group_10_98aec57c6e1dd57a41c824e4eba28755, 0.0/11.0 CPU_group_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 GPU_group_0_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 CPU_group_8_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 CPU_group_1_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 CPU_group_2_236921fdfbf972dccba2643e75f65645, 0.0/1.0 GPU_group_0_236921fdfbf972dccba2643e75f65645, 0.0/1.0 CPU_group_4_236921fdfbf972dccba2643e75f65645, 0.0/1.0 CPU_group_0_236921fdfbf972dccba2643e75f65645, 0.0/1.0 CPU_group_5_236921fdfbf972dccba2643e75f65645, 0.0/1.0 GPU_group_236921fdfbf972dccba2643e75f65645, 0.0/1.0 
CPU_group_1_236921fdfbf972dccba2643e75f65645, 0.0/1.0 CPU_group_3_236921fdfbf972dccba2643e75f65645, 0.0/6.0 CPU_group_236921fdfbf972dccba2643e75f65645, 0.0/6.0 CPU_group_fe71b1e5def4ef7cc4258c9468a9ba2c, 0.0/1.0 CPU_group_5_fe71b1e5def4ef7cc4258c9468a9ba2c, 0.0/1.0 CPU_group_2_fe71b1e5def4ef7cc4258c9468a9ba2c, 0.0/1.0 CPU_group_4_fe71b1e5def4ef7cc4258c9468a9ba2c, 0.0/1.0 GPU_group_fe71b1e5def4ef7cc4258c9468a9ba2c, 0.0/1.0 GPU_group_0_fe71b1e5def4ef7cc4258c9468a9ba2c, 0.0/1.0 CPU_group_0_fe71b1e5def4ef7cc4258c9468a9ba2c, 0.0/1.0 CPU_group_1_fe71b1e5def4ef7cc4258c9468a9ba2c, 0.0/1.0 CPU_group_3_fe71b1e5def4ef7cc4258c9468a9ba2c, 0.0/1.0 GPU_group_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 CPU_group_10_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 CPU_group_2_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 CPU_group_1_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 CPU_group_8_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 CPU_group_3_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 CPU_group_4_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 CPU_group_5_0bcc573809cb3c2bb478d4050a07668f, 0.0/11.0 CPU_group_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 CPU_group_9_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 CPU_group_0_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 GPU_group_0_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 CPU_group_6_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 CPU_group_7_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 CPU_group_2_22e10f19e3e68fdebbd2c57975deb822, 0.0/1.0 CPU_group_3_22e10f19e3e68fdebbd2c57975deb822, 0.0/1.0 CPU_group_5_22e10f19e3e68fdebbd2c57975deb822, 0.0/1.0 CPU_group_4_22e10f19e3e68fdebbd2c57975deb822, 0.0/1.0 GPU_group_22e10f19e3e68fdebbd2c57975deb822, 0.0/6.0 CPU_group_22e10f19e3e68fdebbd2c57975deb822, 0.0/1.0 CPU_group_1_22e10f19e3e68fdebbd2c57975deb822, 0.0/1.0 GPU_group_0_22e10f19e3e68fdebbd2c57975deb822, 0.0/1.0 CPU_group_0_22e10f19e3e68fdebbd2c57975deb822, 0.0/1.0 GPU_group_aa8d25e2c483c7ae568df84bce73b855, 0.0/1.0 CPU_group_6_aa8d25e2c483c7ae568df84bce73b855, 0.0/11.0 
CPU_group_aa8d25e2c483c7ae568df84bce73b855, 0.0/1.0 CPU_group_3_aa8d25e2c483c7ae568df84bce73b855, 0.0/1.0 CPU_group_1_aa8d25e2c483c7ae568df84bce73b855, 0.0/1.0 CPU_group_9_aa8d25e2c483c7ae568df84bce73b855, 0.0/1.0 CPU_group_5_aa8d25e2c483c7ae568df84bce73b855, 0.0/1.0 CPU_group_0_aa8d25e2c483c7ae568df84bce73b855, 0.0/1.0 CPU_group_8_aa8d25e2c483c7ae568df84bce73b855, 0.0/1.0 CPU_group_7_aa8d25e2c483c7ae568df84bce73b855, 0.0/1.0 CPU_group_10_aa8d25e2c483c7ae568df84bce73b855, 0.0/1.0 CPU_group_2_aa8d25e2c483c7ae568df84bce73b855, 0.0/1.0 GPU_group_0_aa8d25e2c483c7ae568df84bce73b855, 0.0/1.0 CPU_group_4_aa8d25e2c483c7ae568df84bce73b855)
Result logdir: /home/ray/ray_results/apex
Result logdir: /home/ray/ray_results/atari-a2c
Result logdir: /home/ray/ray_results/atari-basic-dqn
Result logdir: /home/ray/ray_results/atari-impala
Result logdir: /home/ray/ray_results/atari-ppo-tf
Result logdir: /home/ray/ray_results/atari-ppo-torch
Number of trials: 24/24 (8 ERROR, 7 RUNNING, 9 TERMINATED)
+-------------------------------------------+------------+---------------------+--------+------------------+----------+-----------+----------------------+----------------------+--------------------+
| Trial name                                | status     | loc                 |   iter |   total time (s) |       ts |    reward |   episode_reward_max |   episode_reward_min |   episode_len_mean |
|-------------------------------------------+------------+---------------------+--------+------------------+----------+-----------+----------------------+----------------------+--------------------|
| PPO_BreakoutNoFrameskip-v4_64743_00008    | RUNNING    | 192.168.72.3:88129  |    160 |         3390.94  |   800000 |   1.98    |                    9 |                    0 |            771.74  |
| PPO_BreakoutNoFrameskip-v4_64743_00009    | RUNNING    | 192.168.86.3:424549 |    147 |         3132.08  |   735000 |  18.05    |                  229 |                    3 |           1828.72  |
| PPO_BreakoutNoFrameskip-v4_64743_00010    | RUNNING    | 192.168.92.3:55847  |    150 |         3132.49  |   750000 |  10.98    |                   28 |                    0 |           1938.89  |
| PPO_BreakoutNoFrameskip-v4_64743_00011    | RUNNING    | 192.168.85.3:486125 |    100 |         2119.62  |   500000 |   6.68    |                   35 |                    0 |           1352.42  |
| A2C_BreakoutNoFrameskip-v4_64743_00016    | RUNNING    | 192.168.76.4:68     |    178 |         1796.36  |  2893000 |   1.5     |                    5 |                    0 |            796.8   |
| A2C_BreakoutNoFrameskip-v4_64743_00017    | RUNNING    | 192.168.78.5:120    |    144 |         1452.65  |  2268000 |   2.4     |                   11 |                    0 |            652.591 |
| A2C_BreakoutNoFrameskip-v4_64743_00019    | RUNNING    | 192.168.75.6:91     |     74 |          741.496 |  1207000 |   2.55357 |                   11 |                    0 |            661.473 |
| IMPALA_BreakoutNoFrameskip-v4_64743_00000 | TERMINATED |                     |    355 |         3609.17  | 11706500 | 361.86    |                  431 |                   15 |           8202.21  |
| IMPALA_BreakoutNoFrameskip-v4_64743_00001 | TERMINATED |                     |    357 |         3609.16  | 13093000 | 407.58    |                  820 |                  147 |          11542.9   |
| IMPALA_BreakoutNoFrameskip-v4_64743_00002 | TERMINATED |                     |    357 |         3600.9   | 12733000 | 406.83    |                  784 |                  249 |          10847     |
| IMPALA_BreakoutNoFrameskip-v4_64743_00003 | TERMINATED |                     |    358 |         3603.68  | 12738000 | 389.61    |                  445 |                   38 |          10630.2   |
| PPO_BreakoutNoFrameskip-v4_64743_00004    | TERMINATED |                     |    893 |         3602.84  |  4465000 |  17.08    |                   33 |                    5 |           2184.17  |
| PPO_BreakoutNoFrameskip-v4_64743_00005    | TERMINATED |                     |    944 |         3602.96  |  4720000 |  52.93    |                  353 |                    7 |           3125     |
| PPO_BreakoutNoFrameskip-v4_64743_00006    | TERMINATED |                     |    924 |         3600.57  |  4620000 |  30.81    |                   64 |                   11 |           3521.53  |
| APEX_BreakoutNoFrameskip-v4_64743_00012   | ERROR      |                     |        |                  |          |           |                      |                      |                    |
| APEX_BreakoutNoFrameskip-v4_64743_00013   | ERROR      |                     |     10 |          398.863 |   727840 |   5.78    |                   14 |                    1 |           1216.72  |
| APEX_BreakoutNoFrameskip-v4_64743_00014   | ERROR      |                     |     10 |          393.234 |   722720 |   5.95    |                   20 |                    0 |           1260.04  |
| APEX_BreakoutNoFrameskip-v4_64743_00015   | ERROR      |                     |     10 |          405.066 |   724480 |   5.32353 |                   15 |                    1 |           1192.84  |
| DQN_BreakoutNoFrameskip-v4_64743_00020    | ERROR      |                     |     16 |         1371.5   |   170000 |   2.29    |                   13 |                    0 |            874.5   |
| DQN_BreakoutNoFrameskip-v4_64743_00021    | ERROR      |                     |     16 |         1370.3   |   170000 |   3.79    |                   19 |                    0 |           1057.59  |
| DQN_BreakoutNoFrameskip-v4_64743_00022    | ERROR      |                     |     16 |         1374.14  |   170000 |   1.78    |                   11 |                    0 |            758.03  |
+-------------------------------------------+------------+---------------------+--------+------------------+----------+-----------+----------------------+----------------------+--------------------+
... 4 more trials not shown (2 TERMINATED, 1 ERROR)
Number of errored trials: 8
+-----------------------------------------+--------------+---------------------------------------------------------------------------------------------------------------+
| Trial name                              |   # failures | error file                                                                                                    |
|-----------------------------------------+--------------+---------------------------------------------------------------------------------------------------------------|
| APEX_BreakoutNoFrameskip-v4_64743_00012 |            1 | /home/ray/ray_results/apex/APEX_BreakoutNoFrameskip-v4_64743_00012_12_2021-08-22_15-02-12/error.txt           |
| APEX_BreakoutNoFrameskip-v4_64743_00013 |            1 | /home/ray/ray_results/apex/APEX_BreakoutNoFrameskip-v4_64743_00013_13_2021-08-22_14-58-57/error.txt           |
| APEX_BreakoutNoFrameskip-v4_64743_00014 |            1 | /home/ray/ray_results/apex/APEX_BreakoutNoFrameskip-v4_64743_00014_14_2021-08-22_14-54-42/error.txt           |
| APEX_BreakoutNoFrameskip-v4_64743_00015 |            1 | /home/ray/ray_results/apex/APEX_BreakoutNoFrameskip-v4_64743_00015_15_2021-08-22_15-02-45/error.txt           |
| DQN_BreakoutNoFrameskip-v4_64743_00020  |            1 | /home/ray/ray_results/atari-basic-dqn/DQN_BreakoutNoFrameskip-v4_64743_00020_20_2021-08-22_15-02-22/error.txt |
| DQN_BreakoutNoFrameskip-v4_64743_00021  |            1 | /home/ray/ray_results/atari-basic-dqn/DQN_BreakoutNoFrameskip-v4_64743_00021_21_2021-08-22_15-02-22/error.txt |
| DQN_BreakoutNoFrameskip-v4_64743_00022  |            1 | /home/ray/ray_results/atari-basic-dqn/DQN_BreakoutNoFrameskip-v4_64743_00022_22_2021-08-22_15-02-22/error.txt |
| DQN_BreakoutNoFrameskip-v4_64743_00023  |            1 | /home/ray/ray_results/atari-basic-dqn/DQN_BreakoutNoFrameskip-v4_64743_00023_23_2021-08-22_15-02-22/error.txt |
+-----------------------------------------+--------------+---------------------------------------------------------------------------------------------------------------+
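The table above only names the error.txt file for each failed trial; the actual tracebacks are inside those files. As a rough sketch for collecting them in one go, the snippet below pulls every error.txt path out of a pasted status dump and prints the last lines of each file. The function names (extract_error_files, tail, summarize) and the 20-line default are made up for illustration and are not part of Ray or RLlib; the script assumes it runs on the machine that actually holds /home/ray/ray_results.

```python
import re
from pathlib import Path

# Matches the absolute error.txt paths printed in the
# "Number of errored trials" table of the status dump.
ERROR_FILE_RE = re.compile(r"(/\S+/error\.txt)")


def extract_error_files(dump_text):
    """Return the error.txt paths found in a pasted status dump, in order."""
    seen = []
    for match in ERROR_FILE_RE.finditer(dump_text):
        path = match.group(1)
        if path not in seen:  # the same dump is often pasted more than once
            seen.append(path)
    return seen


def tail(path, max_lines=20):
    """Return the last max_lines lines of a file, or a note if it is missing."""
    p = Path(path)
    if not p.is_file():
        return f"(not found: {path})"
    lines = p.read_text(errors="replace").splitlines()
    return "\n".join(lines[-max_lines:])


def summarize(dump_text, max_lines=20):
    """Build one report with the tail of every error file named in the dump."""
    parts = []
    for path in extract_error_files(dump_text):
        parts.append(f"== {path} ==\n{tail(path, max_lines)}")
    return "\n\n".join(parts)
```

Saving the dump to a file (say status_dump.txt, a hypothetical name) and calling print(summarize(Path("status_dump.txt").read_text())) would give one block per errored trial, which is usually easier to attach to an issue than the status table alone.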

What is the problem?

Ray version and other system information (Python version, TensorFlow version, OS): ray 1.5.2, tf=2.5.0, tf-gpu=2.5.0

Reproduction (REQUIRED)

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 15 (14 by maintainers)

Most upvoted comments