ray: rllib train -f compact-regression-test.yaml ERROR on APEX_BreakoutNoFrameskip and DQN_BreakoutNoFrameskip
Is the following output expected when running Ray 1.5.2 with tensorflow 2.5.0 and tensorflow-gpu 2.5.0? All four APEX trials and all four DQN trials end in ERROR. CC @amogkam @sven1977 @richardliaw
== Status ==
Memory usage on this node: 4.1/58.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 62.0/128 CPUs, 7.0/8 GPUs, 0.0/323.52 GiB heap, 0.0/141.17 GiB objects (0.0/1.0 CPU_group_10_8c53b838f12138f3fadd8b17f852e4a4, 0.0/1.0 CPU_group_4_8c53b838f12138f3fadd8b17f852e4a4, 0.0/1.0 CPU_group_8_8c53b838f12138f3fadd8b17f852e4a4, 0.0/1.0 CPU_group_9_8c53b838f12138f3fadd8b17f852e4a4, 0.0/1.0 CPU_group_7_8c53b838f12138f3fadd8b17f852e4a4, 0.0/1.0 CPU_group_2_8c53b838f12138f3fadd8b17f852e4a4, 0.0/1.0 CPU_group_1_8c53b838f12138f3fadd8b17f852e4a4, 0.0/8.0 accelerator_type:T4, 0.0/1.0 GPU_group_8c53b838f12138f3fadd8b17f852e4a4, 0.0/1.0 CPU_group_5_8c53b838f12138f3fadd8b17f852e4a4, 0.0/1.0 GPU_group_0_8c53b838f12138f3fadd8b17f852e4a4, 0.0/1.0 CPU_group_0_8c53b838f12138f3fadd8b17f852e4a4, 0.0/11.0 CPU_group_8c53b838f12138f3fadd8b17f852e4a4, 0.0/1.0 CPU_group_3_8c53b838f12138f3fadd8b17f852e4a4, 0.0/1.0 CPU_group_6_8c53b838f12138f3fadd8b17f852e4a4, 0.0/1.0 CPU_group_0_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 CPU_group_9_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 CPU_group_4_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 GPU_group_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 CPU_group_3_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 CPU_group_7_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 CPU_group_2_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 CPU_group_5_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 CPU_group_6_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 CPU_group_10_98aec57c6e1dd57a41c824e4eba28755, 0.0/11.0 CPU_group_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 GPU_group_0_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 CPU_group_8_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 CPU_group_1_98aec57c6e1dd57a41c824e4eba28755, 0.0/1.0 CPU_group_2_236921fdfbf972dccba2643e75f65645, 0.0/1.0 GPU_group_0_236921fdfbf972dccba2643e75f65645, 0.0/1.0 CPU_group_4_236921fdfbf972dccba2643e75f65645, 0.0/1.0 CPU_group_0_236921fdfbf972dccba2643e75f65645, 0.0/1.0 CPU_group_5_236921fdfbf972dccba2643e75f65645, 0.0/1.0 GPU_group_236921fdfbf972dccba2643e75f65645, 0.0/1.0 
CPU_group_1_236921fdfbf972dccba2643e75f65645, 0.0/1.0 CPU_group_3_236921fdfbf972dccba2643e75f65645, 0.0/6.0 CPU_group_236921fdfbf972dccba2643e75f65645, 0.0/6.0 CPU_group_fe71b1e5def4ef7cc4258c9468a9ba2c, 0.0/1.0 CPU_group_5_fe71b1e5def4ef7cc4258c9468a9ba2c, 0.0/1.0 CPU_group_2_fe71b1e5def4ef7cc4258c9468a9ba2c, 0.0/1.0 CPU_group_4_fe71b1e5def4ef7cc4258c9468a9ba2c, 0.0/1.0 GPU_group_fe71b1e5def4ef7cc4258c9468a9ba2c, 0.0/1.0 GPU_group_0_fe71b1e5def4ef7cc4258c9468a9ba2c, 0.0/1.0 CPU_group_0_fe71b1e5def4ef7cc4258c9468a9ba2c, 0.0/1.0 CPU_group_1_fe71b1e5def4ef7cc4258c9468a9ba2c, 0.0/1.0 CPU_group_3_fe71b1e5def4ef7cc4258c9468a9ba2c, 0.0/1.0 GPU_group_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 CPU_group_10_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 CPU_group_2_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 CPU_group_1_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 CPU_group_8_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 CPU_group_3_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 CPU_group_4_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 CPU_group_5_0bcc573809cb3c2bb478d4050a07668f, 0.0/11.0 CPU_group_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 CPU_group_9_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 CPU_group_0_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 GPU_group_0_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 CPU_group_6_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 CPU_group_7_0bcc573809cb3c2bb478d4050a07668f, 0.0/1.0 CPU_group_2_22e10f19e3e68fdebbd2c57975deb822, 0.0/1.0 CPU_group_3_22e10f19e3e68fdebbd2c57975deb822, 0.0/1.0 CPU_group_5_22e10f19e3e68fdebbd2c57975deb822, 0.0/1.0 CPU_group_4_22e10f19e3e68fdebbd2c57975deb822, 0.0/1.0 GPU_group_22e10f19e3e68fdebbd2c57975deb822, 0.0/6.0 CPU_group_22e10f19e3e68fdebbd2c57975deb822, 0.0/1.0 CPU_group_1_22e10f19e3e68fdebbd2c57975deb822, 0.0/1.0 GPU_group_0_22e10f19e3e68fdebbd2c57975deb822, 0.0/1.0 CPU_group_0_22e10f19e3e68fdebbd2c57975deb822, 0.0/1.0 GPU_group_aa8d25e2c483c7ae568df84bce73b855, 0.0/1.0 CPU_group_6_aa8d25e2c483c7ae568df84bce73b855, 0.0/11.0 
CPU_group_aa8d25e2c483c7ae568df84bce73b855, 0.0/1.0 CPU_group_3_aa8d25e2c483c7ae568df84bce73b855, 0.0/1.0 CPU_group_1_aa8d25e2c483c7ae568df84bce73b855, 0.0/1.0 CPU_group_9_aa8d25e2c483c7ae568df84bce73b855, 0.0/1.0 CPU_group_5_aa8d25e2c483c7ae568df84bce73b855, 0.0/1.0 CPU_group_0_aa8d25e2c483c7ae568df84bce73b855, 0.0/1.0 CPU_group_8_aa8d25e2c483c7ae568df84bce73b855, 0.0/1.0 CPU_group_7_aa8d25e2c483c7ae568df84bce73b855, 0.0/1.0 CPU_group_10_aa8d25e2c483c7ae568df84bce73b855, 0.0/1.0 CPU_group_2_aa8d25e2c483c7ae568df84bce73b855, 0.0/1.0 GPU_group_0_aa8d25e2c483c7ae568df84bce73b855, 0.0/1.0 CPU_group_4_aa8d25e2c483c7ae568df84bce73b855)
Result logdir: /home/ray/ray_results/apex
Result logdir: /home/ray/ray_results/atari-a2c
Result logdir: /home/ray/ray_results/atari-basic-dqn
Result logdir: /home/ray/ray_results/atari-impala
Result logdir: /home/ray/ray_results/atari-ppo-tf
Result logdir: /home/ray/ray_results/atari-ppo-torch
Number of trials: 24/24 (8 ERROR, 7 RUNNING, 9 TERMINATED)
+-------------------------------------------+------------+---------------------+--------+------------------+----------+-----------+----------------------+----------------------+--------------------+
| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|-------------------------------------------+------------+---------------------+--------+------------------+----------+-----------+----------------------+----------------------+--------------------|
| PPO_BreakoutNoFrameskip-v4_64743_00008 | RUNNING | 192.168.72.3:88129 | 160 | 3390.94 | 800000 | 1.98 | 9 | 0 | 771.74 |
| PPO_BreakoutNoFrameskip-v4_64743_00009 | RUNNING | 192.168.86.3:424549 | 147 | 3132.08 | 735000 | 18.05 | 229 | 3 | 1828.72 |
| PPO_BreakoutNoFrameskip-v4_64743_00010 | RUNNING | 192.168.92.3:55847 | 150 | 3132.49 | 750000 | 10.98 | 28 | 0 | 1938.89 |
| PPO_BreakoutNoFrameskip-v4_64743_00011 | RUNNING | 192.168.85.3:486125 | 100 | 2119.62 | 500000 | 6.68 | 35 | 0 | 1352.42 |
| A2C_BreakoutNoFrameskip-v4_64743_00016 | RUNNING | 192.168.76.4:68 | 178 | 1796.36 | 2893000 | 1.5 | 5 | 0 | 796.8 |
| A2C_BreakoutNoFrameskip-v4_64743_00017 | RUNNING | 192.168.78.5:120 | 144 | 1452.65 | 2268000 | 2.4 | 11 | 0 | 652.591 |
| A2C_BreakoutNoFrameskip-v4_64743_00019 | RUNNING | 192.168.75.6:91 | 74 | 741.496 | 1207000 | 2.55357 | 11 | 0 | 661.473 |
| IMPALA_BreakoutNoFrameskip-v4_64743_00000 | TERMINATED | | 355 | 3609.17 | 11706500 | 361.86 | 431 | 15 | 8202.21 |
| IMPALA_BreakoutNoFrameskip-v4_64743_00001 | TERMINATED | | 357 | 3609.16 | 13093000 | 407.58 | 820 | 147 | 11542.9 |
| IMPALA_BreakoutNoFrameskip-v4_64743_00002 | TERMINATED | | 357 | 3600.9 | 12733000 | 406.83 | 784 | 249 | 10847 |
| IMPALA_BreakoutNoFrameskip-v4_64743_00003 | TERMINATED | | 358 | 3603.68 | 12738000 | 389.61 | 445 | 38 | 10630.2 |
| PPO_BreakoutNoFrameskip-v4_64743_00004 | TERMINATED | | 893 | 3602.84 | 4465000 | 17.08 | 33 | 5 | 2184.17 |
| PPO_BreakoutNoFrameskip-v4_64743_00005 | TERMINATED | | 944 | 3602.96 | 4720000 | 52.93 | 353 | 7 | 3125 |
| PPO_BreakoutNoFrameskip-v4_64743_00006 | TERMINATED | | 924 | 3600.57 | 4620000 | 30.81 | 64 | 11 | 3521.53 |
| APEX_BreakoutNoFrameskip-v4_64743_00012 | ERROR | | | | | | | | |
| APEX_BreakoutNoFrameskip-v4_64743_00013 | ERROR | | 10 | 398.863 | 727840 | 5.78 | 14 | 1 | 1216.72 |
| APEX_BreakoutNoFrameskip-v4_64743_00014 | ERROR | | 10 | 393.234 | 722720 | 5.95 | 20 | 0 | 1260.04 |
| APEX_BreakoutNoFrameskip-v4_64743_00015 | ERROR | | 10 | 405.066 | 724480 | 5.32353 | 15 | 1 | 1192.84 |
| DQN_BreakoutNoFrameskip-v4_64743_00020 | ERROR | | 16 | 1371.5 | 170000 | 2.29 | 13 | 0 | 874.5 |
| DQN_BreakoutNoFrameskip-v4_64743_00021 | ERROR | | 16 | 1370.3 | 170000 | 3.79 | 19 | 0 | 1057.59 |
| DQN_BreakoutNoFrameskip-v4_64743_00022 | ERROR | | 16 | 1374.14 | 170000 | 1.78 | 11 | 0 | 758.03 |
+-------------------------------------------+------------+---------------------+--------+------------------+----------+-----------+----------------------+----------------------+--------------------+
... 4 more trials not shown (2 TERMINATED, 1 ERROR)
Number of errored trials: 8
+-----------------------------------------+--------------+---------------------------------------------------------------------------------------------------------------+
| Trial name | # failures | error file |
|-----------------------------------------+--------------+---------------------------------------------------------------------------------------------------------------|
| APEX_BreakoutNoFrameskip-v4_64743_00012 | 1 | /home/ray/ray_results/apex/APEX_BreakoutNoFrameskip-v4_64743_00012_12_2021-08-22_15-02-12/error.txt |
| APEX_BreakoutNoFrameskip-v4_64743_00013 | 1 | /home/ray/ray_results/apex/APEX_BreakoutNoFrameskip-v4_64743_00013_13_2021-08-22_14-58-57/error.txt |
| APEX_BreakoutNoFrameskip-v4_64743_00014 | 1 | /home/ray/ray_results/apex/APEX_BreakoutNoFrameskip-v4_64743_00014_14_2021-08-22_14-54-42/error.txt |
| APEX_BreakoutNoFrameskip-v4_64743_00015 | 1 | /home/ray/ray_results/apex/APEX_BreakoutNoFrameskip-v4_64743_00015_15_2021-08-22_15-02-45/error.txt |
| DQN_BreakoutNoFrameskip-v4_64743_00020 | 1 | /home/ray/ray_results/atari-basic-dqn/DQN_BreakoutNoFrameskip-v4_64743_00020_20_2021-08-22_15-02-22/error.txt |
| DQN_BreakoutNoFrameskip-v4_64743_00021 | 1 | /home/ray/ray_results/atari-basic-dqn/DQN_BreakoutNoFrameskip-v4_64743_00021_21_2021-08-22_15-02-22/error.txt |
| DQN_BreakoutNoFrameskip-v4_64743_00022 | 1 | /home/ray/ray_results/atari-basic-dqn/DQN_BreakoutNoFrameskip-v4_64743_00022_22_2021-08-22_15-02-22/error.txt |
| DQN_BreakoutNoFrameskip-v4_64743_00023 | 1 | /home/ray/ray_results/atari-basic-dqn/DQN_BreakoutNoFrameskip-v4_64743_00023_23_2021-08-22_15-02-22/error.txt |
+-----------------------------------------+--------------+---------------------------------------------------------------------------------------------------------------+
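The status table only says where each `error.txt` lives, not what went wrong. As a rough sketch for gathering the actual tracebacks in one place, the helper below walks a results directory and returns the last few non-empty lines of every `error.txt` it finds. It assumes the `<results_root>/<experiment>/<trial_dir>/error.txt` layout visible in the paths above. The function name `collect_trial_errors` and its `tail_lines` parameter are made up for illustration and are not part of Ray.

```python
from pathlib import Path


def collect_trial_errors(results_root, tail_lines=5):
    """Map each trial directory name to the last lines of its error.txt.

    Assumes the layout <results_root>/<experiment>/<trial_dir>/error.txt,
    as seen in the error table above. Hypothetical helper, not a Ray API.
    """
    errors = {}
    pattern = "*/*/error.txt"
    for error_file in sorted(Path(results_root).expanduser().glob(pattern)):
        lines = error_file.read_text(errors="replace").splitlines()
        # Drop blank lines so the tail ends on the exception message.
        non_empty = [line for line in lines if line.strip()]
        errors[error_file.parent.name] = non_empty[-tail_lines:]
    return errors


# Example usage (the path is taken from the log above):
# for trial, tail in collect_trial_errors("/home/ray/ray_results").items():
#     print(f"== {trial} ==")
#     print("\n".join(tail))
```

Pasting the collected tails into the issue would show whether the APEX and DQN trials fail with the same exception.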
What is the problem?
Ray version and other system information (Python version, TensorFlow version, OS):
Reproduction (REQUIRED)
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
About this issue
- State: closed
- Created 3 years ago
- Comments: 15 (14 by maintainers)
@sven1977 check out this one: https://beta.anyscale.com/o/anyscale-internal/projects/prj_MyVW7bByg2bLbXNU6mupEDiL/clusters/ses_Zs3sW7dQQUKqi8bfA5tPBfUY
@AmeerHajAli ^