ray: [rllib] Inconsistent batch size and training slowdown
What is the problem?
There are two issues I’m seeing while trying to migrate from Ray 0.7.3 to Ray 0.9.0dev:
- Using the following tune config:
config.update({
    "use_pytorch": True,
    "num_workers": 10,
    "num_envs_per_worker": 5,
    "batch_mode": "complete_episodes",
    "rollout_fragment_length": 100,
    "train_batch_size": 5000,
})
I sometimes get training batches of size 500, sometimes 5000, and sometimes much larger. See these result snapshots from the console, for example:
== Status ==
Memory usage on this node: 8.8/62.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 11/12 CPUs, 0/1 GPUs, 0.0/36.08 GiB heap, 0.0/12.45 GiB objects
Result logdir: /home/kz430x/ray_results/841d9a36-9f61-11ea-b4a3-0242ac110002
Number of trials: 10 (9 PENDING, 1 RUNNING)
+----------------------------------+----------+-----------------+---------+-------+-------------------------------------+--------+------------------+------+----------+
| Trial name                       | status   | loc             |   gamma |    lr | model/custom_options/architecture   |   iter |   total time (s) |   ts |   reward |
|----------------------------------+----------+-----------------+---------+-------+-------------------------------------+--------+------------------+------+----------|
| A3C_FourLaneLCFTREnv_84276_00000 | RUNNING  | 172.17.0.2:6603 |   0.975 | 2e-05 | VOLVONET_V3                         |      1 |          109.081 |  519 | 0.181818 |
| A3C_FourLaneLCFTREnv_84276_00001 | PENDING  |                 |   0.975 | 2e-05 | VOLVONET_V4                         |        |                  |      |          |
| A3C_FourLaneLCFTREnv_84276_00002 | PENDING  |                 |   0.975 | 2e-05 | VOLVONET_V5                         |        |                  |      |          |
| A3C_FourLaneLCFTREnv_84276_00003 | PENDING  |                 |   0.975 | 2e-05 | VOLVONET_V6                         |        |                  |      |          |
| A3C_FourLaneLCFTREnv_84276_00004 | PENDING  |                 |   0.975 | 2e-05 | VOLVONET_V7                         |        |                  |      |          |
| A3C_FourLaneLCFTREnv_84276_00005 | PENDING  |                 |   0.975 | 2e-05 | VOLVONET_V3                         |        |                  |      |          |
| A3C_FourLaneLCFTREnv_84276_00006 | PENDING  |                 |   0.975 | 2e-05 | VOLVONET_V4                         |        |                  |      |          |
| A3C_FourLaneLCFTREnv_84276_00007 | PENDING  |                 |   0.975 | 2e-05 | VOLVONET_V5                         |        |                  |      |          |
| A3C_FourLaneLCFTREnv_84276_00008 | PENDING  |                 |   0.975 | 2e-05 | VOLVONET_V6                         |        |                  |      |          |
| A3C_FourLaneLCFTREnv_84276_00009 | PENDING  |                 |   0.975 | 2e-05 | VOLVONET_V7                         |        |                  |      |          |
+----------------------------------+----------+-----------------+---------+-------+-------------------------------------+--------+------------------+------+----------+

The remaining five snapshots differ only in the node memory usage and in the RUNNING trial's row (A3C_FourLaneLCFTREnv_84276_00000). Across all six snapshots, that row reads:

+--------+------------------+-------+----------+---------------------+
|   iter |   total time (s) |    ts |   reward |   node memory (GiB) |
|--------+------------------+-------+----------+---------------------|
|      1 |          109.081 |   519 | 0.181818 |            8.8/62.6 |
|      2 |          144.096 |  1052 | 0.173333 |            8.8/62.6 |
|      3 |          161.579 |  1581 | 0.16     |            8.8/62.6 |
|      5 |          245.442 | 10712 | 0.15     |            8.9/62.6 |
|      7 |          301.507 | 16688 | 0.14     |            9.0/62.6 |
|      8 |          332.717 | 17237 | 0.15     |            9.0/62.6 |
+--------+------------------+-------+----------+---------------------+
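To make the batch-size inconsistency easier to see, here is a small sketch that turns the cumulative `ts` column into timesteps per iteration. The (iter, ts) pairs are copied from the RUNNING trial's row in the six snapshots; the helper name `steps_per_iteration` is mine. Iterations 4 and 6 were not captured, so those gaps are averaged over two iterations.

```python
# (iter, ts) pairs copied from the RUNNING trial's row in each console snapshot.
snapshots = [(1, 519), (2, 1052), (3, 1581), (5, 10712), (7, 16688), (8, 17237)]


def steps_per_iteration(snaps):
    """Average timesteps per iteration between consecutive snapshots."""
    result = []
    for (i0, ts0), (i1, ts1) in zip(snaps, snaps[1:]):
        # Divide by the iteration gap because some iterations were not captured.
        result.append(((i0, i1), (ts1 - ts0) / (i1 - i0)))
    return result


per_iter = steps_per_iteration(snapshots)
for (i0, i1), avg in per_iter:
    print(f"iter {i0} -> {i1}: ~{avg:.1f} timesteps per iteration")
```

Some iterations add roughly 500 timesteps and others several thousand on average, which matches what I described.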
- Training with the same environment, the same RLlib trainer, and the same model takes roughly 50% more time (about 33% lower throughput) compared to Ray 0.7.3.
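For clarity on how the two percentages relate: if the same work takes 1.5 times as long, throughput falls to 1/1.5 of what it was. A minimal sketch of that arithmetic (the function name is only illustrative):

```python
def throughput_drop(extra_time_fraction):
    """Fractional throughput loss when the same work takes (1 + x) times as long."""
    return 1.0 - 1.0 / (1.0 + extra_time_fraction)


# 50% more wall-clock time for the same amount of training.
drop = throughput_drop(0.5)
print(f"{drop:.1%}")  # prints 33.3%
```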
Ray is on 0.9.0dev
Reproduction (REQUIRED)
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
About this issue
- State: closed
- Created 4 years ago
- Comments: 20 (14 by maintainers)
OK, new insights from playing around the whole day (for the sake of whoever is going to read this):
The above is a complete mistake. AsyncGradientsOptimizer was never implemented to accumulate gradients and only then apply them, batched, to the local worker’s model. It was actually doing the same thing I described above for 0.8.x, so the logic hasn’t changed! Sorry for all the mess.
I’ve implemented such a version in 0.8.7, where the local worker accumulates (sums) gradients until a “full batch” is accumulated, and only then applies them and publishes the new weights to all workers. Running on a single environment implementation, I actually observed a decrease in performance (learning is less robust).
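Roughly, the accumulate-then-apply idea looks like the sketch below. This is not the code I ran and not RLlib's API: `GradientAccumulator`, its arguments, and the plain SGD update are all assumptions for illustration, with weights and gradients as flat lists of floats.

```python
class GradientAccumulator:
    """Hypothetical helper: sums worker gradients until a full batch is covered."""

    def __init__(self, weights, full_batch_size, lr):
        self.weights = list(weights)
        self.full_batch_size = full_batch_size
        self.lr = lr
        self._grad_sum = [0.0] * len(self.weights)
        self._steps_seen = 0

    def add(self, grads, num_steps):
        """Accumulate one fragment's gradients.

        Returns the new weights to publish once a full batch has been
        accumulated, otherwise None (nothing is applied yet).
        """
        self._grad_sum = [s + g for s, g in zip(self._grad_sum, grads)]
        self._steps_seen += num_steps
        if self._steps_seen < self.full_batch_size:
            return None
        # Assumed update rule: one plain SGD step on the summed gradients.
        self.weights = [w - self.lr * g for w, g in zip(self.weights, self._grad_sum)]
        self._grad_sum = [0.0] * len(self.weights)
        self._steps_seen = 0
        return list(self.weights)


acc = GradientAccumulator(weights=[1.0, 2.0], full_batch_size=200, lr=0.5)
first = acc.add([0.5, 1.0], num_steps=100)   # half a batch: nothing applied
second = acc.add([0.5, 1.0], num_steps=100)  # full batch: apply and publish
print(first, second)
```

In the stock A3C behaviour, by contrast, each fragment's gradients are applied as soon as they arrive.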
As stated previously, in the current implementation of A3C, gradients are applied a single rollout fragment at a time. Given this, an “iteration” is a reporting-only concept. The iteration length can be controlled via min_iter_time_s for a time cap and via timesteps_per_iteration for a sim-step cap (which is what I was originally looking for, before I learned the meaning of async gradients here…)
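As a sketch of how those two settings bound a reporting iteration: the config keys are the ones named above, but the values and the `iteration_done` helper are illustrative, and the assumption that both thresholds must be reached before a result is reported is my reading rather than something verified here.

```python
# Illustrative values only; the keys are the ones discussed above.
config = {
    "min_iter_time_s": 10,            # report no more often than every 10 s
    "timesteps_per_iteration": 5000,  # and only after at least 5000 sim steps
}


def iteration_done(elapsed_s, steps_this_iter, cfg):
    """Hypothetical check: end the reporting iteration once both thresholds are met."""
    return (
        elapsed_s >= cfg["min_iter_time_s"]
        and steps_this_iter >= cfg["timesteps_per_iteration"]
    )


print(iteration_done(12, 500, config))   # enough time, too few steps
print(iteration_done(12, 5000, config))  # both thresholds reached
```

Neither setting changes how often gradients are applied; they only change when a result row is reported.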
My slowdowns in 0.8.x are ultimately a consequence of not being able to run with sample_async=true when using PyTorch. This is a result of migrating to ModelV2 and decoupling the value function’s output from .forward(). It seems like quite a loss for PyTorch users, since PyTorch is already thread-safe, and if forward() returned the value function estimates, sample_async could have been set to true for PyTorch as well.
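For readers unfamiliar with what asynchronous sampling buys you, here is a generic sketch of the pattern: rollouts are produced on a background thread and handed over through a queue, so collection overlaps with whatever the main thread is doing. This is not RLlib's sampler; `AsyncSamplerSketch` and `fake_rollout` are hypothetical names, and no model is involved.

```python
import queue
import threading


def fake_rollout(i):
    """Stand-in for collecting one rollout fragment from an environment."""
    return {"fragment_id": i, "steps": 100}


class AsyncSamplerSketch(threading.Thread):
    """Hypothetical background sampler that feeds fragments into a queue."""

    def __init__(self, num_fragments, maxsize=5):
        super().__init__(daemon=True)
        self.num_fragments = num_fragments
        self.out = queue.Queue(maxsize=maxsize)

    def run(self):
        for i in range(self.num_fragments):
            self.out.put(fake_rollout(i))  # blocks if the consumer falls behind
        self.out.put(None)  # sentinel: no more fragments


sampler = AsyncSamplerSketch(num_fragments=3)
sampler.start()

collected = []
while True:
    item = sampler.out.get()
    if item is None:
        break
    collected.append(item["fragment_id"])
sampler.join()
print(collected)
```

In a real setup the background thread would also run the policy's forward pass, which is why thread safety of the model matters.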
I’m going to close this issue and open a new one for discussion around enabling sample_async for PyTorch.