stable-baselines3: [Bug] NaNs if a single observation is sampled from PPO's rollout_buffer

šŸ› Bug

I am using PPO and am adapting a few parts of it to experiment with an unusual adversary setting, including, for now, an ad-hoc change that gives the rollout buffer a dynamic size. Doing this I found that PPO can run into NaN values whenever it samples a batch of size 1, due to this line in ppo.py (line 170):

                advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

where advantages.std() over a batch of size 1 is undefined: PyTorch applies Bessel's correction (dividing by n - 1), so the result is NaN.
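A quick way to see it (a minimal sketch in plain PyTorch, independent of SB3):

    import torch

    advantages = torch.tensor([0.5])  # a minibatch with a single sample
    # torch.Tensor.std() divides by n - 1 by default (Bessel's correction),
    # so for n = 1 the result is NaN
    print(advantages.std())  # tensor(nan)
    # the normalization then propagates the NaN into every advantage
    print((advantages - advantages.mean()) / (advantages.std() + 1e-8))  # tensor([nan])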

While my current scenario is of course nothing common, this seems to me like a case that could occur more generally: it happens whenever buffer_size % batch_size == 1 (for example, a rollout buffer of size batch_size + 1), and it would also affect anyone using a dynamically sized rollout buffer. It could also easily be accounted for.

To Reproduce

Run PPO with a batch_size of 1, or with a rollout buffer exactly one sample larger than batch_size (e.g. by setting n_steps to batch_size + 1).
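For example (a minimal sketch; the hyperparameters are chosen only so that the last minibatch of each epoch has size 1):

    from stable_baselines3 import PPO

    # buffer_size = n_envs * n_steps = 64 and batch_size = 63,
    # so 64 % 63 == 1 and each epoch ends with a single-sample minibatch
    model = PPO("MlpPolicy", "Pendulum-v1", n_steps=64, batch_size=63, verbose=1)
    model.learn(10_000)  # the loss eventually becomes NaN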

Expected behavior

I would suggest either discarding batches when only a single observation is left in the rollout buffer, or checking the size of the sampled batch and, when it is 1, setting the std to 1 (or to some minimum value).
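For the second option, the guard could look roughly like this (a sketch against the line quoted above, not an actual patch):

    # inside PPO's train loop, before the normalization quoted above
    if len(advantages) > 1:
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    # a single-sample batch is simply left unnormalized instead of producing NaNs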

System Info

stable-baselines3 v0.10.0

Checklist

  • I have checked that there is no similar issue in the repo (required)
  • I have read the documentation (required)
  • I have provided a minimal working example to reproduce the bug (required)

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 16 (15 by maintainers)

Most upvoted comments

Oh, maybe emails are rate limited, so you will receive these new emails in December and February respectively, lol šŸ˜„

On Tue, Oct 11, 2022, 04:26 Hugh Perkins @.***> wrote:

I sent that email in August šŸ˜›

On Tue, Oct 11, 2022, 04:25 Hugh Perkins @.***> wrote:

That’s a very old email. Not sure why it is appearing as a comment now…

On Tue, Oct 11, 2022, 04:09 Antonin RAFFIN @.***> wrote:

@hughperkins (https://github.com/hughperkins) I'm not sure I understand your comment… this issue was fixed by you in #1028 (https://github.com/DLR-RM/stable-baselines3/pull/1028)


I think there are two issues here that we can treat separately:

  • how to avoid the NaNs => the advantage fix you allude to. I agree
  • should partial batches be skipped => this would be backwards incompatible, so it should be a new option that defaults to not skipping partial batches (see the sketch below).
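A minimal sketch of what that opt-in could look like at the sampler level (the function and its skip_partial flag are hypothetical, not part of the SB3 API):

    import numpy as np

    def minibatch_indices(buffer_size: int, batch_size: int, skip_partial: bool = False):
        # shuffle once per epoch, as the rollout buffer sampler does
        indices = np.random.permutation(buffer_size)
        for start in range(0, buffer_size, batch_size):
            batch = indices[start:start + batch_size]
            # skip_partial=False keeps today's behaviour (backwards compatible);
            # True drops the trailing, smaller minibatch
            if skip_partial and len(batch) < batch_size:
                continue
            yield batch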

For now, I'm just going to leave this code and these comments as they are. Maybe I will submit a PR for either or both of these. My own code works now anyway, and I've made the code I'm using available via this PR, if anyone else wants to use it.

On Tue, Aug 23, 2022, 09:20 Hugh Perkins @.***> wrote:

My buffer size is not fixed; I collect the rollout myself using a custom method, so whether the last minibatch has size 1 is effectively random. It fails with probability roughly 1/batch_size. I bumped the batch size up to 2048 and the NaNs became less frequent (only about one update in 2048 hits the bug). I thought this was because batch size 2048 stabilized the learning, but it was actually because the bug only occurs with probability 1/batch_size for me.

On Tue, Aug 23, 2022, 08:54 Antonin RAFFIN @.***> wrote:

(didn’t see your comment whilst working on the PR).

that's why we recommend discussing and agreeing on the solution in an issue first… I will still try to take a look at your PR in the coming days.

Thinking about it again, I think the best option would be to skip advantage normalization when the batch is of size 1.

But… do we want to train on partial batches?

we usually want to train on all the collected data.

The learning rate will be wrong for the partial batches,

that's true; at the same time, there should be only one partial batch.

if batch size is not a factor of the buffer size, then at each learning stage there is a probability of 1/batch_size of the buffer returning a batch of size 1 as the last batch

this should not be a probability. It returns a batch of size 1 only if buffer_size % batch_size == 1. For instance:

    from stable_baselines3 import PPO
    from stable_baselines3.common.env_util import make_vec_env

    # With one env
    n_envs = 1
    n_steps = 64
    batch_size = 63
    # With two
    # n_envs = 2
    # n_steps = 32
    # batch_size = 63

    buffer_size = n_envs * n_steps
    assert buffer_size % batch_size == 1

    env = make_vec_env("Pendulum-v1", n_envs=n_envs)
    # Fix1: normalize_advantage=False
    PPO("MlpPolicy", env, verbose=1, batch_size=batch_size, n_steps=n_steps).learn(10_000)
