stable-baselines: [bug] PPO2 episode reward summaries are written incorrectly for VecEnvs

When training with a VecEnv, the episode reward summaries are all concentrated on a few timesteps, with large jumps in between.

Zoomed out: [image]

Zoomed in: [image]

Every other summary looks fine: [image]

To reproduce, run PPO2 on a DummyVecEnv wrapping 8 copies of Pendulum-v0, i.e. DummyVecEnv([lambda: gym.make("Pendulum-v0") for _ in range(8)]).
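A minimal reproduction script, assuming the standard stable-baselines 2 API (the log directory and timestep budget are illustrative):

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

# 8 copies of Pendulum-v0 stepped sequentially in a single process
env = DummyVecEnv([lambda: gym.make("Pendulum-v0") for _ in range(8)])

model = PPO2("MlpPolicy", env, verbose=1, tensorboard_log="./ppo2_pendulum_tb/")
model.learn(total_timesteps=100000)
# Inspect the episode_reward scalar with: tensorboard --logdir ./ppo2_pendulum_tb/
```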

About this issue

  • State: open
  • Created 6 years ago
  • Reactions: 4
  • Comments: 16

Most upvoted comments

Hi, I also encountered some issues described in the comments above. A recap follows.

PPO2 TensorBoard visualization issues

If you run PPO2 with a single process, training for 256 timesteps (N=1, T=256), and try to visualize the episode reward and the optimization statistics:

  1. the episode_reward is shifted by T (instead of spanning [0,256], it is plotted in [256,512]), for the reason explained in https://github.com/hill-a/stable-baselines/issues/143#issuecomment-552952355 (sketched below)
  2. the loss statistics are associated with odd timesteps (e.g. [527,782]), obtained as a result of the timestep calculations highlighted in https://github.com/hill-a/stable-baselines/issues/143#issuecomment-584530173

[image: issue_10_marzo_1]
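A tiny numeric sketch of issue 1 (the counter bookkeeping below is assumed for illustration, following the linked comment): if the timestep counter is advanced by the batch size before the summaries are written, the first batch's episodes land in [T, 2T] instead of [0, T].

```python
T = 256                  # timesteps collected per batch (N=1)
num_timesteps = 0
num_timesteps += T       # counter is updated first...
done_idx = 200           # ...then an episode ending at local step 200 is logged
print(num_timesteps + done_idx)  # 456: written inside [256, 512], not [0, 256]
```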

Moreover, if you plot data from multiple processes (for instance, N=4 workers with T=256 timesteps per worker):

  1. the collected rewards are superimposed on the first T timesteps of each batch, followed by a jump of (N-1)*T timesteps in the plot (illustrated below)
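A runnable sketch of why this happens (variable names are illustrative, mirroring the shape of the episode-reward logger rather than its actual code): the global step counter advances by N*T per batch, while each worker's episode-end indices are local to [0, T), so all N workers write into the same T-wide window, leaving a (N-1)*T gap before the next batch.

```python
import numpy as np

N, T = 4, 256                            # workers, timesteps per worker
steps = 0
for batch in range(2):
    for env_idx in range(N):
        done_idx = np.array([100, 200])  # episode ends, local to [0, T)
        xs = steps + done_idx            # x positions actually written
        print(f"batch {batch}, worker {env_idx}: {xs}")
    steps += N * T                       # global counter jumps by N*T
# Every worker writes into [steps, steps + T); the remaining (N-1)*T is empty.
```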

Proposed solution for the PPO2 TensorBoard visualization issues

I implemented the following solutions for the visualization issues:

  1. decreasing the timestep index by the batch size before plotting
  2. simplifying the logic for plotting the optimization statistics:
    • each optimization consists of K epochs over N*T//M minibatches (where M is the number of training timesteps in a minibatch), so a fixed number of data points is collected during the optimization, namely K * N*T//M
    • to keep the episode reward and the optimization statistics visually comparable, these K * N*T//M optimization data points are distributed evenly over the batch size N*T
  3. adding a per-process offset (a numeric sketch of all three fixes follows this list)
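A numeric sketch of the three fixes (the bookkeeping here is hypothetical; the actual patch belongs in the logging code): reward indices are shifted back by the batch size and offset per worker, and the K * N*T//M loss values are spread evenly over the batch.

```python
import numpy as np

N, T, M, K = 4, 256, 64, 4    # workers, steps per worker, minibatch size, epochs
batch_size = N * T
steps = batch_size            # counter value after collecting the first batch

# Fixes 1 and 3: shift back by the batch size and offset each worker by T,
# so the first batch lands in [0, N*T) with the workers side by side.
done_idx = np.array([100, 200])   # example episode ends, local to [0, T)
for env_idx in range(N):
    xs = steps - batch_size + env_idx * T + done_idx
    print(f"worker {env_idx} reward x: {xs}")

# Fix 2: K * (N*T // M) loss values per batch, spread evenly over [0, N*T].
n_updates = K * (batch_size // M)                 # 64 updates in this example
loss_xs = steps - batch_size + np.arange(1, n_updates + 1) * batch_size // n_updates
print("loss x:", loss_xs)
```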

As a result, in the showcases above:

  1. the episode_reward is correctly plotted in [0,256]
  2. the loss statistics are plotted in [0,256] as well, evenly distributed

[image: issue_10_Marzo_3]

  3. the rewards collected by the N workers are plotted side by side

The modifications are few and straightforward. Regarding the side-by-side visualization of the rewards in the multiprocess case, do you think plotting the mean and variance of the collected data would be more appropriate instead?

If welcome, I would open a PR with the implemented modifications, which I can update if the mean-and-variance solution is preferred.

@paolo-viceconte thanks, I’ll try to take a look at what you did this week (unless @Miffyli can do it before me); we have too many issues related to that function (cf. all the linked issues).

@balintkozma

Thanks for the quick reply!

I think that could also be fixed in the same PR, as these two are related…

Ninja’d by araffin.

Not implemented yet; I will create a separate PR.

Please make only one PR that solves this issue.