stable-baselines3-contrib: Problems with MaskablePPO

🐛 Bug

Hi, I had problems with MaskablePPO, which I described here: https://github.com/DLR-RM/stable-baselines3/issues/1596. I thought I had found a solution in a comment on another issue: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/issues/81#issuecomment-1179910351. With that change the error no longer occurs, but at the same time the agent lost its ability to learn. Below are screenshots of the mean reward: the 150k- and 260k-timestep runs are from the version that raises the error, and the 4M-timestep run is from the version without the error. [Screenshots: 150kSteps, 260kSteps, wykresy (charts)] Unfortunately, I don't have screenshots of the run in which the agent managed to reach a mean reward of about -0.75 before the error occurred.

Code example

The only change I have made to the code since the last issue is the fix from #81:

        # Reinitialize with updated logits
        super().__init__(logits=logits, validate_args=False)

        # self.probs may already be cached, so we must force an update
        self.probs = logits_to_probs(self.logits)
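For readers unfamiliar with what this snippet touches, here is a minimal pure-Python sketch of the masking step and of the kind of simplex check that `validate_args=False` skips. It is an illustration under assumptions, not the sb3-contrib or PyTorch code: the names `masked_softmax`, `is_simplex`, and `HUGE_NEG`, and the `1e-6` tolerance, are chosen here for the example.

```python
import math

# Assumed stand-in for "minus infinity" on masked (invalid) actions.
HUGE_NEG = -1e8


def masked_softmax(logits, mask):
    """Return action probabilities with invalid actions (mask False) suppressed."""
    if not any(mask):
        raise ValueError("at least one action must be valid")
    masked = [l if m else HUGE_NEG for l, m in zip(logits, mask)]
    # Subtract the max for numerical stability before exponentiating.
    top = max(masked)
    exps = [math.exp(l - top) for l in masked]
    total = sum(exps)
    return [e / total for e in exps]


def is_simplex(probs, tol=1e-6):
    """Hypothetical check: all entries non-negative and the sum within tol of 1."""
    return all(p >= 0.0 for p in probs) and abs(sum(probs) - 1.0) < tol


# Four actions, of which actions 1 and 3 are invalid.
probs = masked_softmax([2.0, 1.0, 0.5, -1.0], [True, False, True, False])
valid = is_simplex(probs)
```

With validation disabled, a check like `is_simplex` is never run, so NaN or otherwise invalid probabilities would pass through without raising. Logging such a check during training may help tell whether the lost learning is related to invalid distributions or to something else.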

Relevant log output / Error message

No response

System Info

No response

Checklist

I have checked that there is no similar issue in the repo
I have read the documentation
I have provided a minimal and working example to reproduce the bug
I have checked my env using the env checker
I’ve used the markdown code blocks for both code and stack traces.

About this issue

State: open
Created a year ago
Comments: 16

Most upvoted comments

https://github.com/pytorch/pytorch/issues/87468

+1

yiptsangkin on Jul 22, 2023
