stable-baselines3-contrib: Problems with MaskablePPO

🐛 Bug

Hi, I had problems with MaskablePPO, which I described here: https://github.com/DLR-RM/stable-baselines3/issues/1596. I thought I had found a solution in one of the issues: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/issues/81#issuecomment-1179910351. With that change the error stopped occurring, but at the same time the agent lost its ability to learn. Below are screenshots of the mean rewards: the 150k and 260k timestep runs are for the case with the error, and the 4M timestep run is for the case without the error.

[Screenshots: 150kSteps, 260kSteps, plots]

Unfortunately, I don't have screenshots of the learning process in which the agent managed to reach a mean reward of about -0.75 before the error.

Code example

The only thing I changed in the code since the last issue is the solution from #81:

        # Reinitialize with updated logits
        super().__init__(logits=logits, validate_args=False)

        # self.probs may already be cached, so we must force an update
        self.probs = logits_to_probs(self.logits)
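
For context, here is a minimal, library-independent sketch of what action masking does to the logits and what a validity check on the resulting probabilities looks for. It assumes the original error was a failed constraint check on the action probabilities, which this report does not confirm. The names `masked_softmax`, `is_on_simplex`, `HUGE_NEG` and the tolerance are hypothetical and are not part of the sb3-contrib API. The point is only that switching validation off silences the check; it does not repair non-finite logits coming out of the policy network.

```python
import math

# Assumed fill value for masked-out actions (illustrative only).
HUGE_NEG = -1e8


def masked_softmax(logits, mask):
    """Softmax over `logits`, with actions where `mask` is False pushed to ~0."""
    if not any(mask):
        raise ValueError("at least one action must be valid")
    # Fail early on NaN/inf instead of silently propagating them.
    bad = [i for i, x in enumerate(logits) if not math.isfinite(x)]
    if bad:
        raise ValueError(f"non-finite logits at indices {bad}")
    masked = [x if keep else HUGE_NEG for x, keep in zip(logits, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


def is_on_simplex(probs, tol=1e-6):
    """True if all entries are non-negative and sum to 1 within `tol`."""
    return all(p >= 0.0 for p in probs) and abs(sum(probs) - 1.0) < tol


# Healthy logits: the masked action gets zero probability, the rest sum to 1.
probs = masked_softmax([1.0, 2.0, 3.0], [True, False, True])
print(probs, is_on_simplex(probs))

# NaN logits are reported here rather than hidden.
try:
    masked_softmax([float("nan"), 0.0, 0.0], [True, True, True])
except ValueError as err:
    print("caught:", err)
```

If a check like this fires during training, logging the offending logits and observations at that step may help narrow down where the non-finite values first appear.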

Relevant log output / Error message

No response

System Info

No response

Checklist
