stable-baselines3-contrib: Problems with MaskablePPO
🐛 Bug
Hi
I had problems with MaskablePPO, which I described here: https://github.com/DLR-RM/stable-baselines3/issues/1596. I thought I had found a solution in one of the issues: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/issues/81#issuecomment-1179910351. With that change the error stopped occurring, but at the same time the agent lost its ability to learn. Below are screenshots of mean rewards: the 150k and 260k timestep runs are for the case with the error, and the 4M timestep run is for the case without the error.
Unfortunately, I don't have screenshots of the learning process in which the agent managed to reach a mean reward of about -0.75 before the error occurred.
Code example
The only thing I changed in the code since the last issue is the solution from #81:
# Reinitialize with updated logits
super().__init__(logits=logits, validate_args=False)
# self.probs may already be cached, so we must force an update
self.probs = logits_to_probs(self.logits)
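To make it easier to reason about what the snippet above does numerically, here is a minimal plain-Python sketch of masking logits and recomputing the probabilities. It is not the sb3-contrib or PyTorch implementation: `apply_mask` and `is_valid_distribution` are hypothetical helper names, `logits_to_probs` is re-implemented here as an ordinary softmax, and the `HUGE_NEG` value is an assumption. The point it illustrates is that `validate_args=False` only skips the distribution's argument checks; if the logits already contain a non-finite value, the resulting probabilities are still invalid, which may be worth ruling out when the agent stops learning.

```python
import math

# Assumed stand-in for the large negative logit given to disallowed actions.
HUGE_NEG = -1e8


def logits_to_probs(logits):
    """Softmax over a list of logits (plain-Python stand-in, not the PyTorch helper)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_mask(logits, mask):
    """Hypothetical helper: replace logits of disallowed actions with HUGE_NEG."""
    return [x if allowed else HUGE_NEG for x, allowed in zip(logits, mask)]


def is_valid_distribution(probs, tol=1e-6):
    """Hypothetical check: all entries finite and non-negative, sum close to 1."""
    return (
        all(math.isfinite(p) and p >= 0.0 for p in probs)
        and abs(sum(probs) - 1.0) < tol
    )


logits = [2.0, 0.5, -1.0, 3.0]
mask = [True, False, True, False]  # actions 1 and 3 are not allowed

# Masked actions end up with probability 0; the rest are renormalized.
probs = logits_to_probs(apply_mask(logits, mask))
print(probs, is_valid_distribution(probs))

# A NaN in the network output propagates through masking and softmax,
# so skipping validation would hide the problem rather than fix it.
bad_probs = logits_to_probs(apply_mask([float("nan"), 0.5, -1.0, 3.0], mask))
print(bad_probs, is_valid_distribution(bad_probs))
```

Running a check like `is_valid_distribution` on the policy's probabilities during training (in the real code, on the tensor rather than a list) is one way to see whether the original error was pointing at a genuine numerical problem.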
Relevant log output / Error message
No response
System Info
No response
Checklist
- I have checked that there is no similar issue in the repo
- I have read the documentation
- I have provided a minimal and working example to reproduce the bug
- I have checked my env using the env checker
- I've used the markdown code blocks for both code and stack traces.
About this issue
- State: open
- Created a year ago
- Comments: 16
https://github.com/pytorch/pytorch/issues/87468