stable-baselines3: AtariWrapper does not use recommended defaults

The current AtariWrapper has terminal_on_life_loss set to True by default. This goes against the recommendations of Revisiting the Arcade Learning Environment (https://arxiv.org/pdf/1709.06009.pdf); I believe it should default to False. The paper also recommends using sticky actions instead of no-op resets, but that change is probably outside the scope of this wrapper.
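For reference, a minimal sketch of what the recommended configuration could look like with the current wrapper (the env id is just an example; sticky actions live in the ALE itself via `repeat_action_probability`, not in the wrapper, and older SB3 releases import `gym` rather than `gymnasium`):

```python
import gymnasium as gym  # older SB3 releases use `gym` instead
from stable_baselines3.common.atari_wrappers import AtariWrapper

# Recommended setup per Machado et al. (2018): stochasticity comes from
# sticky actions in the emulator, and life loss does NOT end the episode.
env = gym.make(
    "BreakoutNoFrameskip-v4",
    repeat_action_probability=0.25,  # sticky actions; ALE v5 envs enable this by default
)
env = AtariWrapper(env, terminal_on_life_loss=False)
```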

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 24 (11 by maintainers)

Most upvoted comments

Would it be possible to set up a system for people to contribute individual runs?

yes =D, that’s the whole point of the openrl benchmark initiative by @vwxyzjn (best is probably to open an issue there to keep track of things and ask for permission to add runs to the sb3 namespace).

Aside from that, would it make sense to change the defaults and put a disclaimer on the existing Atari results that they were run without sticky actions and with termination on life loss enabled? That way new results will use the good defaults,

my whole point for not changing is that I want the results first and will change the defaults once we have the baselines (I also wouldn’t call them “good defaults” but “new defaults”). But we can later tag all Atari runs that used v4 (see the discussion in https://github.com/DLR-RM/rl-baselines3-zoo/pull/336).
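As a rough illustration, retroactive tagging like this is doable with the W&B public API (the project path and tag name below are hypothetical, not the actual sb3 namespace layout):

```python
import wandb

api = wandb.Api()
# Hypothetical project path; adjust to the real sb3 namespace.
for run in api.runs("openrlbenchmark/sb3"):
    env_id = str(run.config.get("env", ""))
    if env_id.endswith("NoFrameskip-v4"):  # runs that used the old v4 defaults
        run.tags = sorted(set(run.tags) | {"atari-v4"})
        run.update()
```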

this is indeed good practice, but not everybody can afford to do so (students/independent researchers/small labs/countries where compute is actually much more expensive/…).

Sorry, I didn’t intend for that to sound like a moral argument, I meant to say that I haven’t worked on any papers that required it. I agree that having reliable baselines available is important for those reasons.

if we want to run on all Atari games, a lot… (you need 10 experiments × n_games × n_algorithms, and this grows quickly…)

That is certainly a lot. It’s unfortunate, but without baselines like this it’s hard for the community to adopt the recommendations from the paper. Would it be possible to set up a system for people to contribute individual runs? I certainly can’t afford all that compute myself, but I’d be willing to contribute runs where I can. Aside from sticky actions, I feel like that would be helpful for a bunch of things. If it’s possible to export runs from wandb, it shouldn’t be too hard to do (a sketch of such an export follows below).

Aside from that, would it make sense to change the defaults and put a disclaimer on the existing Atari results that they were run without sticky actions and with termination on life loss enabled? That way new results will use the good defaults, but if people want to compare to old results they’ll see the disclaimer and change their settings accordingly. Not perfect, but it might be a step in the right direction.
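On the wandb export mentioned above: the W&B public API can pull run histories, so contributed runs could be aggregated along these lines (the project path and metric keys are assumptions; `rollout/ep_rew_mean` is SB3’s default logger name for episodic return):

```python
import wandb

api = wandb.Api()
# Hypothetical project path and metric names.
for run in api.runs("some-entity/sb3-atari"):
    df = run.history(keys=["global_step", "rollout/ep_rew_mean"])
    df.to_csv(f"{run.id}.csv", index=False)
```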

Is the issue that people are using existing SB3 results in their papers, and might mistakenly attribute the charts that you have now?

yes, that’s the issue. And it’s not only about SB3 results but about all results published so far. You want to be able to compare apples to apples.

I personally don’t use results from a repository without rerunning the baselines myself, but I see why that could be a concern.

this is indeed good practice, but not everybody can afford to do so (students/independent researchers/small labs/countries where compute is actually much more expensive/…).

On that note, how much compute would it take to just run these baselines yourself?

if we want to run on all Atari games, a lot… (you need 10 experiments × n_games × n_algorithms, and this grows quickly…). @qgallouedec is currently running a full benchmark of the RL Zoo, and we are using a subset of Atari games (https://github.com/openrlbenchmark/openrlbenchmark/issues/7); all runs are logged using W&B and help is welcome =)
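For a rough sense of scale (the 57-game suite and the count of 3 algorithms are assumptions for illustration):

```python
# Back-of-the-envelope run count for a single change of defaults.
n_seeds, n_games, n_algorithms = 10, 57, 3  # 57 = standard ALE benchmark suite
print(n_seeds * n_games * n_algorithms)  # 1710 runs, each typically 10M+ frames
```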

TL;DR: sticky actions are the recommended way to prevent agents from exploiting determinism, not a way to improve rewards.

thanks for your answer =) My question was not about checking that it improves performance, but rather about knowing the impact (positive or negative) on performance. Because if you don’t have a baseline (I think the paper mostly benchmarked DQN), then it’s hard to compare things. My point is more that the changes were made to the Atari defaults, but we still have no idea what the difference in performance is or what the baseline results for those defaults are. As my quick experiments show, there might be significant changes, but we currently don’t know…
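For concreteness, sticky actions (as in the TL;DR quoted above) just mean the emulator repeats the previous action with some probability. A minimal illustrative wrapper, not the ALE’s or SB3’s actual implementation:

```python
import gymnasium as gym
import numpy as np


class StickyActions(gym.Wrapper):
    """Repeat the previous action with probability p (Machado et al., 2018).

    This breaks exploitable determinism without touching the reward
    structure, which is why it is a robustness measure rather than a
    score booster.
    """

    def __init__(self, env: gym.Env, repeat_action_probability: float = 0.25):
        super().__init__(env)
        self.p = repeat_action_probability
        self.rng = np.random.default_rng()
        self.last_action = 0

    def reset(self, **kwargs):
        self.last_action = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        if self.rng.random() < self.p:
            action = self.last_action  # ignore the agent's choice this step
        self.last_action = action
        return self.env.step(action)
```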

My concern, and I’m sure @JesseFarebro (the maintainer of the ALE) would agree, is that the settings in the Gym Atari environments were never really vetted to begin with, and that people doing future work with them should use what have been the recommended practices for years. This actually caused an issue for us when working with Atari games for an ICML paper, which is why Ryan created the issue.

I just realized that I should have put this issue in the actual Stable Baselines3 repo, but I guess it’s relevant here as well. I definitely understand the trade-off between using newer recommendations and preserving fair comparisons with previous work.