stable-baselines3: SubprocVecEnv does not call `reset()` when expected and `set_attr` behaves unexpectedly

šŸ› Bug

I have a custom `gym.Env` which passes all `check_env` checks. Having started this project months ago, I am running Gym 0.21.0 and SB3 1.8.0. My custom env has an attribute, `device_freeze`, which is read in the `reset()` method to trigger a change in the environment dynamics. The goal is to have a callback which, every `n` timesteps, interacts with my custom env and sets `device_freeze` to `False`. That way, on the following call to `reset()`, the environment's dynamics are changed. Please note that changing the dynamics at the end of an episode, and not during the episode, is more than a whim: changing the dynamics within an episode goes against the MDP formulation I am considering at the moment (or at least that's what I believe; I am open to theoretical feedback on this).

Here's the source code of the custom env (inheriting from another custom env, which in turn inherits from a `BaseEnv` subclassing `gym.Env`; it has been a long project ahahaha):

import numpy as np
from commons import BaseInterface
from .oscar import OscarEnv
from typing import Iterable, Text, Dict
from numpy.typing import NDArray

class MarcellaEnv(OscarEnv):
    def __init__(self,
                 [..., parent class args],
                 devices_and_bounds: Dict[Text, float] = {"device1": 50., "device2": 6., "device3": 7.}):
        """
        `devices_and_bounds` stores the devices used for multi-task training and their respective performance bounds.
        """
        self.device_freeze = True  # here the problematic attribute
        self.devices_and_bounds = devices_and_bounds

        super().__init__(
            [...parent args]
        )

    @property
    def name(self): 
        return "marcella"
    
    def change_device(self):
        """
        Change the target device based on random selection if device freeze is not enabled.

        Returns:
            None

        Note:
            The method randomly selects a new device from the available devices and updates the target
            device and the upper bound accordingly. This only happens if device freeze is not enabled.

        """
        if not self.device_freeze:
            new_device = np.random.choice(list(self.devices_and_bounds.keys()))
            self.target_device = new_device
            self.new_bound = self.devices_and_bounds[new_device]
            # we entered this branch because device_freeze was False; flip the switch back to True
            self.device_freeze = True

    def reset(self) -> NDArray:
        """Resets custom env attributes."""
        self._observation = self.observation_space.sample()
        self.change_device()
        self.update_current_net()

        self.timestep_counter = 0

        return self._get_obs()
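To make the intended mechanism concrete, here is a stripped-down, dependency-free sketch of how the flag is supposed to behave (`FlagSketch` and its attributes are stand-ins for the real env, not the actual implementation):

```python
import random

class FlagSketch:
    """Stand-in for MarcellaEnv, reduced to the device_freeze mechanism."""
    def __init__(self):
        self.device_freeze = True
        self.devices_and_bounds = {"device1": 50., "device2": 6., "device3": 7.}
        self.target_device = "device1"

    def change_device(self):
        # only switch device when the freeze has been lifted by the callback
        if not self.device_freeze:
            self.target_device = random.choice(list(self.devices_and_bounds.keys()))
            self.device_freeze = True  # re-arm the switch

    def reset(self):
        self.change_device()
        return self.target_device

env = FlagSketch()
env.reset()                # freeze is True -> device stays "device1"
env.device_freeze = False  # what the callback is supposed to achieve
env.reset()                # freeze was lifted -> a device is (re)drawn
print(env.device_freeze)   # True: the switch re-armed itself
```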

And here is the callback code:

"""Custom callbacks to be used during training to record the learning process."""
from stable_baselines3.common.callbacks import BaseCallback
import numpy as np
from stable_baselines3.common.vec_env import VecEnv
from typing import Text, Iterable

class MultiTask_Callback(BaseCallback): 
    """Custom callback inheriting from `BaseCallback`.

    :param verbose: (int) Verbosity level. 0: no output, 1: info, 2: debug.

    Performs various actions when triggered (intended to be a child of EventCallback): 
        1. Evaluates current policy (for n_eval_episodes)
        2. Updates a current best_policy variable
        3. Logs stuff on wandb. More details on what is logged in :meth:_on_step.
    """
    def __init__(
            self,
            verbose: int = 0):
        """Init function defines callback context."""
        super().__init__(verbose)
        
        self.devices_history = []

    def _on_step(self) -> bool:
        """
        This method will be called by the model after each call to `_env.step()`.
        For child callback (of an `EventCallback`), this will be called
        when the event is triggered.
        :return: (bool) If the callback returns False, training is aborted early.
        """
        # storing the current hardware used for training
        current_device = self.model.env.get_attr("target_device")

        # stores the target hardware the model has been currently training on
        self.devices_history.append(current_device)

        print(self.model.env.get_attr("device_freeze"), self.model.env.get_attr("target_device"))
        
        # flips the switch that prevents a different device from being chosen at episode init
        self.model.env.set_attr("device_freeze", False)
 
        return True
    
    def get_devices_history(self):
        """Returns the full history of hardware devices"""
        return self.devices_history

Whether I use `DummyVecEnv` or `SubprocVecEnv`, I obtain the following standard output from the callback execution within a training script:

[True] ['device1']
[False] ['device1']
[False] ['device1']
[False] ['device1']
[False] ['device1']
[False] ['device1']
...

This is not plausible: the print is triggered every 30 timesteps and the maximal number of timesteps per episode is set to 50, so at least one `reset()` (and hence one `change_device()` call) must happen between consecutive prints. However, if I modify the callback code so that:

self.model.env.set_attr("device_freeze", False)

becomes:

for env_idx in range(self.model.env.num_envs):
    # manually changing the attribute, since set_attr appears to have no effect here
    self.model.env.envs[env_idx].unwrapped.device_freeze = False

Then the output I obtain is:

[True] ['device1']
[False] ['device1']
[True] ['device2']
[True] ['device3']
[True] ['device3']
[False] ['device3']
[True] ['device1']
[True] ['device2']
[False] ['device2']
[True] ['device3']
[False] ['device3']

My issue is that this "forceful" fix clearly does not work for `SubprocVecEnv` (I have already tried), since `SubprocVecEnv` does not expose an `envs` attribute to iterate over.
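A possible workaround that should also work with `SubprocVecEnv` (purely my assumption: add a small helper method, here called `set_device_freeze`, to `MarcellaEnv` and invoke it from the callback via `self.model.env.env_method("set_device_freeze", False)`). Since method lookup falls through wrappers to the inner env, the helper would mutate the right object. A stdlib-only sketch with stand-in classes:

```python
class MarcellaEnvSketch:
    """Stand-in for MarcellaEnv."""
    def __init__(self):
        self.device_freeze = True

    def set_device_freeze(self, value: bool) -> None:
        # runs inside the env itself, so it mutates the env's own attribute
        self.device_freeze = value

class WrapperSketch:
    """Stand-in for Monitor or another gym wrapper."""
    def __init__(self, env):
        self.env = env

    def __getattr__(self, name):
        # attribute/method lookups fall through to the inner env
        return getattr(self.env, name)

wrapped = WrapperSketch(MarcellaEnvSketch())
# method lookup falls through the wrapper, so the *inner* env is updated:
wrapped.set_device_freeze(False)
print(wrapped.env.device_freeze)  # False
```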

What is very weird here is that `set_attr` changes the value of `device_freeze` only once; after that it looks like the call is ignored, even though the callback's `_on_step()` method is still executed (one printed line per call).
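For what it's worth, here is a stdlib-only sketch of what I suspect is happening (my assumption: the env is wrapped, e.g. in a `Monitor`, before vectorization, and `set_attr` boils down to a plain `setattr` on the outermost wrapper, which then shadows the inner env's attribute):

```python
class InnerEnv:
    """Stand-in for the unwrapped MarcellaEnv."""
    def __init__(self):
        self.device_freeze = True  # the attribute reset() actually reads

    def reset(self):
        # reset() consults the inner attribute, not the wrapper's copy
        return self.device_freeze

class Wrapper:
    """Stand-in for Monitor or another gym wrapper."""
    def __init__(self, env):
        self.env = env

    def __getattr__(self, name):
        # only consulted when the attribute is NOT found on the wrapper itself
        return getattr(self.env, name)

wrapped = Wrapper(InnerEnv())
setattr(wrapped, "device_freeze", False)  # what set_attr effectively does

print(wrapped.device_freeze)      # False -> what get_attr reports
print(wrapped.env.device_freeze)  # True  -> what reset() still sees
```

If this is right, it would explain the output: `get_attr` keeps reporting `False` (the wrapper's shadow copy), while `reset()` keeps seeing `True` and never switches device.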

Would really appreciate some help in figuring out why this happens!

Code example

Code in issue message.

Relevant log output / Error message

AttributeError: SubprocVecEnv does not have the envs attribute

System Info

Libraries installed via pip.

  • OS: Linux-5.13.0-52-generic-x86_64-with-glibc2.31 # 59~20.04.1-Ubuntu SMP Thu Jun 16 21:21:28 UTC 2022
  • Python: 3.10.8
  • Stable-Baselines3: 1.8.0
  • PyTorch: 2.0.1+cu117
  • GPU Enabled: True
  • Numpy: 1.24.3
  • Gym: 0.21.0

Checklist

  • I have checked that there is no similar issue in the repo
  • I have read the documentation
  • I have provided a minimal working example to reproduce the bug
  • I have checked my env using the env checker
  • I’ve used the markdown code blocks for both code and stack traces.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 20 (12 by maintainers)

Most upvoted comments

If you want to submit this improvement in the documentation, I encourage you to open a PR 😃