pfrl: Cannot pass multiple inputs to a recurrent policy with PPO

Here is the definition of my policy, which recurrently maps two inputs to an action and a value estimate. The policy takes two PackedSequences wrapped in a tuple. The model works (more or less) as I expected:

import torch
import torch.nn as nn
import pfrl
from pfrl.agents import PPO
from pfrl.policies import SoftmaxCategoricalHead

class Foo(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3),
            nn.ReLU(),
            nn.Conv2d(32, 64, 3),
            nn.ReLU(),
            nn.Flatten(),
        )

    def forward(self, x):
        # x is a tuple of (image batch, feature-vector batch).
        cnn_out = self.cnn(x[0])
        out = torch.cat((cnn_out, x[1]), 1)
        return out
 
foo = pfrl.nn.RecurrentSequential(
    Foo(),
    nn.GRU(num_layers=1, input_size=64 * 4 * 4 + 12, hidden_size=128),
    pfrl.nn.Branched(
        nn.Sequential(
            nn.Linear(128, 4),
            SoftmaxCategoricalHead(),
        ),
        nn.Linear(128, 1),
    ),
)
 
print(foo(
    (
        torch.nn.utils.rnn.pack_sequence(torch.rand(1, 32, 3, 8, 8)),
        torch.nn.utils.rnn.pack_sequence(torch.rand(1, 32, 12)),
    ),
    None,
))

I am trying to use this with PPO. This time I put the two tensors in a tuple, hoping that they would be converted to two PackedSequences inside the agent. However, the preprocessing of the tensors throws the following error:

opt = torch.optim.Adam(foo.parameters(), lr=2.5e-4, eps=1e-5)

def phi(x):
    # Identity feature extractor: observations are passed through unchanged.
    return x
 
agent = PPO(
        foo,
        opt,
        gpu=-1,
        phi=phi,
        update_interval=8,
        minibatch_size=32*8,
        epochs=4,
        clip_eps=0.1,
        clip_eps_vf=None,
        standardize_advantages=True,
        entropy_coef=1e-2,
        recurrent=True,
        max_grad_norm=0.5,
    )

agent.batch_act(
    (
        (torch.rand([1, 32, 3, 8, 8]), torch.rand([1, 32, 12]),)
     ,),
)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-70b107928cd6> in <module>()
      1 agent.batch_act(
      2     (
----> 3         (torch.rand([1, 32, 3, 8, 8]), torch.rand([1, 32, 12]),)
      4      ,),
      5 )

3 frames
/usr/local/lib/python3.6/dist-packages/pfrl/agents/ppo.py in batch_act(self, batch_obs)
    652     def batch_act(self, batch_obs):
    653         if self.training:
--> 654             return self._batch_act_train(batch_obs)
    655         else:
    656             return self._batch_act_eval(batch_obs)

/usr/local/lib/python3.6/dist-packages/pfrl/agents/ppo.py in _batch_act_train(self, batch_obs)
    706                     self.train_recurrent_states,
    707                 ) = one_step_forward(
--> 708                     self.model, b_state, self.train_prev_recurrent_states
    709                 )
    710             else:

/usr/local/lib/python3.6/dist-packages/pfrl/utils/recurrent.py in one_step_forward(rnn, batch_input, recurrent_state)
    139         object: New batched recurrent state.
    140     """
--> 141     pack = pack_one_step_batch_as_sequences(batch_input)
    142     y, recurrent_state = rnn(pack, recurrent_state)
    143     return unpack_sequences_as_one_step_batch(y), recurrent_state

/usr/local/lib/python3.6/dist-packages/pfrl/utils/recurrent.py in pack_one_step_batch_as_sequences(xs)
    115         return tuple(pack_one_step_batch_as_sequences(x) for x in xs)
    116     else:
--> 117         return nn.utils.rnn.pack_sequence(xs[:, None])
    118 
    119 

TypeError: list indices must be integers or slices, not tuple

The input tuple is converted to a list by pfrl.utils.batch_states(), which is called inside pfrl.agents.PPO._batch_act_train(). The list is then passed to pfrl.utils.recurrent.pack_one_step_batch_as_sequences(), which expects a tuple: it recurses into tuples, but anything else falls through to nn.utils.rnn.pack_sequence(xs[:, None]), and indexing a list with [:, None] raises the TypeError above. Maybe we could just collect multiple inputs in a tuple instead of a list in pfrl.utils.batch_states()?
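In the meantime, I think something like the following could work as a custom batching function passed via PPO's batch_states argument (a rough sketch, untested beyond my case; tuple_batch_states is my own helper name, not part of pfrl):

import torch
from pfrl.utils import batch_states as default_batch_states

def tuple_batch_states(states, device, phi):
    # Batch a list of tuple-observations into a tuple of batched tensors,
    # so that pack_one_step_batch_as_sequences can recurse into it.
    features = [phi(s) for s in states]
    if isinstance(features[0], tuple):
        return tuple(
            torch.stack([f[i] for f in features]).to(device)
            for i in range(len(features[0]))
        )
    # Non-tuple observations: defer to pfrl's default implementation.
    return default_batch_states(states, device, phi)

If I read pack_one_step_batch_as_sequences correctly, the agent would then be constructed with batch_states=tuple_batch_states, and batch_act should receive one (image, vector) tuple per environment, e.g. agent.batch_act([(torch.rand(3, 8, 8), torch.rand(12))]).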

I am still figuring out pfrl, so perhaps I am simply not passing multiple inputs to a recurrent policy the right way. Suggestions are welcome.

A runnable snippet is available here: https://colab.research.google.com/drive/1wqEtZTvwu0IN7oZnbrp34W7lBxVyGhp6?usp=sharing

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 18 (17 by maintainers)

Most upvoted comments

I had the exact same problem and couldn't find a way to pass a tuple observation to the agent. My observation space has a two-channel image and a vector of length 320. I ended up creating a three-channel image: I convert the vector to a 2D matrix and assign it to the third channel. In the forward pass of the network, I unpack the third channel and reshape it back into a vector.

Below is the policy:

class IMPALAGRU(nn.Module):
    def __init__(self, obs_dim=3):
        super(IMPALAGRU, self).__init__()

        # Image encoder over the first two channels (IMPALACNN defined elsewhere).
        self.image_encoder = IMPALACNN(obs_dim=obs_dim - 1)

    def forward(self, obs):
        images = obs[:, 0:2]
        # Unpack the vector: flatten the third channel and keep the
        # first 320 entries (the rest is padding).
        tactile = obs[:, 2]
        tactile = tactile.view(-1, 64 * 64)
        tactile = tactile[:, 0:320]

        image_features = self.image_encoder(images)
        features = torch.cat((image_features, tactile), dim=1)

        return features

model = pfrl.nn.RecurrentSequential(
    IMPALAGRU(),
    lecun_init(nn.GRU(num_layers=1, input_size=512 + 320, hidden_size=512)),
    pfrl.nn.Branched(
        nn.Sequential(
            lecun_init(nn.Linear(512, num_actions), 1e-2),
            SoftmaxCategoricalHead(),
        ),
        lecun_init(nn.Linear(512, 1))
    )
)
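
For reference, the packing on the environment side looks roughly like this (pack_obs is a simplified sketch with my assumed shapes, not the exact code I use):

import numpy as np

def pack_obs(images, tactile):
    # images: float32 array of shape (2, 64, 64); tactile: vector of length 320.
    # Zero-pad the vector to 64 * 64 entries, reshape it into a 64x64 plane,
    # and stack it as the third channel; IMPALAGRU.forward reverses this.
    plane = np.zeros(64 * 64, dtype=np.float32)
    plane[:320] = tactile
    return np.concatenate([images, plane.reshape(1, 64, 64)], axis=0)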