pfrl: Cannot pass multiple inputs to a recurrent policy with PPO
Here is the definition of my policy, which recurrently maps two inputs to an action distribution and a value estimate. The policy takes two PackedSequences in a tuple. The model works (more or less) as I expected.
import torch
import torch.nn as nn
import pfrl
from pfrl.policies import SoftmaxCategoricalHead

class Foo(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3),
            nn.ReLU(),
            nn.Conv2d(32, 64, 3),
            nn.ReLU(),
            nn.Flatten(),
        )

    def forward(self, x):
        # x is a tuple: x[0] is the image input, x[1] is the flat vector input
        cnn_out = self.cnn(x[0])
        out = torch.cat((cnn_out, x[1]), 1)
        return out
foo = pfrl.nn.RecurrentSequential(
    Foo(),
    nn.GRU(num_layers=1, input_size=64 * 4 * 4 + 12, hidden_size=128),
    pfrl.nn.Branched(
        nn.Sequential(
            nn.Linear(128, 4),
            SoftmaxCategoricalHead(),
        ),
        nn.Linear(128, 1),
    ),
)

print(foo(
    (
        torch.nn.utils.rnn.pack_sequence(torch.rand(1, 32, 3, 8, 8)),
        torch.nn.utils.rnn.pack_sequence(torch.rand(1, 32, 12)),
    ),
    None,
))
I am trying to use this with PPO. This time I put two tensors in a tuple, hoping they will be converted to two PackedSequences inside the agent. However, preprocessing of the tensors throws the following error:
from pfrl.agents import PPO

opt = torch.optim.Adam(foo.parameters(), lr=2.5e-4, eps=1e-5)

def phi(x):
    return x

agent = PPO(
    foo,
    opt,
    gpu=-1,
    phi=phi,
    update_interval=8,
    minibatch_size=32 * 8,
    epochs=4,
    clip_eps=0.1,
    clip_eps_vf=None,
    standardize_advantages=True,
    entropy_coef=1e-2,
    recurrent=True,
    max_grad_norm=0.5,
)

agent.batch_act(
    (
        (torch.rand([1, 32, 3, 8, 8]), torch.rand([1, 32, 12]),)
    ,),
)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-11-70b107928cd6> in <module>()
1 agent.batch_act(
2 (
----> 3 (torch.rand([1, 32, 3, 8, 8]), torch.rand([1, 32, 12]),)
4 ,),
5 )
/usr/local/lib/python3.6/dist-packages/pfrl/agents/ppo.py in batch_act(self, batch_obs)
652 def batch_act(self, batch_obs):
653 if self.training:
--> 654 return self._batch_act_train(batch_obs)
655 else:
656 return self._batch_act_eval(batch_obs)
/usr/local/lib/python3.6/dist-packages/pfrl/agents/ppo.py in _batch_act_train(self, batch_obs)
706 self.train_recurrent_states,
707 ) = one_step_forward(
--> 708 self.model, b_state, self.train_prev_recurrent_states
709 )
710 else:
/usr/local/lib/python3.6/dist-packages/pfrl/utils/recurrent.py in one_step_forward(rnn, batch_input, recurrent_state)
139 object: New batched recurrent state.
140 """
--> 141 pack = pack_one_step_batch_as_sequences(batch_input)
142 y, recurrent_state = rnn(pack, recurrent_state)
143 return unpack_sequences_as_one_step_batch(y), recurrent_state
/usr/local/lib/python3.6/dist-packages/pfrl/utils/recurrent.py in pack_one_step_batch_as_sequences(xs)
115 return tuple(pack_one_step_batch_as_sequences(x) for x in xs)
116 else:
--> 117 return nn.utils.rnn.pack_sequence(xs[:, None])
118
119
TypeError: list indices must be integers or slices, not tuple
The input tuple is converted to a list by pfrl.utils.batch_states(), which is called inside pfrl.agents.PPO._batch_act_train(). That list is then passed to pfrl.utils.recurrent.pack_one_step_batch_as_sequences(), which only recurses into tuples; a list falls through to the pack_sequence branch, where xs[:, None] fails. Maybe we could just collect multiple inputs in a tuple instead of a list in pfrl.utils.batch_states()?
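For example, something along these lines (an untested sketch, assuming PPO exposes the same batch_states keyword argument as other pfrl agents):

import torch

def tuple_batch_states(states, device, phi):
    # Each state is a tuple such as (image, vector). Batch each component
    # separately and return a *tuple* of stacked tensors, so that
    # pack_one_step_batch_as_sequences() recurses into it instead of hitting
    # the pack_sequence branch with a plain list.
    features = [phi(s) for s in states]
    return tuple(
        torch.stack([torch.as_tensor(c) for c in group]).to(device)
        for group in zip(*features)
    )

# agent = PPO(foo, opt, ..., batch_states=tuple_batch_states, recurrent=True)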
I am still figuring out pfrl, and perhaps I am not correctly passing multiple inputs to a recurrent policy. Suggestions are welcome.
The full snippet can be found here: https://colab.research.google.com/drive/1wqEtZTvwu0IN7oZnbrp34W7lBxVyGhp6?usp=sharing
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 18 (17 by maintainers)
I had the exact same problem and couldn't find a way to use a tuple as the observation for the agent. My observation space has a two-channel image and a vector of length 320. I ended up creating a three-channel image: I convert the vector data to a 2D matrix and assign it to the third channel of the image. In the forward pass of the network, I unpack the third channel and reshape it back into a vector.
Below is the policy:
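A minimal sketch of this pack-the-vector-into-an-extra-channel idea (illustrative only; the 18x18 image size and the zero padding are assumptions, not the original code):

import torch
import torch.nn as nn

IMG_SIZE = 18   # hypothetical image height/width (18 * 18 = 324 >= 320)
VEC_LEN = 320

def pack_obs(image, vector):
    # image: (2, 18, 18), vector: (320,) -> a single (3, 18, 18) observation
    padded = torch.zeros(IMG_SIZE * IMG_SIZE)
    padded[:VEC_LEN] = vector
    return torch.cat((image, padded.reshape(1, IMG_SIZE, IMG_SIZE)), dim=0)

class PackedObsEncoder(nn.Module):
    # Splits the combined observation back into its image and vector parts.
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(2, 32, 3), nn.ReLU(), nn.Flatten())
        self.fc = nn.Linear(32 * (IMG_SIZE - 2) ** 2 + VEC_LEN, 128)

    def forward(self, x):
        image = x[:, :2]                                        # first two channels
        vector = x[:, 2].reshape(x.shape[0], -1)[:, :VEC_LEN]   # drop the padding
        return torch.relu(self.fc(torch.cat((self.cnn(image), vector), dim=1)))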
Sure, I will also add some tests to https://github.com/pfnet/pfrl/blob/master/tests/utils_tests/test_batch_states.py .
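For instance, a test along these lines (a rough sketch; the helpers and conventions in that file may differ) could check that tuple observations are batched into a tuple, one stacked tensor per component:

import numpy as np
import torch
from pfrl.utils import batch_states

def test_batch_states_with_tuple_observations():
    # Each observation is an (image, vector) pair; the batched result should
    # keep the tuple structure with one stacked tensor per component.
    obs = [
        (np.random.rand(3, 8, 8).astype(np.float32),
         np.random.rand(12).astype(np.float32))
        for _ in range(4)
    ]
    batched = batch_states(obs, torch.device("cpu"), lambda x: x)
    assert isinstance(batched, tuple)
    assert batched[0].shape == (4, 3, 8, 8)
    assert batched[1].shape == (4, 12)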