Gymnasium: [Bug Report] `CartPole`'s reward function constantly returns 1 (even when it falls)
Describe the bug
The CartPole environment provides `reward == 1` when the pole “stands” and `reward == 1` when the pole has “fallen”.
The old gym documentation stated that this is the behavior, and the current documentation still does, which suggests it is intended, but I can find no evidence that this was the design goal.
The argument could be made that “state-of-the-art algorithms (e.g. DQN, A2C, PPO) can easily solve this environment anyway, so why bother?”. It is true that the environment is very easy to solve, but it is still important to examine why that is, and what the consequences are for these learning algorithms and potential future ones.
The reason the algorithms are able to learn a policy, despite the environment being effectively reward-free, is the sampling bias introduced by the number of actions taken per policy. If the agent has a “good policy”, then on average it stays alive longer than with a “bad policy”, so the “good policy” is sampled more often by the train() subroutine of the learning algorithm. As a result, a “good policy” is reinforced more than a “bad policy” (instead of the “good policy” being reinforced and the “bad policy” being reduced, which is what normally happens in RL).
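To make this concrete, here is a minimal, hypothetical sketch (not Gymnasium code; the episode lengths are made up) showing that under the current constant +1 reward every transition carries an identical learning signal, and two policies differ only in how many transitions they contribute to the training data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up episode lengths: a "good" policy keeps the pole up longer than a "bad" one.
good_policy_lengths = rng.integers(150, 500, size=100)
bad_policy_lengths = rng.integers(5, 30, size=100)

# With the current CartPole-v1 reward, every step (including the terminal one)
# yields +1, so the per-transition reward is identical for both policies...
per_step_reward = 1.0

# ...and the only difference is how many +1 transitions each policy contributes
# to the training data, i.e. the sampling bias described above.
print("transitions from the good policy:", good_policy_lengths.sum())
print("transitions from the bad policy: ", bad_policy_lengths.sum())
print("reward attached to each of them: ", per_step_reward)
```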
This would mean that RL algorithms that are not affected by this sampling bias (or are much less affected by it) would not be able to learn a policy for the CartPole-v1 environment. And CartPole is where many RL algorithms are tested during the prototyping phase, since it is usually assumed that a learning algorithm that cannot solve CartPole is probably a dud. Therefore, an algorithm that is less affected by this sampling bias might fail the prototyping phase because of the incorrect implementation of Gymnasium/CartPole.
Quick Performance Analysis
- DQN seems to benefit significantly from fixing this bug.
- A2C seems to benefit slightly from fixing this bug.
- PPO does not seem to be affected by this bug, as it can already easily solve the problem.
Note: The shaded area is the (min, max) episode length of each algorithm.
Suggestion
Suggestion 1 (the one I recommend)
Fix this bug, update CartPole to v2, and add an argument to recreate the old behavior.
Suggestion 2
Do not fix this bug, but update the description to clearly indicate that this is a “reward-free” variant of the CartPole environment.
Additional references
Old openai/gym related issues: https://github.com/openai/gym/issues/1682, https://github.com/openai/gym/issues/704, https://github.com/openai/gym/issues/21.
This issue has existed since gym=0.0.1 (hello world commit by gdb).
It is the same issue that the MuJoCo/Pendulum environments had: https://github.com/Farama-Foundation/Gymnasium/issues/500 and https://github.com/Farama-Foundation/Gymnasium/issues/526.
The original CartPole implementation by Sutton did not have a constant reward function, nor did the paper.
Code example
Python 3.11.5 (main, Sep 2 2023, 14:16:33) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import gymnasium
>>> env = gymnasium.make("CartPole-v1")
>>> env.reset()
(array([-0.04924537, 0.01854442, -0.0259891 , 0.00165513], dtype=float32), {})
>>> env.step(env.action_space.sample())
(array([-0.04887448, 0.21402927, -0.025956 , -0.29911307], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.0445939 , 0.4095114 , -0.03193826, -0.5998677 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.03640367, 0.2148505 , -0.04393562, -0.31741348], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.03210666, 0.02038097, -0.05028389, -0.03890362], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.03169904, -0.1739852 , -0.05106196, 0.2374999 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.03517874, -0.3683419 , -0.04631196, 0.5136492 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.04254558, -0.17259932, -0.03603898, 0.20673935], dtype=float32), 1.0, False, False, {})
>>>
>>> env.step(env.action_space.sample())
(array([-0.04599757, -0.3671879 , -0.03190419, 0.4878395 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.05334133, -0.5618455 , -0.0221474 , 0.77029914], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.06457824, -0.36642587, -0.00674142, 0.47073072], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.07190675, -0.17120934, 0.0026732 , 0.17593063], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.07533094, 0.02387426, 0.00619181, -0.11590779], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.07485346, 0.21890694, 0.00387366, -0.4066308 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.07047532, 0.41397375, -0.00425896, -0.69809 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.06219584, 0.21891111, -0.01822076, -0.40675083], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.05781762, 0.02405221, -0.02635578, -0.11986759], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.05733658, -0.17068242, -0.02875313, 0.1643852 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.06075022, 0.02483909, -0.02546543, -0.13722809], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.06025344, -0.16990903, -0.02820999, 0.14731336], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.06365162, -0.3646159 , -0.02526372, 0.4309648 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.07094394, -0.5593712 , -0.01664442, 0.7155777 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.08213136, -0.7542588 , -0.00233287, 1.0029755 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.09721654, -0.9493495 , 0.01772664, 1.2949249 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.11620353, -1.1446922 , 0.04362514, 1.5931041 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.13909738, -0.95011413, 0.07548722, 1.3143365 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.15809965, -1.1461059 , 0.10177395, 1.6296592 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.18102178, -1.342266 , 0.13436714, 1.9522467 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.2078671 , -1.148804 , 0.17341207, 1.7040544 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.23084317, -1.3454461 , 0.20749316, 2.0453217 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.2577521, -1.1529721, 0.2483996, 1.8233697], dtype=float32), 1.0, True, False, {})
# The last step terminates because the pole has fallen, but the reward we get is `1.0`, not `0.0`.
Additional context
Checklist
- I have checked that there is no similar issue in the repo
About this issue
- State: closed
- Created 7 months ago
- Comments: 16 (5 by maintainers)
I think my preferred solution would be to add a sutton_barto: bool = False parameter in __init__, where the reward function either matches Sutton and Barto or uses the previous implementation (defaulting to the previous implementation). This would allow us to keep v1 and just add a parameter to change the reward function for users to test with. Then we just update the documentation to reflect these changes. Does that work for you @Kallinteris-Andreas?
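For reference, a minimal sketch of the intended behaviour, written here as a standalone wrapper rather than the actual __init__ flag (the 0-reward-on-termination choice follows the Sutton & Barto formulation mentioned above; this is an illustration, not the merged implementation):

```python
import gymnasium as gym

class SuttonBartoReward(gym.Wrapper):
    """Illustrative wrapper: return 0 reward on the terminating transition
    instead of the usual +1 (assumed semantics of the proposed flag)."""

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if terminated:
            reward = 0.0  # the "fallen" step no longer gets rewarded
        return obs, reward, terminated, truncated, info

env = SuttonBartoReward(gym.make("CartPole-v1"))
obs, info = env.reset(seed=0)
terminated = truncated = False
while not (terminated or truncated):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
print("reward on the terminal step:", reward)  # 0.0 here, 1.0 in the unwrapped env
```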
I think @Kallinteris-Andreas's suggestion 1 is good, but with reward = 1 for non-terminated and reward = -1 for terminated. This will move the gradients up or down (away from 0) in both cases, which is good.
I would add to it that env.unwrapped.kinematics_integrator = False should be the default, or that this branch should be removed entirely and the physics done correctly; see https://github.com/openai/gym/issues/3254#issuecomment-1879083763. Changing this setting also reduces training time, because we then store the correct x and theta values in the state for training, rather than the previous step's x and theta, which is what it does now by default.
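For context, a rough sketch of the two integration branches being discussed (tau is the simulation timestep; variable names loosely follow the CartPole source, but this is an illustration, not a copy of it):

```python
def explicit_euler(x, x_dot, xacc, tau):
    # default branch: position is advanced with the *old* velocity,
    # so the stored x lags one step behind the applied acceleration
    x = x + tau * x_dot
    x_dot = x_dot + tau * xacc
    return x, x_dot

def semi_implicit_euler(x, x_dot, xacc, tau):
    # update the velocity first, then advance the position with the new velocity
    x_dot = x_dot + tau * xacc
    x = x + tau * x_dot
    return x, x_dot
```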
Additionally, changing the terminated reward to -1 helps reduce training cycles that contribute nothing to learning the policy, which is what happens when the terminated state is also given a reward of 1. Here’s why:
The initial Q-values of a randomly initialized network are near zero.
With the default reward of 1 for everything, even terminal states end up with Q-values that approach 1 over time. All of the initial training pushes both “good” middle-ish states and terminal states up to Q >= 1, and only above 1 do the Q-values start to differentiate and drive any learning.
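A small, hypothetical numeric illustration of that argument, using the standard DQN target y = r + gamma * (1 - done) * max_a Q(s', a) and assuming a freshly initialized network whose Q-values are near zero:

```python
gamma = 0.99
q_next_max = 0.0  # assumption: a randomly initialized network outputs ~0

# current behaviour: reward == 1 for every transition, terminal or not
y_nonterminal = 1.0 + gamma * q_next_max  # ~1.0
y_terminal = 1.0                          # also ~1.0 -> targets are indistinguishable

# proposed: reward == -1 on termination
y_terminal_fixed = -1.0                   # clearly separated from non-terminal targets

print(y_nonterminal, y_terminal, y_terminal_fixed)
```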
I just want to point out that with -100 for failure and -99 for survival, the optimal behavior would be to end the trajectory as soon as possible, and you would get a very different looking graph.