Gymnasium: [Bug Report] `CartPole`'s reward function constantly returns 1 (even when it falls)
Describe the bug
The CartPole environment provides `reward == 1` when the pole “stands” and `reward == 1` when the pole has “fallen”.
The old gym documentation stated that this is the behavior, and the current documentation still does, which suggests it is intended, but I can find no evidence that this was the design goal.
The argument could be made that “state-of-the-art algorithms (e.g. DQN, A2C, PPO) can easily solve this environment anyway, so why bother?”. It is true that the environment is very easy to solve, but it is still important to examine why that is, and what the consequences are for these learning algorithms and potential future ones.
The reason the algorithms are able to learn a policy, despite the environment being effectively reward-free, is the sampling bias introduced by the number of actions taken per policy. If the agent has a “good policy”, then on average it stays alive longer than with a “bad policy”, so the “good policy” is sampled more often by the train() subroutine of the learning algorithm. As a result, a “good policy” is reinforced more than a “bad policy” (instead of the “good policy” being reinforced and the “bad policy” being reduced, which is what normally happens in RL).
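To make this concrete, here is a minimal, hypothetical sketch (not Gymnasium code; the episode lengths are made up) showing that under the current constant +1 reward every transition carries an identical learning signal, and two policies differ only in how many transitions they contribute to the training data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up episode lengths: a "good" policy keeps the pole up longer than a "bad" one.
good_policy_lengths = rng.integers(150, 500, size=100)
bad_policy_lengths = rng.integers(5, 30, size=100)

# With the current CartPole-v1 reward, every step (including the terminal one)
# yields +1, so the per-transition reward is identical for both policies...
per_step_reward = 1.0

# ...and the only difference is how many +1 transitions each policy contributes
# to the training data, i.e. the sampling bias described above.
print("transitions from the good policy:", good_policy_lengths.sum())
print("transitions from the bad policy: ", bad_policy_lengths.sum())
print("reward attached to each of them: ", per_step_reward)
```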
This would mean that RL algorithms that are not affected by this sampling bias (or are much less affected by it) would not be able to learn a policy for the CartPole-v1 environment. And CartPole is where many RL algorithms are tested during the prototyping phase, since it is usually assumed that a learning algorithm that cannot solve CartPole is probably a dud. Therefore, an algorithm that is less affected by this sampling bias might fail the prototyping phase because of the incorrect implementation of Gymnasium/CartPole.
Quick Performance Analysis
- DQN seems to benefit significantly from fixing this bug.
- A2C seems to benefit slightly from fixing this bug.
- PPO does not seem to be affected by this bug, as it can already easily solve the problem.
Note: The shaded area is the (min, max) episode length of each algorithm.
Suggestion
Suggestion 1 (the one I recommend)
Fix this bug, update CartPole to v2, and add an argument to recreate the old behavior.
Suggestion 2
Do not fix this bug, but update the description to clearly indicate that this is a “reward-free” variant of the CartPole environment.
Additional references
Old openai/gym related issues: https://github.com/openai/gym/issues/1682, https://github.com/openai/gym/issues/704, https://github.com/openai/gym/issues/21.
This issue has existed since gym=0.0.1 (hello world commit by gdb).
It is the same issue that the MuJoCo/Pendulum environments had: https://github.com/Farama-Foundation/Gymnasium/issues/500 and https://github.com/Farama-Foundation/Gymnasium/issues/526.
The original CartPole implementation by Sutton did not have a constant reward function, nor did the paper.
Code example
Python 3.11.5 (main, Sep 2 2023, 14:16:33) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import gymnasium
>>> env = gymnasium.make("CartPole-v1")
>>> env.reset()
(array([-0.04924537, 0.01854442, -0.0259891 , 0.00165513], dtype=float32), {})
>>> env.step(env.action_space.sample())
(array([-0.04887448, 0.21402927, -0.025956 , -0.29911307], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.0445939 , 0.4095114 , -0.03193826, -0.5998677 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.03640367, 0.2148505 , -0.04393562, -0.31741348], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.03210666, 0.02038097, -0.05028389, -0.03890362], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.03169904, -0.1739852 , -0.05106196, 0.2374999 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.03517874, -0.3683419 , -0.04631196, 0.5136492 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.04254558, -0.17259932, -0.03603898, 0.20673935], dtype=float32), 1.0, False, False, {})
>>>
>>> env.step(env.action_space.sample())
(array([-0.04599757, -0.3671879 , -0.03190419, 0.4878395 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.05334133, -0.5618455 , -0.0221474 , 0.77029914], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.06457824, -0.36642587, -0.00674142, 0.47073072], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.07190675, -0.17120934, 0.0026732 , 0.17593063], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.07533094, 0.02387426, 0.00619181, -0.11590779], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.07485346, 0.21890694, 0.00387366, -0.4066308 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.07047532, 0.41397375, -0.00425896, -0.69809 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.06219584, 0.21891111, -0.01822076, -0.40675083], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.05781762, 0.02405221, -0.02635578, -0.11986759], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.05733658, -0.17068242, -0.02875313, 0.1643852 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.06075022, 0.02483909, -0.02546543, -0.13722809], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.06025344, -0.16990903, -0.02820999, 0.14731336], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.06365162, -0.3646159 , -0.02526372, 0.4309648 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.07094394, -0.5593712 , -0.01664442, 0.7155777 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.08213136, -0.7542588 , -0.00233287, 1.0029755 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.09721654, -0.9493495 , 0.01772664, 1.2949249 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.11620353, -1.1446922 , 0.04362514, 1.5931041 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.13909738, -0.95011413, 0.07548722, 1.3143365 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.15809965, -1.1461059 , 0.10177395, 1.6296592 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.18102178, -1.342266 , 0.13436714, 1.9522467 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.2078671 , -1.148804 , 0.17341207, 1.7040544 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.23084317, -1.3454461 , 0.20749316, 2.0453217 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.2577521, -1.1529721, 0.2483996, 1.8233697], dtype=float32), 1.0, True, False, {})
# The last step terminates because the pole has fallen, but the reward we get is `1.0`, not `0.0`.
Additional context
Checklist
- I have checked that there is no similar issue in the repo
About this issue
- State: closed
- Created 7 months ago
- Comments: 16 (5 by maintainers)
I think my preferred solution would be to add a sutton_barto: bool = False parameter in __init__, where the reward function either matches Sutton and Barto or uses the previous implementation (defaulting to the previous implementation). This would allow us to keep v1 and just add a parameter to change the reward function for users to test with. Then we just update the documentation to reflect these changes. Does that work for you @Kallinteris-Andreas?
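For reference, a minimal sketch of the intended behaviour, written here as a standalone wrapper rather than the actual __init__ flag (the 0-reward-on-termination choice follows the Sutton & Barto formulation mentioned above; this is an illustration, not the merged implementation):

```python
import gymnasium as gym

class SuttonBartoReward(gym.Wrapper):
    """Illustrative wrapper: return 0 reward on the terminating transition
    instead of the usual +1 (assumed semantics of the proposed flag)."""

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if terminated:
            reward = 0.0  # the "fallen" step no longer gets rewarded
        return obs, reward, terminated, truncated, info

env = SuttonBartoReward(gym.make("CartPole-v1"))
obs, info = env.reset(seed=0)
terminated = truncated = False
while not (terminated or truncated):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
print("reward on the terminal step:", reward)  # 0.0 here, 1.0 in the unwrapped env
```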
I think @Kallinteris-Andreas's suggestion 1 is good, but with reward = 1 for non-terminated and reward = -1 for terminated. This will move the gradients up or down (away from 0) in both cases, which is good.
I would add to it that env.unwrapped.kinematics_integrator = False should be the default, or that this branch should be removed entirely and the physics done correctly; see https://github.com/openai/gym/issues/3254#issuecomment-1879083763. Changing this setting also reduces training time, because we then store the correct x and theta values in the state for training, rather than the previous step's x and theta, which is what it does now by default.
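For context, a rough sketch of the two integration branches being discussed (tau is the simulation timestep; variable names loosely follow the CartPole source, but this is an illustration, not a copy of it):

```python
def explicit_euler(x, x_dot, xacc, tau):
    # default branch: position is advanced with the *old* velocity,
    # so the stored x lags one step behind the applied acceleration
    x = x + tau * x_dot
    x_dot = x_dot + tau * xacc
    return x, x_dot

def semi_implicit_euler(x, x_dot, xacc, tau):
    # update the velocity first, then advance the position with the new velocity
    x_dot = x_dot + tau * xacc
    x = x + tau * x_dot
    return x, x_dot
```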
Additionally, changing the terminated reward to -1 helps reduce training cycles that contribute nothing to learning the policy, which is what happens when the terminated state is also given a reward of 1. Here’s why:
The initial Q-values of a randomly initialized network are near zero.
With the default reward of 1 for everything, even terminal states end up with Q-values that approach 1 over time. All of the initial training pushes both “good” middle-ish states and terminal states up to Q >= 1, and only above 1 do the Q-values start to differentiate and drive any learning.
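A small, hypothetical numeric illustration of that argument, using the standard DQN target y = r + gamma * (1 - done) * max_a Q(s', a) and assuming a freshly initialized network whose Q-values are near zero:

```python
gamma = 0.99
q_next_max = 0.0  # assumption: a randomly initialized network outputs ~0

# current behaviour: reward == 1 for every transition, terminal or not
y_nonterminal = 1.0 + gamma * q_next_max  # ~1.0
y_terminal = 1.0                          # also ~1.0 -> targets are indistinguishable

# proposed: reward == -1 on termination
y_terminal_fixed = -1.0                   # clearly separated from non-terminal targets

print(y_nonterminal, y_terminal, y_terminal_fixed)
```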
I just want to point out that with -100 for failure and -99 for survival, the optimal behavior would be to end the trajectory as soon as possible, and you would get a very different looking graph.