ray: [rllib] MeanStdFilter value problem during compute action

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • Ray installed from (source or binary): source
  • Ray version: 0.5.0
  • Python version: 3.6.1
  • Exact command to reproduce:

Describe the problem

When I run trained agents for evaluation, they perform noticeably worse than the rewards I monitored during training would suggest.

I also found that after I restore an agent, the filter (MeanStdFilter) holds strange values. Have you ever seen a problem like this?

Source code / logs

agent = cls(env="prosthetics", config=config)
agent.restore(checkpoint_path)
print(agent.local_evaluator.filters)
>>> {'default': MeanStdFilter((158,), True, True, None, (n=128033777, mean_mean=-1.0975956375310988e+181, mean_std=inf), (n=0, mean_mean=0.0, mean_std=0.0))}

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 22 (3 by maintainers)

Most upvoted comments

@whikwon @ericl Thanks for all the guidance. Here is what I’ve found. The environment is Pendulum-v0. In agent.compute_action, I log the value of obs before and after the filter. In each pair below, the first line is the raw observation and the second is the filtered one. The filtered values have magnitudes around 1e7 to 1e8, far outside the range of the raw observations.

[-0.87094525  0.49138007 -2.64594034]
[-8.70945248e+07  4.91380071e+07 -2.64594034e+08]
 
[-0.81813912  0.57502033 -1.97910756]
[-8.18139124e+07  5.75020325e+07 -1.97910756e+08]
 
[-0.78058041  0.62505538 -1.25146981]
[-7.80580405e+07  6.25055383e+07 -1.25146981e+08]
 
[-0.74561678  0.66637499 -1.08267827]
[-7.45616776e+07  6.66374987e+07 -1.08267827e+08]
 
[-0.71548291  0.69863024 -0.88289703]
[-71548290.6009737   69863023.92595541 -88289703.02494954]
 
[-0.69208157  0.7218193  -0.65892435]
[-69208156.94613262  72181929.95562999 -65892435.08048297]
 
[-0.67686169  0.73611021 -0.41755988]
[-67686169.49385323  73611021.31643993 -41755987.61376046]
 
[-0.67074812  0.74168521 -0.16547722]
[-67074812.32289593  74168521.30013305 -16547721.62643052]
 
[-0.67422883  0.73852251  0.09405967]
[-67422882.58069217  73852250.50403133   9405966.79778121]
 
[-0.68740092  0.72627817  0.35968684]
[-68740091.88581493  72627816.76141532  35968683.70469721]

So it seems like the filter is not applied correctly in the code below:

 filtered_obs = self.local_evaluator.filters[policy_id](
    observation, update=False) 
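
One possible reading of the log above, offered as an assumption rather than a confirmed diagnosis: each filtered vector is almost exactly the raw vector times 1e8. That is what a normalization of the form (x - mean) / (std + 1e-8) produces when the running statistics are still empty (mean and std both zero), for example if the filter state was not restored or synced. The `normalize` helper below is hypothetical and is not copied from the RLlib source; it only checks the arithmetic against the first logged pair.

```python
import numpy as np


def normalize(x, mean, std, eps=1e-8):
    # Assumed normalization form (hypothetical helper, not the RLlib code).
    return (x - mean) / (std + eps)


# First raw observation from the log above.
obs = np.array([-0.87094525, 0.49138007, -2.64594034])

# Statistics of a filter that has never seen data: mean = 0, std = 0.
empty_mean = np.zeros_like(obs)
empty_std = np.zeros_like(obs)

filtered = normalize(obs, empty_mean, empty_std)
print(filtered)

# With zero mean and zero std, the result is simply obs / eps.
print(np.allclose(filtered, obs * 1e8))
```

If this matches, it would be worth printing the filter state inside compute_action as well, to see whether n is 0 at that point.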

You can probably add a print() to determine what update causes it to reach an inf value.
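
A minimal sketch of that kind of check, assuming a Welford-style running mean and variance. `CheckedRunningStat` is a hypothetical stand-in, not the RLlib class; the idea is only to fail loudly on the first update that makes the statistics non-finite, so the offending observation can be inspected.

```python
import numpy as np


class CheckedRunningStat:
    """Running mean/std (Welford) that raises on the first non-finite update."""

    def __init__(self, shape):
        self.n = 0
        self.mean = np.zeros(shape, dtype=np.float64)
        self.m2 = np.zeros(shape, dtype=np.float64)  # sum of squared deviations

    def push(self, x):
        x = np.asarray(x, dtype=np.float64)
        if not np.all(np.isfinite(x)):
            raise ValueError(f"non-finite observation at update {self.n + 1}: {x}")
        # Silence NumPy overflow warnings; the explicit check below reports them.
        with np.errstate(over="ignore", invalid="ignore"):
            self.n += 1
            delta = x - self.mean
            self.mean = self.mean + delta / self.n
            self.m2 = self.m2 + delta * (x - self.mean)
        if not (np.all(np.isfinite(self.mean)) and np.all(np.isfinite(self.m2))):
            raise FloatingPointError(
                f"statistics became non-finite at update {self.n}, input was {x}"
            )

    @property
    def std(self):
        # Sample standard deviation; zero until at least two samples are seen.
        if self.n < 2:
            return np.zeros_like(self.mean)
        return np.sqrt(self.m2 / (self.n - 1))


stat = CheckedRunningStat((3,))
stat.push([-0.87094525, 0.49138007, -2.64594034])
stat.push([-0.81813912, 0.57502033, -1.97910756])
print(stat.n, stat.mean, stat.std)
```

Pushing an observation that contains inf or nan raises immediately, which narrows the search to the environment step that produced it.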