scipy: stats.mode's nan_policy 'propagate' not working?

I am not sure whether my issue is a bug or a mis-understanding of how nan_policy='propagate' setting in scipy.stats.mode() is supposed to work. Specifically, in the documentation it says:

[nan_policy] Defines how to handle when input contains nan. 'propagate' returns nan

However, I do not get nan. Why?

Reproducing code example:

>>> import numpy as np
>>> from scipy import stats
>>> stats.mode([1., 1., 2., 3., 3., 3., np.nan], nan_policy='propagate').mode
array([3.])
>>> stats.mode([np.nan, np.nan, np.nan, 1., 1., 2., 3., 3., 3., np.nan],
...            nan_policy='propagate').mode
array([3.])

Scipy/Numpy/Python version information:

>>> import sys, scipy, numpy
>>> print(scipy.__version__, numpy.__version__, sys.version_info)
1.2.0 1.15.4 sys.version_info(major=3, minor=6, micro=8, releaselevel='final', serial=0)

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 27 (18 by maintainers)

Most upvoted comments

@chrisb83 wrote

for me, propagate means to keep NaN values, then perform the calculation and see what happens.

I think that might be how the behavior of 'propagate' was originally envisioned, but my impression is that experience with nan_policy is showing that approach to be a mistake. These days, I suggest that when we add nan_policy to a function, we think carefully about how nan should be propagated (just like folks are doing in this issue!), and then implement that behavior. Otherwise, nan handling tends to be implementation-dependent, and the behavior could change if the implementation changes. (See, for example, the surprising effect of propagating nan in ndimage.uniform_filter reported in https://github.com/scipy/scipy/issues/7818. We don’t have nan_policy implemented there, but if we did, I think we would still consider the existing behavior of that function undesired.) In many cases, we can probably just let the nans pass through to the regular calculation, so a special implementation is not required, and the only extra work involved is some up-front thinking about how propagate should behave, and then documenting it and providing some unit tests for it.

Another issue to consider is that there is not just one bit pattern that corresponds to nan. In the IEEE floating point representation of nan, the exponent bits are all one, but the sign bit can be 0 or 1, and the significand can have any nonzero value. So there are many bit patterns that are interpreted as nan. Should mode treat nans with different bit patterns as distinct values? I don’t have a strong opinion about this, and I suspect it will rarely come up in practice. But whatever the preferred behavior, it should be specified in the docstring and tested in the unit tests.

Returning to the issue of mode(), I can’t see when the mode would return a nan

I can’t either. And is definitely not what should occur according to the (current) documentation. IF we define propagate as “propagate means keep NaN and see what happens” this behaviour isn’t wrong, only unusual.

Yeah, the current implementation is odd. mode is a bit trickier than normal. For most functions it’s easy to see what “propagating nans” means. Here the end result is implementation-specific, which isn’t really desired. I think stats.mode([2, np.nan, 1, np.nan]) should return nan, not 1. If we had implemented it by counting nans with np.isnan(a).sum(), then the result would be (to me) more intuitive.

I guess I am confused about “However” because it sounds as if sem() is a counter-example of “should return nan if any nan values occur”.

@mcara Yes, you’re correct, the “However” should probably have been left out as it is confusing. I wanted to highlight what usually happens.

When a computation is performed along an axis, IMO it is “natural” to expect that computations along other axes are independent of one another.

The documentation should make this explicit. A “natural” expectation may be different from person to person. e.g. returns nan could be interpreted as:

  1. Returns a single nan value.
  2. Returns an array of nans with the expected output’s shape.
  3. Returns an array with nans only where the calculation is affected.

Option 3 makes the most sense to me, but, if this is what should actually be happening, it should be explicitly stated in the documentation.

Returning to the issue of mode(), I can’t see when the mode would return a nan

I can’t either. And is definitely not what should occur according to the (current) documentation. IF we define propagate as “propagate means keep NaN and see what happens” this behaviour isn’t wrong, only unusual.

This leads to the question: what is the meaning of nan_policy=‘propagate’ when applied to mode()?

I was hoping #9947 would lead to some discussion about this. Based on the discussion here I agree with @chrisb83. The behaviour of mode is fine, BUT, the documentation should reflect the behaviour (as is happening in #9948).

@mcara sorry, I don’t think I came across as clearly as I could have.

It is not clear what documentation says but it is definitely not that mode() would return nan

From the documentation of propagate:

https://github.com/scipy/scipy/blob/be3bcf99042f46344007621b006ff90258fed914/scipy/stats/stats.py#L408-L409

to me 'propagate' returns nan implies nan_policy='propagate' should return, well, nan. Yes, even when one nan is in the input list. I’m not saying this is the best way, but what the documentation says should happen. This also seems to be the behaviour you were expecting in the original post (“However, I do not get nan. Why?”).

Possibly my question is about the original intention for the 'propagate' option.

As @chrisb83 says above:

propagate means keep NaN and see what happens

An example of the usual behaviour is sem (standard error of the mean) where, in any axis effected by a nan value, a nan value is returned:

>>> import numpy as np
>>> from scipy.stats import sem
>>> a = np.arange(16, dtype=np.float64).reshape((4, 4))
>>> sem(a) #sem with no nan values
array([2.5819889, 2.5819889, 2.5819889, 2.5819889])
>>> a[1, 1] = np.nan
>>> sem(a, nan_policy='propagate')
array([2.5819889,       nan, 2.5819889, 2.5819889])

However, as you have pointed out, mode does not follow this pattern. mode calls np.unique which (as shown above) treats all nan values as a unique element.

>>> import numpy as np
>>> from scipy.stats import mode
>>> a = np.array([[1, 1, 2, 3], [1, 2, 2, np.nan], [1, 1, 1, np.nan]])
>>> a
array([[ 1.,  1.,  2.,  3.],
       [ 1.,  2.,  2.,  nan.],
       [ 1.,  1.,  1., nan]])
>>> mode(a, nan_policy='propagate')
ModeResult(mode=array([[1., 1., 2., 3.]]), count=array([[3, 2, 2, 1]]))

…which is not what you would expect based on the documented behaviour (return nan).

2. What would be the right thing to do?

As @chrisb83 said the behaviour seems reasonable (given his definition of propagate) but does not align with what is documented. The nan value is still propagated, but the behaviour is unusual. Enough so to warrant further explanation/examples.

You are right, I did not see the second example.

It seems to be related to the nan-handling of np.unique, i.e. nan values are not grouped:

a = [np.nan, np.nan, np.nan, 1., 1., 2., 3., 3., 3., np.nan]
vals, cnts = np.unique(a, return_counts=True)
vals # array([ 1.,  2.,  3., nan, nan, nan, nan])
cnts # array([2, 1, 3, 1, 1, 1, 1], dtype=int64)