scipy: stats.mode's nan_policy 'propagate' not working?
I am not sure whether my issue is a bug or a mis-understanding of how nan_policy='propagate'
setting in scipy.stats.mode()
is supposed to work. Specifically, in the documentation it says:
[
nan_policy
] Defines how to handle when input containsnan
.'propagate'
returnsnan
…
However, I do not get nan
. Why?
Reproducing code example:
>>> import numpy as np
>>> from scipy import stats
>>> stats.mode([1., 1., 2., 3., 3., 3., np.nan], nan_policy='propagate').mode
array([3.])
>>> stats.mode([np.nan, np.nan, np.nan, 1., 1., 2., 3., 3., 3., np.nan],
... nan_policy='propagate').mode
array([3.])
Scipy/Numpy/Python version information:
>>> import sys, scipy, numpy
>>> print(scipy.__version__, numpy.__version__, sys.version_info)
1.2.0 1.15.4 sys.version_info(major=3, minor=6, micro=8, releaselevel='final', serial=0)
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 27 (18 by maintainers)
@chrisb83 wrote
I think that might be how the behavior of
'propagate'
was originally envisioned, but my impression is that experience withnan_policy
is showing that approach to be a mistake. These days, I suggest that when we addnan_policy
to a function, we think carefully about hownan
should be propagated (just like folks are doing in this issue!), and then implement that behavior. Otherwise,nan
handling tends to be implementation-dependent, and the behavior could change if the implementation changes. (See, for example, the surprising effect of propagatingnan
inndimage.uniform_filter
reported in https://github.com/scipy/scipy/issues/7818. We don’t havenan_policy
implemented there, but if we did, I think we would still consider the existing behavior of that function undesired.) In many cases, we can probably just let thenan
s pass through to the regular calculation, so a special implementation is not required, and the only extra work involved is some up-front thinking about howpropagate
should behave, and then documenting it and providing some unit tests for it.Another issue to consider is that there is not just one bit pattern that corresponds to
nan
. In the IEEE floating point representation ofnan
, the exponent bits are all one, but the sign bit can be 0 or 1, and the significand can have any nonzero value. So there are many bit patterns that are interpreted asnan
. Shouldmode
treatnan
s with different bit patterns as distinct values? I don’t have a strong opinion about this, and I suspect it will rarely come up in practice. But whatever the preferred behavior, it should be specified in the docstring and tested in the unit tests.Yeah, the current implementation is odd.
mode
is a bit trickier than normal. For most functions it’s easy to see what “propagating nans” means. Here the end result is implementation-specific, which isn’t really desired. I thinkstats.mode([2, np.nan, 1, np.nan])
should returnnan
, not1
. If we had implemented it by counting nans withnp.isnan(a).sum()
, then the result would be (to me) more intuitive.@mcara Yes, you’re correct, the “However” should probably have been left out as it is confusing. I wanted to highlight what usually happens.
The documentation should make this explicit. A “natural” expectation may be different from person to person. e.g.
returns nan
could be interpreted as:Option 3 makes the most sense to me, but, if this is what should actually be happening, it should be explicitly stated in the documentation.
I can’t either. And is definitely not what should occur according to the (current) documentation. IF we define propagate as “propagate means keep NaN and see what happens” this behaviour isn’t wrong, only unusual.
I was hoping #9947 would lead to some discussion about this. Based on the discussion here I agree with @chrisb83. The behaviour of
mode
is fine, BUT, the documentation should reflect the behaviour (as is happening in #9948).@mcara sorry, I don’t think I came across as clearly as I could have.
From the documentation of propagate:
https://github.com/scipy/scipy/blob/be3bcf99042f46344007621b006ff90258fed914/scipy/stats/stats.py#L408-L409
to me
'propagate' returns nan
impliesnan_policy='propagate'
should return, well,nan
. Yes, even when onenan
is in the input list. I’m not saying this is the best way, but what the documentation says should happen. This also seems to be the behaviour you were expecting in the original post (“However, I do not getnan
. Why?”).As @chrisb83 says above:
An example of the usual behaviour is
sem
(standard error of the mean) where, in any axis effected by anan
value, anan
value is returned:However, as you have pointed out,
mode
does not follow this pattern.mode
callsnp.unique
which (as shown above) treats allnan
values as a unique element.…which is not what you would expect based on the documented behaviour (return
nan
).As @chrisb83 said the behaviour seems reasonable (given his definition of propagate) but does not align with what is documented. The
nan
value is still propagated, but the behaviour is unusual. Enough so to warrant further explanation/examples.You are right, I did not see the second example.
It seems to be related to the nan-handling of
np.unique
, i.e. nan values are not grouped: