scipy: BUG: handle inconsistencies in factorial functions and their extensions
In the context of #15351 and reviewing #15299, I noticed a bunch of inconsistencies that should be fixed IMO.
TL;DR
Courtesy of @tirthasheshpatel:
- `factorial`, `factorial2`, and `factorialk` behave inconsistently for several different combinations of inputs (for example, `factorial2` returns 0-dim arrays for scalars, and `factorial2` and `factorialk` aren't vectorized). See the first table for more details. This issue can be tackled by adding a new parameter like `legacy` which, by default, preserves the old behaviour but, when false, handles the current inconsistent behaviour correctly.
- Support for float and complex inputs can easily be added as an extension. This enhancement adds a new parameter `extend='zero'|'complex'` (and perhaps `'recurrence'` down the line) that specifies whether to use the extension formula or the default formula. See the second table for more details.
Design Space
There are several dimensions that come into play:
- scalar vs. array inputs
- for integers: approximation or exact computation (`exact=`)
- extensions to negative integers
- extensions to floats
- extensions to complex numbers
- extensions to poles (possible e.g. for `factorial2`)
- for the array case, failing soft (e.g. inserting NaNs) or hard (raising an error)
Current Behaviour (as of 1.8.0)
| Inputs | `factorial(exact=False)` | `factorial(exact=True)` | `factorial2(exact=False)` | `factorial2(exact=True)` | `factorialk(exact=False)` | `factorialk(exact=True)` |
|---|---|---|---|---|---|---|
| Scalar positive integer | float | int | !! 0-dim float array !! | int | `NotImplementedError` | int |
| Scalar negative integer | 0.0 | 0 | !! `array(0.)` !! | 0 | `NotImplementedError` | 0 |
| Scalar positive float | float | `ValueError` (but text misleading) | `TypeError` | `TypeError` | `NotImplementedError` | `TypeError` |
| Scalar negative float | 0.0 | 0 | `TypeError` | `TypeError` | `NotImplementedError` | `TypeError` |
| Scalar complex | `TypeError` | `TypeError` | `TypeError` | `TypeError` | `NotImplementedError` | `TypeError` |
| Scalar NaN | NaN | NaN | `TypeError` | `TypeError` | `NotImplementedError` | `TypeError` |
| Positive integer in array | float | int | float | !! `TypeError` !! | `NotImplementedError` | `TypeError` |
| Negative integer in array | 0.0 | 0 | 0.0 | !! `TypeError` !! | `NotImplementedError` | `TypeError` |
| Positive float in array | float | `ValueError` (but text misleading) | `TypeError` | `TypeError` | `NotImplementedError` | `TypeError` |
| Negative float in array | 0.0 | 0 | `TypeError` | `TypeError` | `NotImplementedError` | `TypeError` |
| Complex in array | `TypeError` | `TypeError` | `TypeError` | `TypeError` | `NotImplementedError` | `TypeError` |
| NaN in array | NaN | NaN | `TypeError` | `TypeError` | `NotImplementedError` | `TypeError` |
| Empty array | `array([])` | `IndexError` | `TypeError` | `TypeError` | `NotImplementedError` | `TypeError` |
Note that:
- `factorial2(<scalar>, exact=False)` returns a zero-dimensional array instead of a scalar (demonstrated in the snippet below).
- All the `TypeError`s are IMO implementation oversights (and, in the cases marked "!!", against documentation) - i.e. we should provide better error messages.
- There is a full generalization of `factorialk`, which is basically just a fraction of two gamma functions; it would work for all complex n & k (up to poles), as well as for arrays.
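To make the first table concrete, here are a few of these behaviours reproduced against scipy 1.8.0 (results as per the table above; exact details may vary by version):

```python
import numpy as np
from scipy.special import factorial, factorial2, factorialk

factorial(5, exact=False)   # 120.0 (float)
factorial(5, exact=True)    # 120 (int)

# the "!!" case: a 0-dim array instead of a scalar
res = factorial2(5, exact=False)
type(res), np.ndim(res)     # (<class 'numpy.ndarray'>, 0)

factorial2(np.array([5, 6]), exact=True)  # TypeError, despite documented array support
factorialk(5, 3, exact=False)             # NotImplementedError
```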
Extension Issues
In addition to this hodgepodge of behaviour, some people would like to have the float/complex extensions of these functions (e.g. https://github.com/scipy/scipy/pull/15349). I tend to agree, but that raises another problem:
Complex extensions are incompatible with current inputs ≤0
First off, everything that's currently 0 on the negative axis gets a different value. But also note the "≤": 0!! is 1 in the regular definition, but sqrt(2/π) ≈ 0.7979 in the complex extension.
Extensions based on recurrence are incompatible with the current status as well as with the complex extension
This came up for example in #15299. It's possible (for example) to invert the recurrence relation n!! = n · (n-2)!!, i.e. (n-2)!! = n!!/n, and so extend the double factorial to negative integers. The resulting values would be neither 0 (the current behaviour) nor the poles of the complex extension; a short worked example follows.
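As a small worked example of that recurrence-based extension (standard textbook values, not scipy output):

```python
from fractions import Fraction

def double_factorial_recurrence(n):
    # Extend n!! below zero by inverting n!! = n * (n-2)!!, i.e. (n-2)!! = n!!/n.
    # Illustrative only -- this is the 'recurrence' extension discussed above.
    if n in (0, 1):
        return Fraction(1)
    if n > 1:
        return n * double_factorial_recurrence(n - 2)
    # n < 0: step upwards; fails (division by zero) for even n,
    # which is exactly where the complex extension has its poles
    return double_factorial_recurrence(n + 2) / (n + 2)

print(double_factorial_recurrence(-1))  # 1   -- neither 0 (current scipy) nor a pole
print(double_factorial_recurrence(-3))  # -1
print(double_factorial_recurrence(-5))  # 1/3
```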
Design Considerations
If we (or at least I) were designing from scratch:
- Changing from `exact=False` to `exact=True` should never fundamentally change the result (e.g. going from 0 to non-zero, or casting inputs to integers as seen in #15351). As such, the `exact` keyword only really matters for positive scalar integers; in all other cases it could be retired, raise, or simply be turned into a no-op (cf. #15359).
- Open question whether it's worth properly implementing `exact=True` for the array case (currently, only `factorial` does).
- Complex extensions (together with non-positive floats) should be behind a keyword, roughly because:
  - 0-for-negative-values is a valid approach, and has been in scipy ~forever.
  - Since the values being produced are incompatible (for certain inputs), the complex extension should be explicitly opt-in.
- Using recurrence relations for extending into poles is an open question; not sure if it's worth it, but if so, probably with another ~~keyword~~ value for the `extend` keyword.
- Use the most general approximation, except when alternatives with better accuracy / performance exist.
- Probably need to think about dtype casting rules for empty arrays (keep dtype? always float? ...)
Possible Future Behaviour (note `extend=`, not `exact=`)

| Inputs | `factorial(extend='zero')` | `factorial(extend='complex')` | `factorial2(extend='zero')` | `factorial2(extend='complex')` | `factorialk(extend='zero')` | `factorialk(extend='complex')` |
|---|---|---|---|---|---|---|
| Scalar positive integer w/ `exact=True` | n! | n! | n!! | n!! | n!(k) | n!(k) |
| Scalar positive integer | Γ(n+1) | Γ(n+1) | (1) | (1) | (2) | (2) |
| Scalar negative integer / poles | 0.0 | NaN | 0.0 | NaN | 0.0 | NaN |
| Scalar positive float | Γ(n+1) | Γ(n+1) | (1) | (1) | (2) | (2) |
| Scalar negative float | 0.0 | Γ(n+1) | 0.0 | (1) | 0.0 | (2) |
| Scalar complex | `ValueError` | Γ(n+1) | `ValueError` | (1) | `ValueError` | (2) |
| Scalar NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Positive integer in array | Γ(n+1) | Γ(n+1) | (1) | (1) | (2) | (2) |
| Negative integer in array | 0.0 | NaN | 0.0 | NaN | 0.0 | NaN |
| Positive float in array | Γ(n+1) | Γ(n+1) | (1) | (1) | (2) | (2) |
| Negative float in array | 0.0 | Γ(n+1) | 0.0 | (1) | 0.0 | (2) |
| Complex in array | `ValueError` | Γ(n+1) | `ValueError` | (1) | `ValueError` | (2) |
| NaN in array | NaN | NaN | NaN | NaN | NaN | NaN |
| Empty array | `array([])` | `array([])` | `array([])` | `array([])` | `array([])` | `array([])` |
The formulas in question are:
- (1): z!! = 2^((z-1)/2) Γ(z/2 + 1) / Γ(1/2 + 1)
- (2): z!(α) = α^((z-1)/α) Γ(z/α + 1) / Γ(1/α + 1)
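For reference, a minimal sketch of (1) and (2) on top of `scipy.special.gamma` (the names `factorial2_ext`/`factorialk_ext` are made up for illustration; this is the generalization, not the current scipy API):

```python
from scipy.special import gamma

def factorial2_ext(z):
    # formula (1): extension of the double factorial
    return 2 ** ((z - 1) / 2) * gamma(z / 2 + 1) / gamma(1 / 2 + 1)

def factorialk_ext(z, k):
    # formula (2): extension of the k-factorial, with alpha = k
    return k ** ((z - 1) / k) * gamma(z / k + 1) / gamma(1 / k + 1)

print(factorial2_ext(7.0))     # 105.0 == 7*5*3*1
print(factorialk_ext(7.0, 3))  # 28.0  == 7*4*1
print(factorial2_ext(0.0))     # ~0.79788 == sqrt(2/pi), not 1 (see "Extension Issues")
```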
I'd very much welcome feedback on this - there are some trade-offs to be made (e.g. whether negative floats should also raise without `extend='complex'`, or whether the errors in the array case should just map to NaN; also, there are different approximations for n!! currently in use), but I think the above would be reasonable. In particular, the scalar and array cases behave the same way.
Implementation Plan
I would propose to introduce another keyword (sorry...) `legacy`, defaulting to `True`, but implement `legacy=False` with all the corrected behaviour. Where there's any planned change in behaviour, we raise a deprecation warning recommending a switch to `legacy=False`. After 2 releases, we switch the default from `legacy=True` to `legacy=False`; after two more releases we raise on `legacy=True`; and after two more we remove the keyword.
Looking more closely at the relevant changes in (currently non-raising) behaviour, it really only affects the zero-dimensional arrays for `factorial2(scalar_val, exact=False)`. In particular, if we let `exact=True` be a no-op for non-integers (the less "noisy" option from a UX point of view), we would keep essentially all current non-raising behaviour intact. A sketch of the transition is below.
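For concreteness, a rough sketch of the proposed `legacy` transition for `factorial2` (all names and wiring here are illustrative, not the actual implementation; corner cases such as complex-input validation and NaN-at-poles from the second table are elided):

```python
import warnings
import numpy as np
from scipy.special import gamma

def factorial2_proposed(n, exact=False, extend='zero', legacy=True):
    # Illustrative sketch only; `exact` handling is elided here.
    if legacy:
        warnings.warn(
            "This behaviour is deprecated; pass legacy=False to opt in "
            "to the corrected semantics.", DeprecationWarning, stacklevel=2)
        raise NotImplementedError("old (1.8.0) code path elided in this sketch")
    n_arr = np.asarray(n, dtype=np.float64)
    # formula (1) from the issue body
    res = 2 ** ((n_arr - 1) / 2) * gamma(n_arr / 2 + 1) / gamma(1 / 2 + 1)
    if extend == 'zero':
        # keep the historical 0-for-negative-values convention
        res = np.where(n_arr < 0, 0.0, res)
    # scalar in -> scalar out; array in (even 0-dim) -> array out
    return res if isinstance(n, np.ndarray) else res[()]
```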
CC @steppi @tirthasheshpatel @j-bowhay @tupui
Edit History:
- v4 (2022-03-10): Make `exact=True` a no-op for non-integers (less noisy), following #15359. Extend the deprecation period of the `legacy` keyword along the same lines as proposed in https://github.com/scipy/scipy/issues/15722.
- v3 (2022-02-16; up to rev 20): Added (current & proposed) behaviour for NaN and empty arrays; good catch @tirthasheshpatel!
- v2 (2022-02-16; up to rev 15): Added TL;DR, clarification on impact of `legacy`, changed `extend` from a boolean argument to an enum.
- v1 (2022-02-16; up to rev 10): Original
About this issue
- State: closed
- Created 2 years ago
- Reactions: 3
- Comments: 32 (32 by maintainers)
Thanks for the incredibly detailed report @h-vetinari 🤯! I suggest adding a TL;DR; something like this if I understand the issue correctly:

> - `factorial`, `factorial2`, and `factorialk` behave inconsistently for several different combinations of inputs (for example, `factorial2` returns 0-dim arrays for scalars, and `factorial2` and `factorialk` aren't vectorized). See the first table for more details. This issue can be tackled by adding a new parameter like `legacy` which, by default, preserves the old behaviour but, when false, handles the current inconsistent behaviour correctly.
> - Support for float and complex inputs can easily be added with a new parameter `extend` that specifies whether to use the extension formula or the default formula. See the second table for more details.

Can this issue be split into two: one for addressing inconsistencies and one for adding the extension?
This is somewhat common in `stats` too. Although it's an inconsistency, such outputs mostly behave like scalars (and also get coerced into one when combined with scalars). I think there are more important things we should tackle first: for example, vectorizing `factorial2` and `factorialk` (these functions throwing for array inputs is very unintuitive).

I agree that raising would be nice, but that would again be a backward-incompatible change. Rather, we can add a keyword, similar to `nan_policy`, that asks the user whether to coerce the inputs to integers, raise, or fill NaNs for invalid cases in array inputs (e.g. `exact_policy='raise' | 'coerce' | 'fill-nan'`). For `factorial`, the default could be `'coerce'`; for `factorial2` and `factorialk`, `'raise'`. A sketch of such a dispatch is below.

This sounds good.
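For concreteness, a hedged sketch of what such an `exact_policy` dispatch might look like (the helper name and the exact semantics are hypothetical, not an existing scipy function):

```python
import numpy as np

def _resolve_exact_inputs(n, exact_policy='raise'):
    # Hypothetical helper illustrating the suggested nan_policy-style keyword.
    n = np.asarray(n, dtype=float)
    invalid = (n < 0) | (n != np.floor(n)) | np.isnan(n)
    if exact_policy == 'raise':
        if invalid.any():
            raise ValueError("exact=True requires non-negative integers")
        return n.astype(np.int64)
    if exact_policy == 'coerce':
        # cast everything to integers (roughly factorial's historical behaviour)
        return np.nan_to_num(np.floor(n), nan=0.0).astype(np.int64)
    if exact_policy == 'fill-nan':
        # leave invalid entries as NaN markers for the caller to propagate
        return np.where(invalid, np.nan, n)
    raise ValueError(f"unknown exact_policy: {exact_policy!r}")
```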
I am not against this, but maybe we can also consider an alternative approach, like adding `exact_policy`?

I consider this issue solved by #15841. For the open points regarding the extensions, I've opened a follow-up issue: https://github.com/scipy/scipy/issues/18409
In almost every piece of software I've used, when we have multiple functions doing similar but different versions of a functionality, 5 years down the line we struggle to deprecate/remove them.

I would have liked this function to be in the NumPy namespace, since it is a very common math operation, or in the standard `math` library, but it is now living in `special` functions territory because reasons.

So at least can we combine them into a single signature, since they are special cases of k=1, 2, and get rid of the others eventually in the future?
Next problem - the "universal" approximation for n!! is very bad for even numbers (the odd ones work well). For example, formula (1) from the OP, calculated to high precision with mpmath, gives 10!! = 3063.8767134830029. However, the true value is 3840, i.e. very different - and the current approximation actually does an extremely good job (down to a single eps, even for n=100). However, the formula picked up by @j-bowhay in #15349 seems to do a much better job also for even integers.
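This is straightforward to reproduce (a minimal check; the precision setting is arbitrary):

```python
import mpmath as mp

mp.mp.dps = 30  # high working precision
z = 10
# formula (1) from the OP
approx = 2 ** ((z - 1) / mp.mpf(2)) * mp.gamma(z / mp.mpf(2) + 1) / mp.gamma(mp.mpf(3) / 2)
print(approx)              # ~3063.8767134830029
print(10 * 8 * 6 * 4 * 2)  # 3840, the exact value of 10!!
```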
While working on this, I just stumbled over `test_factorial_0d_return_type`, and thus https://github.com/scipy/scipy/issues/7400 & https://github.com/scipy/scipy/pull/10954. This shows that the 0-dim array output for `factorial2` was already undesired a long time ago, and was merely overlooked in #10954.

The reason this is relevant is that if we treat it as a bugfix (similar to #10954), then we'd have no deprecations to do for the other points in this issue.

However, I wanted to tag the participants of those old discussions, because I think #10954 went too far in fixing #7400 - I don't think the factorial functions should be changing array inputs (even zero-dimensional ones) to scalars. IMO it should be exclusively: scalar in, scalar out; array in (even 0-dim), array out.

CC @WarrenWeckesser @person142 @rgommers @larsoner
Absolutely! 😃
IMO we should aim basically for all the cases in the first table that don’t raise (except NaN/empty array), and some mixed inputs in the array case (though the negative inputs returning 0 are presumably less interesting).
Agreed; this was my hesitation about making them thin wrappers, as there is a potential loss in performance and accuracy, since some cases have more optimal ways of performing the calculation that don't generalise.
In `benchmarks/benchmarks/special.py` there are currently no benchmarks for any of the factorial functions. Maybe a good first step would be to add some benchmarks? @tirthasheshpatel 🙃
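For illustration, a minimal asv-style sketch of what such a benchmark could look like (the class and parameter choices are made up):

```python
# Hypothetical addition to benchmarks/benchmarks/special.py
import numpy as np
from scipy.special import factorial

class Factorial:
    params = [True, False]
    param_names = ['exact']

    def setup(self, exact):
        # moderate-size integer inputs; sizes are arbitrary
        self.n = np.arange(1, 200)

    def time_factorial_array(self, exact):
        factorial(self.n, exact=exact)

    def time_factorial_scalar(self, exact):
        factorial(150, exact=exact)
```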
Done, thanks!
Fine by me, and effectively adopted with your TL;DR.
Prioritization aside, I think we should clean this up. 0-dim arrays used to be a bit of a wild west thing, but it’s getting cleaned up in lots of APIs (not just numpy/scipy, also other downstreams like cvxpy).
I'm not a fan, to be honest. It has the same high price (another new keyword) for very little gain IMO. In particular, I think `'coerce'` is basically never the right thing, and if so, then people should get the warning/error and explicitly cast.
We'll have to do some bikeshedding on this. I proposed `extend=True|False` (more or less as a stand-in, because `extend_complex` seemed too long), but if we want to leave the recurrence thing on the table, we could actually do `extend='zero'|'complex'|'recurrence'` (with `'zero'` being the default and `'recurrence'` potentially being added later), which to me looks even better than boolean switches (even if `'recurrence'` never materializes).

I understand that it's not attractive, but I had a closer look and added this to the OP:
In other words, that's the one behavioural change that we cannot avoid anyway, and using it for that aspect is IMO a reasonable transition (and only one affected function). We could still deprecate certain combinations that are currently non-raising, also with IMO acceptably little impact.