scipy: BUG: handle inconsistencies in factorial functions and their extensions

In the context of #15351 and reviewing #15299, I noticed a bunch of inconsistencies that should be fixed IMO.

TL;DR

Courtesy of @tirthasheshpatel:

  • factorial, factorial2, and factorialk behave inconsistently for several different combinations of inputs (for example, factorial2 returns 0-dim arrays for scalars, and factorial2 and factorialk aren’t vectorized). See the first table for more details. This issue can be tackled by adding a new parameter like legacy which, by default, preserves the old behaviour but, when false, handles the currently inconsistent behaviour correctly.
  • Support for float and complex inputs can easily be added using this extension. This enhancement adds a new parameter extend='zero'|'complex' (and perhaps 'recurrence' down the line) that specifies whether to use the extension formula or the default formula. See the second table for more details.

Design Space

There are several dimensions that come into play:

  • scalar vs. array inputs
  • for integers: approximation or exact computation (exact=)
  • extensions to negative integers
  • extensions to floats
  • extensions to complex numbers
  • extensions to poles (possible e.g. for factorial2)
  • for the array case, failing softly (e.g. inserting NaNs) or hard (raising an error)

Current Behaviour (as of 1.8.0)

| Inputs | factorial(exact=False) | factorial(exact=True) | factorial2(exact=False) | factorial2(exact=True) | factorialk(exact=False) | factorialk(exact=True) |
|---|---|---|---|---|---|---|
| Scalar positive integer | float | int | !! 0-dim float array !! | int | NotImplementedError | int |
| Scalar negative integer | 0.0 | 0 | !! array(0.) !! | 0 | NotImplementedError | 0 |
| Scalar positive float | float | ValueError (but text misleading) | TypeError | TypeError | NotImplementedError | TypeError |
| Scalar negative float | 0.0 | 0 | TypeError | TypeError | NotImplementedError | TypeError |
| Scalar complex | TypeError | TypeError | TypeError | TypeError | NotImplementedError | TypeError |
| Scalar NaN | NaN | NaN | TypeError | TypeError | NotImplementedError | TypeError |
| Positive integer in array | float | int | float | !! TypeError !! | NotImplementedError | TypeError |
| Negative integer in array | 0.0 | 0 | 0.0 | !! TypeError !! | NotImplementedError | TypeError |
| Positive float in array | float | ValueError (but text misleading) | TypeError | TypeError | NotImplementedError | TypeError |
| Negative float in array | 0.0 | 0 | TypeError | TypeError | NotImplementedError | TypeError |
| Complex in array | TypeError | TypeError | TypeError | TypeError | NotImplementedError | TypeError |
| NaN in array | NaN | NaN | TypeError | TypeError | NotImplementedError | TypeError |
| Empty array | array([]) | IndexError | TypeError | TypeError | NotImplementedError | TypeError |

Note that:

  • factorial2(<scalar>, exact=False) returns a zero-dimensional array instead of a scalar.
  • All the TypeErrors are IMO implementation oversights (and, in the case of “!!”, against documentation) - i.e. we should provide better error messages.
  • There is a full generalization of factorialk which is basically just a ratio of two gamma functions; it would work for all complex n & k (up to poles), as well as for arrays. Some of these points are illustrated in the snippet below.
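
To make a few of these points concrete, here’s a small snippet (behaviour as observed with SciPy 1.8.0; exact outputs may differ between versions):

```python
# Behaviour as observed with SciPy 1.8.0
import numpy as np
from scipy.special import factorial, factorial2, factorialk

print(factorial(5))                 # 120.0 -- a plain scalar
print(factorial2(5))                # array(15.) -- a 0-dim array instead of a scalar
print(factorial(np.array([3, 4])))  # array([ 6., 24.]) -- vectorized
# factorial2(np.array([3, 4]), exact=True)  # raises TypeError
# factorialk(5, 3, exact=False)             # raises NotImplementedError
```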

Extension Issues

In addition to this hodgepodge of behaviour, some people would like to have the float/complex extensions of these functions (e.g. https://github.com/scipy/scipy/pull/15349). I tend to agree, but that raises another problem:

Complex extensions are incompatible with current inputs ≤0

First off, everything that currently yields 0 on the negative axis would get a different value. But also note the “≤”: namely, 0!! is 1 in the regular definition, but ≈0.79 = sqrt(2/π) in the extension.
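
A quick numerical check of that value, using the gamma-function form of formula (1) given further below:

```python
# 0!! under the complex extension (formula (1) below) vs. the conventional 0!! = 1
import math
from scipy.special import gamma

z = 0
ext = 2**((z - 1) / 2) * gamma(z / 2 + 1) / gamma(1 / 2 + 1)
print(ext)                     # 0.7978845608028654
print(math.sqrt(2 / math.pi))  # same value: sqrt(2/pi)
```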

Extensions based on recurrence are incompatible with current status as well as complex extension

This came up for example in #15299. It’s possible to invert the recurrence relation n!! = n * (n-2)!! into (n-2)!! = n!! / n, and so extend the double factorial to negative integers. The resulting values would be neither 0 (the current behaviour) nor the poles of the complex extension.
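
For illustration, a tiny sketch of that recurrence-based extension (the helper is hypothetical, not a SciPy API); e.g. n = 1 gives (-1)!! = 1, and n = -1 gives (-3)!! = -1:

```python
# Hypothetical sketch: extending n!! to negative odd integers via the
# inverted recurrence (n-2)!! = n!! / n. Not a SciPy API.
def double_factorial_recurrence(n):
    if n in (0, 1):
        return 1.0
    if n > 1:
        return n * double_factorial_recurrence(n - 2)
    # n < 0: step the recurrence backwards; negative even n eventually
    # divides by zero, matching the pole in the complex extension
    return double_factorial_recurrence(n + 2) / (n + 2)

print(double_factorial_recurrence(-1))  # 1.0
print(double_factorial_recurrence(-3))  # -1.0 (current SciPy returns 0 here)
```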

Design Considerations

If we (or at least I) were designing from scratch:

  • Changing from exact=False to exact=True should never fundamentally change the result (e.g. going from 0 to non-zero, or casting inputs to integers as seen in #15351).
    • As such, the exact keyword only really matters for positive integer scalars; in all other cases, it could be retired, raise, or simply be turned into a no-op (cf. #15359).
    • It’s an open question whether it’s worth properly implementing exact=True for the array case (currently, only factorial does).
  • Complex extensions (together with non-positive floats) should be behind a keyword, roughly because:
    • 0-for-negative-values is a valid approach, and has been in scipy ~forever.
    • Since the values being produced are incompatible (for certain inputs), the complex extension should be explicitly opt-in.
  • Using recurrence relations for extending into the poles is an open question; not sure if it’s worth it, but if so, probably with another value for the extend keyword (rather than another keyword).
  • Use the most general approximation, except when alternatives with better accuracy / performance exist.
  • Probably need to think about dtype casting rules for empty arrays (keep dtype? always float? …)

Possible Future Behaviour (note extend=, not exact=)

| Inputs | factorial(extend='zero') | factorial(extend='complex') | factorial2(extend='zero') | factorial2(extend='complex') | factorialk(extend='zero') | factorialk(extend='complex') |
|---|---|---|---|---|---|---|
| Scalar positive integer w/ exact=True | n! | n! | n!! | n!! | n!(k) | n!(k) |
| Scalar positive integer | Γ(n+1) | Γ(n+1) | (1) | (1) | (2) | (2) |
| Scalar negative integer / poles | 0.0 | NaN | 0.0 | NaN | 0.0 | NaN |
| Scalar positive float | Γ(n+1) | Γ(n+1) | (1) | (1) | (2) | (2) |
| Scalar negative float | 0.0 | Γ(n+1) | 0.0 | (1) | 0.0 | (2) |
| Scalar complex | ValueError | Γ(n+1) | ValueError | (1) | ValueError | (2) |
| Scalar NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Positive integer in array | Γ(n+1) | Γ(n+1) | (1) | (1) | (2) | (2) |
| Negative integer in array | 0.0 | NaN | 0.0 | NaN | 0.0 | NaN |
| Positive float in array | Γ(n+1) | Γ(n+1) | (1) | (1) | (2) | (2) |
| Negative float in array | 0.0 | Γ(n+1) | 0.0 | (1) | 0.0 | (2) |
| Complex in array | ValueError | Γ(n+1) | ValueError | (1) | ValueError | (2) |
| NaN in array | NaN | NaN | NaN | NaN | NaN | NaN |
| Empty array | array([]) | array([]) | array([]) | array([]) | array([]) | array([]) |

The formulas in question are:

  • (1): z!! = 2^((z-1)/2) Γ(z/2 + 1) / Γ(1/2 + 1)
  • (2): z!(α) = α^((z-1)/α) Γ(z/α + 1) / Γ(1/α + 1)
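
A minimal sketch of these two formulas in terms of scipy.special.gamma (function names are hypothetical, not the proposed SciPy signatures; α is written as k):

```python
# Formulas (1) and (2) via scipy.special.gamma; names are hypothetical.
from scipy.special import gamma

def factorial2_ext(z):
    """(1): z!! = 2^((z-1)/2) * Gamma(z/2 + 1) / Gamma(1/2 + 1)"""
    return 2**((z - 1) / 2) * gamma(z / 2 + 1) / gamma(1 / 2 + 1)

def factorialk_ext(z, k):
    """(2): z!(k) = k^((z-1)/k) * Gamma(z/k + 1) / Gamma(1/k + 1)"""
    return k**((z - 1) / k) * gamma(z / k + 1) / gamma(1 / k + 1)

print(factorial2_ext(7.0))     # 105.0 -- matches 7!! = 7 * 5 * 3 * 1
print(factorialk_ext(7.0, 3))  # 28.0  -- matches 7!(3) = 7 * 4 * 1
```

(As discussed further below, (1) reproduces n!! for odd n but not for even n.)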

I’d very much welcome feedback on this - there are some trade-offs to be made (e.g. whether negative floats should also raise without extend='complex', or whether the errors in the array case should just map to NaN; also, there are different approximations for n!! in use currently), but I think the above would be reasonable. In particular, the scalar and array cases behave the same way.

Implementation Plan

I would propose to introduce another keyword (sorry…) legacy, defaulting to True, but implement legacy=False with all the corrected behaviour. Where there’s any planned change in behaviour, we raise a deprecation warning recommending to switch to legacy=False. After two releases, we switch the default from legacy=True to legacy=False, after two more releases we raise on legacy=True, and after two more we remove the keyword.

Looking more closely at relevant changes in (currently non-raising) behaviour, it really only affects the zero-dimensional arrays for factorial2(scalar_val, exact=False). In particular, if we let exact=True be a no-op for non-integers (the less “noisy” option from a UX point of view), we would keep essentially all current non-raising behaviour intact.
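
A rough sketch of what that transition could look like for factorial2 (entirely hypothetical; the names, warning text, and detection logic are illustrative only):

```python
# Hypothetical sketch of the legacy-keyword transition for factorial2.
import warnings
import numpy as np
from scipy import special

def factorial2_sketch(n, exact=False, legacy=True):
    # The one planned change in non-raising behaviour: scalar input with
    # exact=False currently yields a 0-dim array instead of a scalar.
    if legacy and np.ndim(n) == 0 and not exact:
        warnings.warn(
            "factorial2 will return a scalar instead of a 0-dim array for "
            "scalar inputs; pass legacy=False to opt in to the new behaviour",
            DeprecationWarning, stacklevel=2)
    res = special.factorial2(n, exact=exact)
    # New behaviour: unwrap 0-dim arrays into true scalars
    if not legacy and isinstance(res, np.ndarray) and res.ndim == 0:
        return res[()]
    return res
```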

CC @steppi @tirthasheshpatel @j-bowhay @tupui

Edit History:

  • v4 (2022-03-10): Make exact=True a no-op for non-integers (less noisy), following #15359. Extend deprecation period of legacy-keyword along the same lines as proposed in https://github.com/scipy/scipy/issues/15722.
  • v3 (2022-02-16; up to rev 20): Added (current & proposed) behaviour for NaN and empty arrays; good catch @tirthasheshpatel!
  • v2 (2022-02-16; up to rev 15): Added TL;DR, clarification on impact of legacy, changed extend from boolean argument to take an enum.
  • v1 (2022-02-16; up to rev 10): Original

Most upvoted comments

Thanks for the incredibly detailed report @h-vetinari 🤯! I suggest adding a TLDR; something like this if I understand the issue correctly:

  1. factorial, factorial2, and factorialk behave inconsistently for several different combinations of inputs (for example, factorial2 returns 0-dim arrays for scalars, and factorial2 and factorialk aren’t vectorized). See the first table for more details. This issue can be tackled by adding a new parameter like legacy which, by default, preserves the old behaviour but, when false, handles the currently inconsistent behaviour correctly.
  2. Support for float and complex inputs can easily be added using this extension. This enhancement adds a new boolean parameter extend that specifies whether to use the extension formula or the default formula. See the second table for more details.

Can this issue be split into two: one for addressing inconsistencies and one for adding the extension?

  • factorial2(<scalar>, exact=False) returns a zero-dimensional array instead of a scalar.

This is somewhat common in stats too. Although it’s an inconsistency, such outputs mostly behave like scalars (and also get coerced into one if operated with scalars). I think there are more important things we should tackle first: for example, vectorizing factorial2 and factorialk (these functions throwing for array inputs is very unintuitive).

  • As such, the exact keyword only really matters for positive integer scalars; in all other cases, it could be retired or raise.

I agree that raising would be nice but that would again be a backward-incompatible change. Rather, we can add a keyword, similar to nan_policy, that asks the user whether to coerce the inputs to integers, raise, or fill NaNs for invalid cases in array inputs (e.g. exact_policy='raise' | 'coerce' | 'fill-nan'). For factorial, the default could be 'coerce'. For factorial2 and factorialk, raise.

  • Complex extensions (together with non-positive floats) should be behind a keyword

This sounds good.

I would propose to introduce another keyword (sorry…) legacy, defaulting to True, but implement legacy=False with all the corrected behaviour. Where there’s any planned change in behaviour, we raise a deprecation warning recommending to switch to legacy=False. After two releases, we switch the default from legacy=True to legacy=False, and after two more releases we remove the keyword.

I am not against this but maybe we can also consider an alternative approach like adding exact_policy?

I consider this issue solved by #15841. For the open points regarding the extensions, I’ve opened a follow-up issue: https://github.com/scipy/scipy/issues/18409

In almost every piece of software I’ve used, if we have multiple functions doing similar but different versions of a functionality, 5 years down the line we struggle to deprecate/remove them.

I would have liked this function to be in the NumPy namespace, since it is a very common math operation, or in the standard math library, but it is now living in special-functions territory because reasons.

So, at the very least, can we combine them into a single signature, since they are special cases of k=1, 2, and get rid of the others eventually in the future?

Next problem - the “universal” approximation for n!! is very bad for even numbers (the odd ones work well). For example, formula (1) from the OP, calculated to high precision with mpmath, gives: 10!! = 3063.8767134830029.

However, the true value is very different:

>>> from scipy.special import factorial2
>>> factorial2(10)
3840.

and the current approximation actually does an extremely good job (down to a single eps even for n=100)

>>> factorial2(10) / float(factorial2(10, exact=True)) - 1
0.0
>>> factorial2(100) / float(factorial2(100, exact=True)) - 1
2.220446049250313e-16
>>> float(factorial2(100, exact=True))
3.4243224702511973e+79
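
For reference, evaluating formula (1) directly reproduces the value quoted above:

```python
# Formula (1) at z=10: 2^((10-1)/2) * Gamma(6) / Gamma(1.5)
from scipy.special import gamma
print(2**(9 / 2) * gamma(6) / gamma(1.5))  # ~3063.8767, far from the true 3840.0
```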

However, the formula picked up by @j-bowhay in #15349 seems to do a much better job also for even integers.

While working on this, I just stumbled over test_factorial_0d_return_type, and thus https://github.com/scipy/scipy/issues/7400 & https://github.com/scipy/scipy/pull/10954. This shows that the 0d array-output for factorial2 was already undesired a long time ago, and was merely overlooked in #10954.

The reason this is relevant is that if we treat it as a bugfix (similar to #10954), then we’d have no deprecations to do for the other points in this issue.

However, I wanted to tag the participants of those old discussions, because I think #10954 went too far in fixing #7400 - I don’t think the factorial functions should be changing array inputs (even zero-dimensional ones) to scalars. IMO it should be exclusively:

  • scalar->scalar
  • array->array (with identical dimensions).

CC @WarrenWeckesser @person142 @rgommers @larsoner

Maybe a good first step would be to add some benchmarks?

Absolutely! 😃

IMO we should basically aim for all the cases in the first table that don’t raise (except NaN/empty array), plus some mixed inputs in the array case (though the negative inputs returning 0 are presumably less interesting).

But to do that, we need some benchmarks against the current most important code paths (haven’t verified current status of factorial in benchmarks/), and ideally some as-tight-as-possible accuracy tests.

Agreed, this was my hesitation in making them thin wrappers as there is a potential loss in performance and accuracy as some cases have more optimal ways of performing the calculation that don’t generalise.

In benchmarks/benchmarks/special.py there are currently no benchmarks for any of the factorial stuff. Maybe a good first step would be to add some benchmarks?
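
A possible starting point (hypothetical; just following the usual asv conventions of that file):

```python
# Hypothetical asv benchmark sketch for benchmarks/benchmarks/special.py
import numpy as np
from scipy.special import factorial

class Factorial:
    params = [10, 100, 1000]
    param_names = ['n']

    def setup(self, n):
        self.values = np.arange(1, n + 1)

    def time_factorial_approx(self, n):
        factorial(self.values, exact=False)

    def time_factorial_exact(self, n):
        factorial(self.values, exact=True)
```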

@tirthasheshpatel

Thanks for the incredibly detailed report @h-vetinari 🤯!

🙃

I suggest adding a TLDR; something like this if I understand the issue correctly:

Done, thanks!

Can this issue be split into two: one for addressing inconsistencies and one for adding the extension?

Fine by me, and effectively adopted with your TL;DR.

  • factorial2(<scalar>, exact=False) returns a zero-dimensional array instead of a scalar.

This is somewhat common in stats too. Although it’s an inconsistency, such outputs mostly behave like scalars (and also get coerced into one if operated with scalars). I think there are more important things we should tackle first: for example, vectorizing factorial2 and factorialk (these functions throwing for array inputs is very unintuitive).

Prioritization aside, I think we should clean this up. 0-dim arrays used to be a bit of a wild west thing, but it’s getting cleaned up in lots of APIs (not just numpy/scipy, also other downstreams like cvxpy).

  • As such, the exact keyword only really matters for positive integer scalars; in all other cases, it could be retired or raise.

I agree that raising would be nice but that would again be a backward-incompatible change. Rather, we can add a keyword, similar to nan_policy, that asks the user whether to coerce the inputs to integers, raise, or fill NaNs for invalid cases in array inputs (e.g. exact_policy='raise' | 'coerce' | 'fill-nan'). For factorial, the default could be 'coerce'. For factorial2 and factorialk, raise.

I’m not a fan, to be honest. It has the same high price (another new keyword) for very little gain IMO. In particular, I think coerce is basically never the right thing, and if so, then people should get the warning/error and explicitly cast.

  • Complex extensions (together with non-positive floats) should be behind a keyword

This sounds good.

We’ll have to do some bikeshedding on this. I proposed extend=True|False (more or less as a stand-in, because extend_complex seemed too long), but if we want to leave the recurrence option on the table, we could actually do extend='zero'|'complex'|'recurrence' (with 'zero' being the default and 'recurrence' potentially being added later), which to me looks even better than boolean switches (even if 'recurrence' never materializes).

I would propose to introduce another keyword (sorry…) legacy, defaulting to True, but implement legacy=False with all the corrected behaviour. Where there’s any planned change in behaviour, we raise a deprecation warning recommending to switch to legacy=False. After two releases, we switch the default from legacy=True to legacy=False, and after two more releases we remove the keyword.

I am not against this but maybe we can also consider an alternative approach like adding exact_policy?

I understand that it’s not attractive, but I had a closer look and added this to the OP:

Looking more closely at relevant changes in (currently non-raising) behaviour, it really only affects the zero-dimensional arrays for factorial2(scalar_val, exact=False). In particular, we could choose to let exact=True only raise if extend=True as well, keeping essentially all current non-raising behaviour intact.

In other words, that’s the one behavioural change that we cannot avoid anyway, and using it for that aspect is IMO a reasonable transition (and it only affects one function). We could still deprecate certain combinations that are currently non-raising, also with IMO acceptably little impact.