scipy: BUG: handle inconsistencies in factorial functions and their extensions

In the context of #15351 and reviewing #15299, I noticed a bunch of inconsistencies that should be fixed IMO.

TL;DR

Courtesy of @tirthasheshpatel:

  • factorial, factorial2, and factorialk behave inconsistently for several different combinations of inputs (for example, factorial2 returns 0-dim arrays for scalars, and factorial2 and factorialk aren’t vectorized). See the first table for more details. This issue can be tackled by adding a new parameter like legacy which, by default, preserves the old behaviour but, when false, handles the currently inconsistent behaviour correctly.
  • Support for float and complex inputs can easily be added using this extension. This enhancement adds a new parameter extend='zero'|'complex' (and perhaps 'recurrence' down the line) that specifies whether to use the extension formula or the default formula. See the second table for more details.

Design Space

There are several dimensions that come into play:

  • scalar vs. array inputs
  • for integers: approximation or exact computation (exact=)
  • extensions to negative integers
  • extensions to floats
  • extensions to complex numbers
  • extensions to poles (possible e.g. for factorial2)
  • for the array case, failing softly (e.g. inserting NaNs) or hard (raising an error)

Current Behaviour (as of 1.8.0)

| Inputs | factorial(exact=False) | factorial(exact=True) | factorial2(exact=False) | factorial2(exact=True) | factorialk(exact=False) | factorialk(exact=True) |
|---|---|---|---|---|---|---|
| Scalar positive integer | float | int | !! 0-dim float array !! | int | NotImplementedError | int |
| Scalar negative integer | 0.0 | 0 | !! array(0.) !! | 0 | NotImplementedError | 0 |
| Scalar positive float | float | ValueError (but text misleading) | TypeError | TypeError | NotImplementedError | TypeError |
| Scalar negative float | 0.0 | 0 | TypeError | TypeError | NotImplementedError | TypeError |
| Scalar complex | TypeError | TypeError | TypeError | TypeError | NotImplementedError | TypeError |
| Scalar NaN | NaN | NaN | TypeError | TypeError | NotImplementedError | TypeError |
| Positive integer in array | float | int | float | !! TypeError !! | NotImplementedError | TypeError |
| Negative integer in array | 0.0 | 0 | 0.0 | !! TypeError !! | NotImplementedError | TypeError |
| Positive float in array | float | ValueError (but text misleading) | TypeError | TypeError | NotImplementedError | TypeError |
| Negative float in array | 0.0 | 0 | TypeError | TypeError | NotImplementedError | TypeError |
| Complex in array | TypeError | TypeError | TypeError | TypeError | NotImplementedError | TypeError |
| NaN in array | NaN | NaN | TypeError | TypeError | NotImplementedError | TypeError |
| Empty array | array([]) | IndexError | TypeError | TypeError | NotImplementedError | TypeError |

Note that:

  • factorial2(<scalar>, exact=False) returns a zero-dimensional array instead of a scalar.
  • All the TypeErrors are IMO implementation oversights (and, in the case of “!!”, against documentation) - i.e. we should provide better error messages.
  • There is a full generalization of factorialk which is basically just a ratio of two gamma functions; it would work for all complex n & k (up to poles), as well as for arrays. Some of these points are illustrated in the snippet below.
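
To make a few of these points concrete, here’s a small snippet (behaviour as observed with SciPy 1.8.0; exact outputs may differ between versions):

```python
# Behaviour as observed with SciPy 1.8.0
import numpy as np
from scipy.special import factorial, factorial2, factorialk

print(factorial(5))                 # 120.0 -- a plain scalar
print(factorial2(5))                # array(15.) -- a 0-dim array instead of a scalar
print(factorial(np.array([3, 4])))  # array([ 6., 24.]) -- vectorized
# factorial2(np.array([3, 4]), exact=True)  # raises TypeError
# factorialk(5, 3, exact=False)             # raises NotImplementedError
```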

Extension Issues

In addition to this hodgepodge of behaviour, some people would like to have the float/complex extensions of these functions (e.g. https://github.com/scipy/scipy/pull/15349). I tend to agree, but that raises another problem:

Complex extensions are incompatible with current inputs ≤0

First off, everything that currently yields 0 on the negative axis would get a different value. But also note the “≤”: namely, 0!! is 1 in the regular definition, but ≈0.79 = sqrt(2/π) in the extension.
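
A quick numerical check of that value, using the gamma-function form of formula (1) given further below:

```python
# 0!! under the complex extension (formula (1) below) vs. the conventional 0!! = 1
import math
from scipy.special import gamma

z = 0
ext = 2**((z - 1) / 2) * gamma(z / 2 + 1) / gamma(1 / 2 + 1)
print(ext)                     # 0.7978845608028654
print(math.sqrt(2 / math.pi))  # same value: sqrt(2/pi)
```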

Extensions based on recurrence are incompatible with current status as well as complex extension

This came up for example in #15299. It’s possible to invert the recurrence relation n!! = n * (n-2)!! into (n-2)!! = n!! / n, and so extend the double factorial to negative integers. The resulting values would be neither 0 (the current behaviour) nor the poles of the complex extension.
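
For illustration, a tiny sketch of that recurrence-based extension (the helper is hypothetical, not a SciPy API); e.g. n = 1 gives (-1)!! = 1, and n = -1 gives (-3)!! = -1:

```python
# Hypothetical sketch: extending n!! to negative odd integers via the
# inverted recurrence (n-2)!! = n!! / n. Not a SciPy API.
def double_factorial_recurrence(n):
    if n in (0, 1):
        return 1.0
    if n > 1:
        return n * double_factorial_recurrence(n - 2)
    # n < 0: step the recurrence backwards; negative even n eventually
    # divides by zero, matching the pole in the complex extension
    return double_factorial_recurrence(n + 2) / (n + 2)

print(double_factorial_recurrence(-1))  # 1.0
print(double_factorial_recurrence(-3))  # -1.0 (current SciPy returns 0 here)
```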

Design Considerations

If we (or at least I) were designing from scratch:

  • Changing from exact=False to exact=True should never fundamentally change the result (e.g. going from 0 to non-zero, or casting inputs to integers as seen in #15351).
    • As such, the exact keyword only really matters for positive integer scalars; in all other cases, it could be retired, raise, or simply be turned into a no-op (cf. #15359).
    • It’s an open question whether it’s worth properly implementing exact=True for the array case (currently, only factorial does).
  • Complex extensions (together with non-positive floats) should be behind a keyword, roughly because:
    • 0-for-negative-values is a valid approach, and has been in scipy ~forever.
    • Since the values being produced are incompatible (for certain inputs), the complex extension should be explicitly opt-in.
  • Using recurrence relations for extending into the poles is an open question; not sure if it’s worth it, but if so, probably with another value for the extend keyword (rather than another keyword).
  • Use the most general approximation, except when alternatives with better accuracy / performance exist.
  • Probably need to think about dtype casting rules for empty arrays (keep dtype? always float? …)

Possible Future Behaviour (note extend=, not exact=)

| Inputs | factorial(extend='zero') | factorial(extend='complex') | factorial2(extend='zero') | factorial2(extend='complex') | factorialk(extend='zero') | factorialk(extend='complex') |
|---|---|---|---|---|---|---|
| Scalar positive integer w/ exact=True | n! | n! | n!! | n!! | n!(k) | n!(k) |
| Scalar positive integer | Γ(n+1) | Γ(n+1) | (1) | (1) | (2) | (2) |
| Scalar negative integer / poles | 0.0 | NaN | 0.0 | NaN | 0.0 | NaN |
| Scalar positive float | Γ(n+1) | Γ(n+1) | (1) | (1) | (2) | (2) |
| Scalar negative float | 0.0 | Γ(n+1) | 0.0 | (1) | 0.0 | (2) |
| Scalar complex | ValueError | Γ(n+1) | ValueError | (1) | ValueError | (2) |
| Scalar NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Positive integer in array | Γ(n+1) | Γ(n+1) | (1) | (1) | (2) | (2) |
| Negative integer in array | 0.0 | NaN | 0.0 | NaN | 0.0 | NaN |
| Positive float in array | Γ(n+1) | Γ(n+1) | (1) | (1) | (2) | (2) |
| Negative float in array | 0.0 | Γ(n+1) | 0.0 | (1) | 0.0 | (2) |
| Complex in array | ValueError | Γ(n+1) | ValueError | (1) | ValueError | (2) |
| NaN in array | NaN | NaN | NaN | NaN | NaN | NaN |
| Empty array | array([]) | array([]) | array([]) | array([]) | array([]) | array([]) |

The formulas in question are:

  • (1): z!! = 2^((z-1)/2) Γ(z/2 + 1) / Γ(1/2 + 1)
  • (2): z!(α) = α^((z-1)/α) Γ(z/α + 1) / Γ(1/α + 1)
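
A minimal sketch of these two formulas in terms of scipy.special.gamma (function names are hypothetical, not the proposed SciPy signatures; α is written as k):

```python
# Formulas (1) and (2) via scipy.special.gamma; names are hypothetical.
from scipy.special import gamma

def factorial2_ext(z):
    """(1): z!! = 2^((z-1)/2) * Gamma(z/2 + 1) / Gamma(1/2 + 1)"""
    return 2**((z - 1) / 2) * gamma(z / 2 + 1) / gamma(1 / 2 + 1)

def factorialk_ext(z, k):
    """(2): z!(k) = k^((z-1)/k) * Gamma(z/k + 1) / Gamma(1/k + 1)"""
    return k**((z - 1) / k) * gamma(z / k + 1) / gamma(1 / k + 1)

print(factorial2_ext(7.0))     # 105.0 -- matches 7!! = 7 * 5 * 3 * 1
print(factorialk_ext(7.0, 3))  # 28.0  -- matches 7!(3) = 7 * 4 * 1
```

(As discussed further below, (1) reproduces n!! for odd n but not for even n.)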

I’d very much welcome feedback on this - there are some trade-offs to be made (e.g. whether negative floats should also raise without extend='complex', or whether the errors in the array case should just map to NaN; also, there are different approximations for n!! in use currently), but I think the above would be reasonable. In particular, the scalar and array cases behave the same way.

Implementation Plan

I would propose to introduce another keyword (sorry…) legacy, defaulting to True, but implement legacy=False with all the corrected behaviour. Where there’s any planned change in behaviour, we raise a deprecation warning recommending to switch to legacy=False. After two releases, we switch the default from legacy=True to legacy=False, after two more releases we raise on legacy=True, and after two more we remove the keyword.

Looking more closely at relevant changes in (currently non-raising) behaviour, it really only affects the zero-dimensional arrays for factorial2(scalar_val, exact=False). In particular, if we let exact=True be a no-op for non-integers (the less “noisy” option from a UX point of view), we would keep essentially all current non-raising behaviour intact.
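
A rough sketch of what that transition could look like for factorial2 (entirely hypothetical; the names, warning text, and detection logic are illustrative only):

```python
# Hypothetical sketch of the legacy-keyword transition for factorial2.
import warnings
import numpy as np
from scipy import special

def factorial2_sketch(n, exact=False, legacy=True):
    # The one planned change in non-raising behaviour: scalar input with
    # exact=False currently yields a 0-dim array instead of a scalar.
    if legacy and np.ndim(n) == 0 and not exact:
        warnings.warn(
            "factorial2 will return a scalar instead of a 0-dim array for "
            "scalar inputs; pass legacy=False to opt in to the new behaviour",
            DeprecationWarning, stacklevel=2)
    res = special.factorial2(n, exact=exact)
    # New behaviour: unwrap 0-dim arrays into true scalars
    if not legacy and isinstance(res, np.ndarray) and res.ndim == 0:
        return res[()]
    return res
```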

CC @steppi @tirthasheshpatel @j-bowhay @tupui

Edit History:

  • v4 (2022-03-10): Make exact=True a no-op for non-integers (less noisy), following #15359. Extend deprecation period of legacy-keyword along the same lines as proposed in https://github.com/scipy/scipy/issues/15722.
  • v3 (2022-02-16; up to rev 20): Added (current & proposed) behaviour for NaN and empty arrays; good catch @tirthasheshpatel!
  • v2 (2022-02-16; up to rev 15): Added TL;DR, clarification on impact of legacy, changed extend from boolean argument to take an enum.
  • v1 (2022-02-16; up to rev 10): Original

Most upvoted comments

Thanks for the incredibly detailed report @h-vetinari 🤯! I suggest adding a TLDR; something like this if I understand the issue correctly:

  1. factorial, factorial2, and factorialk behave inconsistently for several different combinations of inputs (for example, factorial2 returns 0-dim arrays for scalars, and factorial2 and factorialk aren’t vectorized). See the first table for more details. This issue can be tackled by adding a new parameter like legacy which, by default, preserves the old behaviour but, when false, handles the currently inconsistent behaviour correctly.
  2. Support for float and complex inputs can easily be added using this extension. This enhancement adds a new boolean parameter extend that specifies whether to use the extension formula or the default formula. See the second table for more details.

Can this issue be split into two: one for addressing inconsistencies and one for adding the extension?

  • factorial2(<scalar>, exact=False) returns a zero-dimensional array instead of a scalar.

This is somewhat common in stats too. Although it’s an inconsistency, such outputs mostly behave like scalars (and also get coerced into one if operated with scalars). I think there are more important things we should tackle first: for example, vectorizing factorial2 and factorialk (these functions throwing for array inputs is very unintuitive).

  • As such, the exact keyword only really matters for positive integer scalars; in all other cases, it could be retired or raise.

I agree that raising would be nice but that would again be a backward-incompatible change. Rather, we can add a keyword, similar to nan_policy, that asks the user whether to coerce the inputs to integers, raise, or fill NaNs for invalid cases in array inputs (e.g. exact_policy='raise' | 'coerce' | 'fill-nan'). For factorial, the default could be 'coerce'. For factorial2 and factorialk, raise.

  • Complex extensions (together with non-positive floats) should be behind a keyword

This sounds good.

I would propose to introduce another keyword (sorry…) legacy, defaulting to True, but implement legacy=False with all the corrected behaviour. Where there’s any planned change in behaviour, we raise a deprecation warning recommending to switch to legacy=False. After two releases, we switch the default from legacy=True to legacy=False, and after two more releases we remove the keyword.

I am not against this but maybe we can also consider an alternative approach like adding exact_policy?

I consider this issue solved by #15841. For the open points regarding the extensions, I’ve opened a follow-up issue: https://github.com/scipy/scipy/issues/18409

In almost every piece of software I’ve used, if we have multiple functions doing similar but different versions of a functionality, 5 years down the line we struggle to deprecate/remove them.

I would have liked this function to be in the NumPy namespace, since it is a very common math operation, or in the standard math library, but it is now living in special-functions territory because reasons.

So, at the very least, can we combine them into a single signature, since they are special cases of k=1, 2, and get rid of the others eventually in the future?

Next problem - the “universal” approximation for n!! is very bad for even numbers (the odd ones work well). For example, formula (1) from the OP, calculated to high precision with mpmath, gives: 10!! = 3063.8767134830029.

However, the true value is very different:

>>> from scipy.special import factorial2
>>> factorial2(10)
3840.

and the current approximation actually does an extremely good job (down to a single eps even for n=100)

>>> factorial2(10) / float(factorial2(10, exact=True)) - 1
0.0
>>> factorial2(100) / float(factorial2(100, exact=True)) - 1
2.220446049250313e-16
>>> float(factorial2(100, exact=True))
3.4243224702511973e+79
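
For reference, evaluating formula (1) directly reproduces the value quoted above:

```python
# Formula (1) at z=10: 2^((10-1)/2) * Gamma(6) / Gamma(1.5)
from scipy.special import gamma
print(2**(9 / 2) * gamma(6) / gamma(1.5))  # ~3063.8767, far from the true 3840.0
```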

However, the formula picked up by @j-bowhay in #15349 seems to do a much better job also for even integers.

While working on this, I just stumbled over test_factorial_0d_return_type, and thus https://github.com/scipy/scipy/issues/7400 & https://github.com/scipy/scipy/pull/10954. This shows that the 0d array-output for factorial2 was already undesired a long time ago, and was merely overlooked in #10954.

The reason this is relevant is that if we treat it as a bugfix (similar to #10954), then we’d have no deprecations to do for the other points in this issue.

However, I wanted to tag the participants of those old discussions, because I think #10954 went too far in fixing #7400 - I don’t think the factorial functions should be changing array inputs (even zero-dimensional ones) to scalars. IMO it should be exclusively:

  • scalar->scalar
  • array->array (with identical dimensions).

CC @WarrenWeckesser @person142 @rgommers @larsoner

Maybe a good first step would be to add some benchmarks?

Absolutely! 😃

IMO we should basically aim for all the cases in the first table that don’t raise (except NaN/empty array), plus some mixed inputs in the array case (though the negative inputs returning 0 are presumably less interesting).

But to do that, we need some benchmarks against the current most important code paths (haven’t verified current status of factorial in benchmarks/), and ideally some as-tight-as-possible accuracy tests.

Agreed, this was my hesitation in making them thin wrappers as there is a potential loss in performance and accuracy as some cases have more optimal ways of performing the calculation that don’t generalise.

In benchmarks/benchmarks/special.py there are currently no benchmarks for any of the factorial stuff. Maybe a good first step would be to add some benchmarks?
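
A possible starting point (hypothetical; just following the usual asv conventions of that file):

```python
# Hypothetical asv benchmark sketch for benchmarks/benchmarks/special.py
import numpy as np
from scipy.special import factorial

class Factorial:
    params = [10, 100, 1000]
    param_names = ['n']

    def setup(self, n):
        self.values = np.arange(1, n + 1)

    def time_factorial_approx(self, n):
        factorial(self.values, exact=False)

    def time_factorial_exact(self, n):
        factorial(self.values, exact=True)
```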

@tirthasheshpatel

Thanks for the incredibly detailed report @h-vetinari 🤯!

🙃

I suggest adding a TLDR; something like this if I understand the issue correctly:

Done, thanks!

Can this issue be split into two: one for addressing inconsistencies and one for adding the extension?

Fine by me, and effectively adopted with your TL;DR.

  • factorial2(<scalar>, exact=False) returns a zero-dimensional array instead of a scalar.

This is somewhat common in stats too. Although it’s an inconsistency, such outputs mostly behave like scalars (and also get coerced into one if operated with scalars). I think there are more important things we should tackle first: for example, vectorizing factorial2 and factorialk (these functions throwing for array inputs is very unintuitive).

Prioritization aside, I think we should clean this up. 0-dim arrays used to be a bit of a wild west thing, but it’s getting cleaned up in lots of APIs (not just numpy/scipy, also other downstreams like cvxpy).

  • As such, the exact keyword only really matters for positive integer scalars; in all other cases, it could be retired or raise.

I agree that raising would be nice but that would again be a backward-incompatible change. Rather, we can add a keyword, similar to nan_policy, that asks the user whether to coerce the inputs to integers, raise, or fill NaNs for invalid cases in array inputs (e.g. exact_policy='raise' | 'coerce' | 'fill-nan'). For factorial, the default could be 'coerce'. For factorial2 and factorialk, raise.

I’m not a fan, to be honest. It has the same high price (another new keyword) for very little gain IMO. In particular, I think coerce is basically never the right thing, and if so, then people should get the warning/error and explicitly cast.

  • Complex extensions (together with non-positive floats) should be behind a keyword

This sounds good.

We’ll have to do some bikeshedding on this. I proposed extend=True|False (more or less as a stand-in, because extend_complex seemed too long), but if we want to leave the recurrence option on the table, we could actually do extend='zero'|'complex'|'recurrence' (with 'zero' being the default and 'recurrence' potentially being added later), which to me looks even better than boolean switches (even if 'recurrence' never materializes).

I would propose to introduce another keyword (sorry…) legacy, defaulting to True, but implement legacy=False with all the corrected behaviour. Where there’s any planned change in behaviour, we raise a deprecation warning recommending to switch to legacy=False. After two releases, we switch the default from legacy=True to legacy=False, and after two more releases we remove the keyword.

I am not against this but maybe we can also consider an alternative approach like adding exact_policy?

I understand that it’s not attractive, but I had a closer look and added this to the OP:

Looking more closely at relevant changes in (currently non-raising) behaviour, it really only affects the zero-dimensional arrays for factorial2(scalar_val, exact=False). In particular, we could choose to let exact=True only raise if extend=True as well, keeping essentially all current non-raising behaviour intact.

In other words, that’s the one behavioural change that we cannot avoid anyway, and using it for that aspect is IMO a reasonable transition (and it only affects one function). We could still deprecate certain combinations that are currently non-raising, also with IMO acceptably little impact.