pandas: Various methods don't call __finalize__

Improve coverage of NDFrame.__finalize__

Pandas uses NDFrame.__finalize__ to propagate metadata from one NDFrame to another. This ensures that things like self.attrs and self.flags are not lost. In general, we would like any operation that accepts one or more NDFrames and returns an NDFrame to propagate metadata by calling __finalize__.
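As a minimal sketch of what this propagation looks like, the snippet below calls __finalize__ directly on a freshly built DataFrame. The column name and the "units" key are made up for illustration; pandas methods normally call __finalize__ internally, so user code rarely needs to.

```python
import pandas as pd

# Source object carrying metadata (the "units" key is illustrative only).
source = pd.DataFrame({"x": [1, 2]})
source.attrs["units"] = "cm"

# A brand-new DataFrame starts with empty attrs.
fresh = pd.DataFrame({"x": [10, 20]})
attrs_before = dict(fresh.attrs)

# __finalize__ copies metadata from `source` onto `fresh` and returns `fresh`.
result = fresh.__finalize__(source)
print(attrs_before)  # {}
print(result.attrs)  # {'units': 'cm'}
```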

The test file at https://github.com/pandas-dev/pandas/blob/master/pandas/tests/generic/test_finalize.py attempts to be an exhaustive suite of tests for all these cases. However, many tests are currently xfailing, and there are likely many APIs not covered.

This is a meta-issue to improve the use of __finalize__. Here’s a hopefully accurate list of methods that don’t currently call finalize.

Some general comments around finalize

  1. We don’t have a good sense for what should happen to attrs when there are multiple NDFrames involved with differing attrs (e.g. in concat). The safest approach is probably to drop the attrs when they don’t match, but this will need some thought.
  2. We need to be mindful of performance. __finalize__ can be somewhat expensive so we’d like to call it exactly once per user-facing method. This can be tricky for things like DataFrame.apply which is sometimes used internally. We may need to refactor some methods to have a user-facing DataFrame.apply that calls an internal DataFrame._apply. The internal method would not call __finalize__, just the user-facing DataFrame.apply would.

If you’re interested in working on this please post a comment indicating which method you’re working on. Un-xfail the test, then update the method to pass the test. Some of these will be much more difficult to work on than others (e.g. groupby is going to be difficult). If you’re unsure whether a particular method is likely to be difficult, ask first.

  • DataFrame.__getitem__ with a scalar
  • DataFrame.eval with engine="numexpr"
  • DataFrame.duplicated
  • DataFrame.add, mul, etc. (at least for most things; some work to do on conflicts / overlapping attrs in binops)
  • DataFrame.combine, DataFrame.combine_first
  • DataFrame.update
  • DataFrame.pivot, pivot_table
  • DataFrame.stack
  • DataFrame.unstack
  • DataFrame.explode https://github.com/pandas-dev/pandas/pull/46629
  • DataFrame.melt https://github.com/pandas-dev/pandas/pull/46648
  • DataFrame.diff
  • DataFrame.applymap
  • DataFrame.append
  • DataFrame.merge
  • DataFrame.cov
  • DataFrame.corrwith
  • DataFrame.count
  • DataFrame.nunique
  • DataFrame.idxmax, idxmin
  • DataFrame.mode
  • DataFrame.quantile (scalar and list of quantiles)
  • DataFrame.isin
  • DataFrame.pop
  • DataFrame.squeeze
  • Series.abs
  • DataFrame.get
  • DataFrame.round
  • DataFrame.convert_dtypes
  • DataFrame.pct_change
  • DataFrame.transform
  • DataFrame.apply
  • DataFrame.any, sum, std, mean, etc.
  • Series.str. operations returning a Series / DataFrame
  • Series.dt. operations returning a Series / DataFrame
  • Series.cat. operations returning a Series / DataFrame
  • All groupby operations (at least some work)
  • .iloc / .loc https://github.com/pandas-dev/pandas/pull/46101
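
Before picking an item from the list, a quick manual check can show whether a method propagates attrs in your installed pandas. The helper below is a hypothetical sketch, not the harness used in test_finalize.py, and it inspects attrs only, not flags.

```python
import pandas as pd


def propagates_attrs(obj, func):
    """Hypothetical check: does func(obj) carry obj.attrs onto its result?"""
    obj = obj.copy()
    obj.attrs = {"probe": 1}  # marker metadata, illustrative only
    result = func(obj)
    return getattr(result, "attrs", None) == {"probe": 1}


df = pd.DataFrame({"x": [1, 2, 2]})

# Rebuilding the frame from raw values loses the metadata.
rebuilt = propagates_attrs(df, lambda d: pd.DataFrame(d.to_numpy()))
print(rebuilt)  # False

# Explicitly finalizing the rebuilt frame restores it.
finalized = propagates_attrs(df, lambda d: pd.DataFrame(d.to_numpy()).__finalize__(d))
print(finalized)  # True

# Any method from the list can be probed the same way; the answer
# depends on the installed pandas version.
print(propagates_attrs(df, lambda d: d.duplicated()))
```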

About this issue

  • State: open
  • Created 5 years ago
  • Reactions: 3
  • Comments: 47 (36 by maintainers)

Most upvoted comments

Hello. I’d like to contribute to this issue. First, I chose the Series.abs function, but after running some tests, I saw that this method already propagates the Series attributes without an explicit __finalize__ call. I think that the numpy function np.abs creates a copy of the object before calculating the absolute value of the Series entries. Here’s an execution example:

>>> import pandas as pd
>>> s = pd.Series([-5, 4])
>>> s.attrs = {"a": 1, "z": "b"}
>>> result = s.abs()
>>> s.attrs
{'a': 1, 'z': 'b'}
>>> result.attrs
{'a': 1, 'z': 'b'}

So I guess __finalize__ isn’t necessary here, unless I’m wrong. Now I’m working on DataFrame.pop, and will soon submit a PR.

.quantile can be ticked off as well. Done as of #47183