pandas: PERF: regression in reindex. Pandas 0.23.4 is 60x slower than 0.22.0 with a MultiIndex with datetime64
Reindexing a Series with a two-level MultiIndex, where the first level is datetime64 and the second level is int, is about 60x slower in 0.23.4 than in 0.22.0 (1.98 seconds vs 0.03 seconds). Output first, then repro code below. The issue persists if you change the first level to int instead of datetime, but the perf difference is smaller (0.40 seconds vs 0.03 seconds).
"""
pandas version: 0.23.4
reindex took 1.9770500659942627 seconds
pandas version: 0.22.0
reindex took 0.0306899547577 seconds
"""
import pandas as pd
import time
import numpy as np
if __name__ == '__main__':
    n_days = 300
    dr = pd.date_range(end="20181118", periods=n_days)
    mi = pd.MultiIndex.from_product([dr, range(1440)])
    v = np.random.randn(len(mi))
    mask = np.random.rand(len(v)) < .3
    v[mask] = np.nan
    s = pd.Series(v, index=mi)
    s = s.sort_index()
    s2 = s.dropna()
    start = time.time()
    s2.reindex(index=s.index)
    end = time.time()
    print("pandas version: %s" % pd.__version__)
    print("reindex took %s seconds" % (end - start))
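A single time.time() measurement can be noisy, so a slightly more defensive variant of the repro may be useful when comparing versions. The following is only a sketch under stated assumptions: the helper names build_series and best_of are hypothetical, the problem size is reduced to 30 days so it runs quickly, and a fixed seed is used so both pandas versions see the same data.

```python
import time

import numpy as np
import pandas as pd


def build_series(n_days=30, per_day=1440, nan_frac=0.3, seed=0):
    """Build the full series and its NaN-dropped subset, as in the repro.

    Hypothetical helper; sizes are smaller than the original 300 days.
    """
    rng = np.random.RandomState(seed)
    dr = pd.date_range(end="20181118", periods=n_days)
    mi = pd.MultiIndex.from_product([dr, range(per_day)])
    v = rng.randn(len(mi))
    v[rng.rand(len(v)) < nan_frac] = np.nan
    s = pd.Series(v, index=mi).sort_index()
    return s, s.dropna()


def best_of(func, repeats=3):
    """Return the fastest wall-clock time over `repeats` calls to func."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


s, s2 = build_series()
elapsed = best_of(lambda: s2.reindex(index=s.index))
print("pandas version: %s" % pd.__version__)
print("reindex took %.4f seconds (best of 3)" % elapsed)
```

Taking the minimum of several runs reduces the influence of one-off effects such as cache warm-up, which matters when the fast case is only tens of milliseconds.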
About this issue
- State: closed
- Created 6 years ago
- Reactions: 3
- Comments: 17 (8 by maintainers)
@toobaz I did some further investigation and found that _extract_level_codes is the other major cause of the performance regression. A minimal example:
The result:
So _extract_level_codes takes almost the same amount of time as get_indexer_for! To be more specific, it is the following line that causes the performance regression: https://github.com/pandas-dev/pandas/blob/fd67546153ac6a5685d1c7c4d8582ed1a4c9120f/pandas/_libs/index.pyx#L602
So the conclusion is that _extract_level_codes accounts for the other 10x of the speed difference.
This issue still persists with the latest 1.2.3 version, and reindexing seems to get even slower.
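The minimal example and its output did not survive in this copy of the thread. As a rough stand-in, and not the commenter's original snippet, the lookup step can be timed in isolation through the public get_indexer_for method on a MultiIndex. This sketch deliberately avoids calling _extract_level_codes, which is a private engine method whose signature may differ between versions; the 30-day size and the variable names are assumptions chosen for illustration.

```python
import time

import numpy as np
import pandas as pd

# Full index, as in the repro but with fewer days (assumed size).
dr = pd.date_range(end="20181118", periods=30)
full = pd.MultiIndex.from_product([dr, range(1440)])

# Keep roughly 70% of the labels, mimicking the index left after dropna().
rng = np.random.RandomState(0)
keep = rng.rand(len(full)) >= 0.3
subset = full[keep]

# Time only the label lookup that reindex relies on.
start = time.perf_counter()
indexer = subset.get_indexer_for(full)
elapsed = time.perf_counter() - start

print("pandas version: %s" % pd.__version__)
print("get_indexer_for took %.4f seconds" % elapsed)
print("targets not found: %d" % (indexer == -1).sum())
```

Comparing this number with the full reindex timing gives a first impression of how much of the cost sits in the index lookup rather than in building the result.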
@TomAugspurger There is still a perf regression with just integer levels, but to a smaller degree. With a DatetimeIndex first level, reindex went from 0.03 seconds to 1.97 seconds. With just integer levels it went from 0.03 seconds to 0.40 seconds, a slowdown of roughly 10x instead of 60x.
When I change the dr variable to
I get the following output:
pandas version: 0.23.4
reindex took 0.41175103187561035 seconds
pandas version: 0.22.0
reindex took 0.0308949947357 seconds
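The replacement value for dr is missing from this copy of the comment. Assuming it was a plain integer range (an assumption based on the description of changing the first level to int), the two cases could be compared side by side roughly as follows. The helper name time_reindex is hypothetical and the size is reduced to 30 first-level values.

```python
import time

import numpy as np
import pandas as pd


def time_reindex(first_level, per_day=1440, seed=0):
    """Time s2.reindex(index=s.index) for a given first index level.

    Hypothetical helper; returns (seconds, reindexed series, full series).
    """
    rng = np.random.RandomState(seed)
    mi = pd.MultiIndex.from_product([first_level, range(per_day)])
    v = rng.randn(len(mi))
    v[rng.rand(len(v)) < 0.3] = np.nan
    s = pd.Series(v, index=mi).sort_index()
    s2 = s.dropna()
    start = time.perf_counter()
    result = s2.reindex(index=s.index)
    return time.perf_counter() - start, result, s


n_days = 30
datetime_level = pd.date_range(end="20181118", periods=n_days)
int_level = range(n_days)  # assumed stand-in for the missing dr value

t_dt, res_dt, s_dt = time_reindex(datetime_level)
t_int, res_int, s_int = time_reindex(int_level)

print("pandas version: %s" % pd.__version__)
print("datetime64 first level: %.4f seconds" % t_dt)
print("int first level:        %.4f seconds" % t_int)
```

Running the same script under each pandas version keeps the data identical across runs, so any difference between the two lines is attributable to the first-level dtype.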