pandas: PERF: regression in reindex. Pandas 0.23.4 is 60x slower than 0.22.0 with a MultiIndex with datetime64

Reindexing a Series with a two-level MultiIndex, where the first level is datetime64 and the second level is int, is about 60x slower in 0.23.4 than in 0.22.0 (1.98 seconds vs 0.03 seconds). The output comes first, followed by the repro code. The regression persists if you change the first level to int instead of datetime64, but the difference is smaller (0.40 seconds vs 0.03 seconds).


"""
pandas version: 0.23.4
reindex took 1.9770500659942627 seconds

pandas version: 0.22.0
reindex took 0.0306899547577 seconds
"""


import pandas as pd
import time
import numpy as np


if __name__ == '__main__':
    n_days = 300
    dr = pd.date_range(end="20181118", periods=n_days)
    mi = pd.MultiIndex.from_product([dr, range(1440)])

    v = np.random.randn(len(mi))
    mask = np.random.rand(len(v)) < .3
    v[mask] = np.nan
    s = pd.Series(v, index=mi)
    s = s.sort_index()

    s2 = s.dropna()

    start = time.time()

    s2.reindex(index=s.index)

    end = time.time()
    print("pandas version: %s" % pd.__version__)
    print("reindex took %s seconds" % (end - start))
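For readers unfamiliar with the operation being timed, here is a scaled-down sketch of the same pattern. The sizes (3 days by 4 integers instead of 300 by 1440) and the hand-picked gaps (instead of a random mask) are assumptions made for illustration. It shows that reindexing the NaN-dropped series back onto the full index restores the dropped rows as NaN.

```python
import numpy as np
import pandas as pd

# Scaled-down version of the repro: 3 days x 4 integers (illustrative sizes).
dr = pd.date_range(end="20181118", periods=3)
mi = pd.MultiIndex.from_product([dr, range(4)])

v = np.arange(len(mi), dtype=float)
v[[1, 5, 10]] = np.nan  # hand-picked gaps instead of a random mask
s = pd.Series(v, index=mi).sort_index()

s2 = s.dropna()  # rows holding NaN are removed
restored = s2.reindex(index=s.index)

# The dropped rows come back as NaN, so the result matches the original series.
print(len(s2), len(restored))
print(restored.equals(s))
```

This is the step whose cost the repro measures; only the data size differs.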

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 3
  • Comments: 17 (8 by maintainers)

Most upvoted comments

@toobaz I did some further investigation and found that _extract_level_codes is the other major cause of the performance regression.

A minimal example:

import pandas as pd
import time
import numpy as np


if __name__ == '__main__':
    n_days = 2500
    dr = pd.date_range(end="20120101", periods=n_days)
    mi = pd.MultiIndex.from_product([dr, range(1440)])

    v = np.random.randn(len(mi))
    mask = np.random.rand(len(v)) < .3
    v[mask] = np.nan
    s = pd.Series(v, index=mi)
    s = s.sort_index()

    s2 = s.dropna()

    start = time.time()


    match_seq = s2.index.get_indexer_for(s.index)
    # result = s2.index.values
    # resul2 = s.index.values

    end1 = time.time()
    match_seq = s2.index.get_indexer_for(s.index)
    end2 = time.time()

    s2.index._engine._extract_level_codes(s.index)
    end3 = time.time()
    # print(s2)
    print("pandas version: %s" % pd.__version__)
    print("reindex for the first time(include time cost of populating the hash mapping) took %s seconds" % (end1 - start))
    print("reindex with mapping populated took %s seconds" % (end2 - end1))
    print("extract level codes takes %s seconds" % (end3 - end2))

The result:

pandas version: 1.1.4
reindex for the first time(include time cost of populating the hash mapping) took 5.976720809936523 seconds
reindex with mapping populated took 3.858426332473755 seconds
extract level codes takes 3.790076732635498 seconds

So _extract_level_codes takes almost the same amount of time as get_indexer_for! More specifically, the following line is causing the performance regression: https://github.com/pandas-dev/pandas/blob/fd67546153ac6a5685d1c7c4d8582ed1a4c9120f/pandas/_libs/index.pyx#L602
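As background for the timings above, here is a small sketch of what get_indexer_for returns in this setup: for each label in s.index it gives the position of that label in s2.index, or -1 when the label is absent. The sizes and hand-picked gaps are illustrative assumptions. The private engine method _extract_level_codes is deliberately not called here, since its availability depends on the pandas version.

```python
import numpy as np
import pandas as pd

# Tiny index: 2 days x 3 integers (illustrative sizes).
dr = pd.date_range(end="20120101", periods=2)
mi = pd.MultiIndex.from_product([dr, range(3)])

v = np.arange(len(mi), dtype=float)
v[[1, 4]] = np.nan  # hand-picked gaps
s = pd.Series(v, index=mi).sort_index()
s2 = s.dropna()

# Position of each label of s.index inside s2.index; -1 marks a missing label.
indexer = s2.index.get_indexer_for(s.index)
print(indexer)

# The -1 entries line up with the rows that dropna() removed.
print((indexer == -1).tolist() == s.isna().tolist())
```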

So the conclusion is:

  1. The conversion from datetime to object contributes a 6x slowdown (this can easily be fixed if ed15d8e is reverted).
  2. The inefficiency of _extract_level_codes contributes the other 10x slowdown.
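To illustrate what "conversion from datetime to object" means in the first point, the sketch below contrasts two representations of a MultiIndex with a datetime64 level: the integer codes stored per level, and the object array of tuples produced by .values. It makes no claim about which code path a given pandas version takes during reindex. The codes attribute assumes pandas 0.24 or later (earlier versions call it labels).

```python
import pandas as pd

dr = pd.date_range(end="20120101", periods=2)
mi = pd.MultiIndex.from_product([dr, range(3)])

# Level values keep their native dtype; row membership is stored as integer codes.
print(mi.levels[0].dtype)  # a datetime64 dtype
print(mi.codes[0])         # integer codes pointing into levels[0]

# Materializing the index yields an object array of (Timestamp, int) tuples,
# so every datetime value is boxed into a Python-level object.
vals = mi.values
print(vals.dtype)
print(type(vals[0][0]).__name__)
```

Working on the integer codes avoids creating one boxed object per row, which is why the object conversion is a plausible source of overhead on large indexes.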

This issue persists in the latest version, 1.2.3, where reindexing seems to have become even slower.

pandas version: 1.2.3
reindex took 2.6638526916503906 seconds

@TomAugspurger There is still a perf regression with just integer levels, but to a smaller degree. With a DatetimeIndex level the regression was from 0.03 seconds to 1.97 seconds. With just integer levels the regression is from 0.03 seconds to 0.40 seconds, so a slowdown of roughly 10x instead of 60x.

When I change the dr variable to

dr = range(300) # instead of date_range(...)

I get the following output:

pandas version: 0.23.4
reindex took 0.41175103187561035 seconds

pandas version: 0.22.0
reindex took 0.0308949947357 seconds
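The two variants can also be compared within a single run using a small helper like the one below. time_reindex is a hypothetical name introduced here, the number of days is reduced so the sketch finishes quickly, and time.perf_counter is used instead of time.time. Absolute numbers depend on the pandas version and the machine, so no expected timings are shown.

```python
import time

import numpy as np
import pandas as pd


def time_reindex(first_level, n_minor=1440, seed=0):
    """Hypothetical helper: build the series from the repro and time one reindex."""
    mi = pd.MultiIndex.from_product([first_level, range(n_minor)])
    rng = np.random.RandomState(seed)
    v = rng.randn(len(mi))
    v[rng.rand(len(v)) < 0.3] = np.nan
    s = pd.Series(v, index=mi).sort_index()
    s2 = s.dropna()

    start = time.perf_counter()
    s2.reindex(index=s.index)
    return time.perf_counter() - start


n_days = 30  # reduced from 300 to keep the sketch fast
t_dt = time_reindex(pd.date_range(end="20181118", periods=n_days))
t_int = time_reindex(range(n_days))

print("pandas version: %s" % pd.__version__)
print("datetime64 first level: %.4f seconds" % t_dt)
print("int first level:        %.4f seconds" % t_int)
```

Setting n_days back to 300 reproduces the sizes used in the original report.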