pandas: BUG: Indexes still include values that have been deleted

Using pandas 0.10. If we create a DataFrame with a MultiIndex and then delete all the rows with value X, we’d expect the index to no longer show value X. But it does. Note the apparent inconsistency between “index” and “index.levels”: one shows that the values have been deleted, but the other doesn’t.

import pandas

x = pandas.DataFrame([['deleteMe',1, 9],['keepMe',2, 9],['keepMeToo',3, 9]], columns=['first','second', 'third'])
x = x.set_index(['first','second'], drop=False)

x = x[x['first'] != 'deleteMe'] #Chop off all the 'deleteMe' rows

print(x.index) #Good: Index no longer has any rows with 'deleteMe'. But....

print(x.index.levels) #Bad: index still shows the "deleteMe" values are there. But why? We deleted them.

x.groupby(level='first').sum() #Bad: it's creating a dummy row for the rows we deleted!

We don’t want the deleted values to show up in that groupby. Can we eliminate them?

About this issue

  • State: closed
  • Created 11 years ago
  • Comments: 35 (24 by maintainers)

Most upvoted comments

I think this can be closed: the default behavior is as intended, and the method MultiIndex.remove_unused_levels() has been added as a simple fix for whoever doesn’t like the default behavior.
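As a minimal sketch of the fix mentioned above, the snippet below rebuilds the report's data on a recent pandas release (an assumption; `MultiIndex.remove_unused_levels()` does not exist in 0.10) and compares the level values before and after pruning. The variable names `filtered` and `pruned` are illustrative only.

```python
import pandas as pd

# Same data as in the report, with 'first' and 'second' as the MultiIndex.
df = pd.DataFrame(
    [["deleteMe", 1, 9], ["keepMe", 2, 9], ["keepMeToo", 3, 9]],
    columns=["first", "second", "third"],
).set_index(["first", "second"])

# Drop the 'deleteMe' rows with a boolean mask, as the report does.
filtered = df[df.index.get_level_values("first") != "deleteMe"]

# The stored level still lists 'deleteMe' although no row uses it.
stale_levels = list(filtered.index.levels[0])

# remove_unused_levels() returns a new MultiIndex holding only used values.
pruned = filtered.copy()
pruned.index = pruned.index.remove_unused_levels()
pruned_levels = list(pruned.index.levels[0])

print(stale_levels)
print(pruned_levels)
```

The method returns a new index rather than modifying the existing one, so the result has to be assigned back to the frame.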

The pandas API doesn’t fit in my head anymore. For reference, df.index.get_level_values might be relevant for whatever use case this was a problem for. It does the right thing.

    x = pandas.DataFrame([['deleteMe',1, 9],['keepMe',2, 9],['keepMeToo',3, 9]], columns=['first','second', 'third'])
    x = x.set_index(['first','second'], drop=False)

    print(x.index.get_level_values(0))
    x = x[x['first'] != 'deleteMe'] #Chop off all the 'deleteMe' rows
    print(x.index.get_level_values(0))

Index([u'deleteMe', u'keepMe', u'keepMeToo'], dtype='object')
Index([u'keepMe', u'keepMeToo'], dtype='object')
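To make the distinction explicit, here is a small sketch (assuming a recent pandas release and Python 3) that contrasts the two views of the same filtered index: `get_level_values` reports the labels actually present, while `levels` reports the stored set of possible labels. The names `kept`, `present`, and `stored` are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("deleteMe", 1), ("keepMe", 2), ("keepMeToo", 3)],
    names=["first", "second"],
)

# Boolean filtering keeps the original levels and only drops the row entries.
kept = index[index.get_level_values("first") != "deleteMe"]

# Labels that actually occur in the remaining rows.
present = list(kept.get_level_values("first").unique())

# Labels stored in the level, including ones that are no longer used.
stored = list(kept.levels[0])

print(present)
print(stored)
```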

@robertmuil

Sorry, forgot to respond to you.

Here is an easy way to do this:

Create the new frame (FYI, in general doing things in place is, IMHO, confusing to the user and doesn’t help with speed at all):

In [43]: dropped = x.drop('deleteMe', level='first')

In [44]: dropped
Out[44]: 
                  third
first     second       
keepMe    2           9
keepMeToo 3           9

In [46]: dropped.index.get_level_values(level='first').unique()
Out[46]: array(['keepMe', 'keepMeToo'], dtype=object)

This returns a new frame (alternatively, you can assign the new index if you want):

In [47]: dropped.set_index(pd.MultiIndex.from_tuples(dropped.index.values))
Out[47]: 
             third
keepMe    2      9
keepMeToo 3      9

In [50]: dropped.set_index(pd.MultiIndex.from_tuples(dropped.index.values)).index.get_level_values(level=0).unique()
Out[50]: array(['keepMe', 'keepMeToo'], dtype=object)

# has only the new values
In [51]: dropped.set_index(pd.MultiIndex.from_tuples(dropped.index.values)).index.levels[0]                         
Out[51]: Index([u'keepMe', u'keepMeToo'], dtype='object')

This is pretty cheap to do (though not completely free).
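Note that in Out[47] above the rebuilt index has lost its level names. A variant of the same rebuild that keeps them might look like the sketch below; it assumes a recent pandas release, and passing `names=dropped.index.names` is an addition that is not part of the original comment.

```python
import pandas as pd

df = pd.DataFrame(
    {"third": [9, 9, 9]},
    index=pd.MultiIndex.from_tuples(
        [("deleteMe", 1), ("keepMe", 2), ("keepMeToo", 3)],
        names=["first", "second"],
    ),
)
dropped = df.drop("deleteMe", level="first")

# Rebuild the index from the remaining tuples so the levels are recomputed,
# and carry the level names over explicitly.
rebuilt = dropped.set_index(
    pd.MultiIndex.from_tuples(dropped.index.values, names=dropped.index.names)
)

print(rebuilt.index.levels[0])
print(list(rebuilt.index.names))
```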

I suppose you could add this as an option to drop if you’d like (and I would say reindex would be a fine keyword for this).

Would you like to do a pull request?