pandas: Reindex broken

MWE

from __future__ import print_function

import pandas as pd
import numpy as np

print("Pandas version:", pd.__version__)
print("+++++++++++++++++++++++++++++++++++")
pd.show_versions()  # prints its report directly and returns None
print("+++++++++++++++++++++++++++++++++++")

####################################################
# Config
####################################################

pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

####################################################
# Read data
####################################################

file = "/tmp/california_housing_train.csv"
if(np.DataSource().exists(file)):
	dataset = file
else:
	dataset = "https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv"

sep = ","
# Pass the separator by keyword; newer pandas releases reject it as a positional argument
california_housing_dataframe = pd.read_csv(dataset, sep=sep)

####################################################
# Reorder
####################################################

newOrder = np.random.permutation(california_housing_dataframe.index)
california_housing_dataframe_reordered = california_housing_dataframe.reindex(newOrder)

####################################################
# Merge and show diff of the heads
####################################################

# Let's take the heads of both datasets and compare them.
# They should differ in (almost) all rows.

head1 = california_housing_dataframe.head(10)
head2 = california_housing_dataframe_reordered.head(10)

# @see https://stackoverflow.com/a/36893675/605890
merged = head1.merge(head2, indicator=True, how='outer')
print(merged)

Run on colab

I created a colab for the MWE, which is based on pandas 0.22.0:

https://colab.research.google.com/drive/19uDE_H4AtpLaEL6INrRrDMXkdANsNr69#scrollTo=CzxuGppV26Rt

If you run it, the output shows (unless a row happens to land in both heads by chance):

  • 10x left_only
  • 10x right_only

Run with docker containers

Now run the same MWE (located at /tmp/tf/Bug.py) in two different docker containers, both of which use pandas 0.23.4:

Both return:

  • 10x both

This means both heads are identical, so reindex has had no effect.
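
For reference, the head comparison can be reduced to a small frame with a fixed reordering. This is only a sketch: the toy frame `df`, the reversed label order, and the head size of 3 are made up for illustration and are not part of the MWE.

```python
import pandas as pd

# Hypothetical toy frame; the MWE uses the California housing data instead.
df = pd.DataFrame({'a': range(6), 'b': range(10, 16)})

# Fixed reordering (reversed labels) so the result does not depend on randomness.
reordered = df.reindex(df.index[::-1])

head1 = df.head(3)
head2 = reordered.head(3)

# With no `on=` argument, merge joins on all shared columns ('a' and 'b').
merged = head1.merge(head2, indicator=True, how='outer')

# Count rows that appear only on the left, only on the right, or in both heads.
left_only = int((merged['_merge'] == 'left_only').sum())
right_only = int((merged['_merge'] == 'right_only').sum())
both = int((merged['_merge'] == 'both').sum())
print("left_only:", left_only, "right_only:", right_only, "both:", both)
```

When reindex works, no row should be counted under `both` here. A result consisting only of `both` rows, as in the docker runs, means the two heads contain the same rows.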

Python docker container (python 3.6.6)

docker run --rm -it -v /tmp/tf/:/tmp/ python:3.6.6 /bin/bash -c "pip install pandas && python /tmp/Bug.py"

tensorflow docker container (tensorflow 1.11.0)

docker run --rm -it -v /tmp/tf/:/tmp/ tensorflow/tensorflow:1.11.0-py3 python /tmp/Bug.py 

TLDR

The following code does not have any effect in pandas 0.23.4:

california_housing_dataframe_reordered = california_housing_dataframe.reindex(newOrder)

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 16 (8 by maintainers)

Most upvoted comments

This looks to be the root cause of the issue: https://github.com/numpy/numpy/issues/11975, which should be fixed in the next numpy release (1.15.3).

In [1]: import numpy as np; np.__version__
Out[1]: '1.15.2'

In [2]: import pandas as pd; pd.__version__
Out[2]: '0.23.4'

In [3]: df = pd.DataFrame({'a': range(5), 'b': np.arange(0, 0.5, 0.1)})

In [4]: df
Out[4]:
   a    b
0  0  0.0
1  1  0.1
2  2  0.2
3  3  0.3
4  4  0.4

In [5]: new_order = np.random.permutation(df.index)

In [6]: df
Out[6]:
   a    b
1  0  0.0
2  1  0.1
0  2  0.2
3  3  0.3
4  4  0.4
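
The session above suggests that `np.random.permutation(df.index)` shuffled the frame's index in place. A quick way to check whether a given environment is affected might look like the following sketch. The variable names are my own, and a random permutation can by chance equal the original order, so a single `False` is not conclusive on a small frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(5)})

# Snapshot the labels as a plain list so later mutation cannot affect the copy.
labels_before = df.index.tolist()

# On affected NumPy versions this call is reported to shuffle df.index in place.
new_order = np.random.permutation(df.index)

labels_after = df.index.tolist()
index_was_mutated = labels_before != labels_after
print("numpy", np.__version__, "| index mutated in place:", index_was_mutated)
```

On a NumPy release that contains the fix, the index should be left untouched and the flag should be `False`.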

The workaround of using df.index.values as suggested in the issue appears to work:

In [7]: df = pd.DataFrame({'a': range(5), 'b': np.arange(0, 0.5, 0.1)})

In [8]: df
Out[8]:
   a    b
0  0  0.0
1  1  0.1
2  2  0.2
3  3  0.3
4  4  0.4

In [9]: new_order = np.random.permutation(df.index.values)

In [10]: df
Out[10]:
   a    b
0  0  0.0
1  1  0.1
2  2  0.2
3  3  0.3
4  4  0.4

In [11]: df.reindex(new_order)
Out[11]:
   a    b
0  0  0.0
4  4  0.4
3  3  0.3
2  2  0.2
1  1  0.1
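
Put together as a script, the workaround could be sketched as below. The explicit `.copy()` and the `df.sample(frac=1)` alternative are my additions rather than something suggested in the linked issue. Both avoid handing the `Index` object itself to NumPy.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(5), 'b': [0.0, 0.1, 0.2, 0.3, 0.4]})

# Option 1: permute a plain NumPy copy of the labels, never the Index object itself.
labels = df.index.values.copy()
shuffled = df.reindex(np.random.permutation(labels))

# Option 2: let pandas shuffle the rows; frac=1 samples every row exactly once.
sampled = df.sample(frac=1)

# In both results each row should still carry its original label,
# so column 'a' (which equals the original label) must match the index.
print((shuffled['a'].values == shuffled.index.values).all())
print((sampled['a'].values == sampled.index.values).all())
```

Both checks compare column `a` against the index, which only works because this toy frame was built with `a` equal to the row label.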

My code:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': range(5), 'b': np.arange(0, 0.5, 0.1)})
print(df)

print('++++++++++++++')

new_order = np.random.permutation(df.index)
print(df.reindex(new_order))

and the result from the tensorflow container:

   a    b
0  0  0.0
1  1  0.1
2  2  0.2
3  3  0.3
4  4  0.4
++++++++++++++
   a    b
4  0  0.0
3  1  0.1
1  2  0.2
2  3  0.3
0  4  0.4

But it should be the following for the reindex, right?

   a    b
4  4  0.4
3  3  0.3
1  1  0.1
2  2  0.2
0  0  0.0
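
As far as I understand `reindex`, yes: it selects rows by label, so each row should travel with its label. A small sketch with the label order from the output above hard-coded (instead of a random permutation, which sidesteps the NumPy issue) shows the expected pairing:

```python
import pandas as pd

df = pd.DataFrame({'a': range(5), 'b': [0.0, 0.1, 0.2, 0.3, 0.4]})

# Fixed label order taken from the output above, instead of a random permutation.
fixed_order = [4, 3, 1, 2, 0]
result = df.reindex(fixed_order)
print(result)

# reindex selects rows by label, so column 'a' must follow the requested order.
assert result.index.tolist() == fixed_order
assert result['a'].tolist() == fixed_order
```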