pandas: Reindex broken
MWE
from __future__ import print_function
import pandas as pd
import numpy as np
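print("pandas version:", pd.__version__)
print("+++++++++++++++++++++++++++++++++++")
# show_versions() prints its report itself and returns None,
# so it is called directly instead of being wrapped in print()
pd.show_versions()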
print("+++++++++++++++++++++++++++++++++++")
####################################################
# Config
####################################################
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format
####################################################
# Read data
####################################################
file = "/tmp/california_housing_train.csv"
if np.DataSource().exists(file):
    dataset = file
else:
    dataset = "https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv"
sep = ","
# Pass the separator by keyword; newer pandas versions no longer accept it positionally
california_housing_dataframe = pd.read_csv(dataset, sep=sep)
####################################################
# Reorder
####################################################
newOrder = np.random.permutation(california_housing_dataframe.index)
california_housing_dataframe_reordered = california_housing_dataframe.reindex(newOrder)
####################################################
# Merge and show diff of the heads
####################################################
# Let's take the heads of both datasets and compare them
# They should differ in (almost) all rows
head1 = california_housing_dataframe.head(10)
head2 = california_housing_dataframe_reordered.head(10)
# @see https://stackoverflow.com/a/36893675/605890
merged = head1.merge(head2, indicator=True, how='outer')
print(merged)
Run on colab
I created a colab for the MWE, which is based on pandas 0.22.0:
https://colab.research.google.com/drive/19uDE_H4AtpLaEL6INrRrDMXkdANsNr69#scrollTo=CzxuGppV26Rt
If you run it, the _merge column of the output shows (unless a row happens to end up in both heads by chance):
- 10x left_only
- 10x right_only
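To make the left_only / right_only / both labels concrete, here is a minimal sketch on two tiny hand-made frames (hypothetical data, not the housing dataset). With indicator=True, merge adds a _merge column that records which side each row came from.

```python
import pandas as pd

# Hypothetical toy frames standing in for head1 and head2.
head1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
head2 = pd.DataFrame({"a": [2, 5], "b": [4, 6]})

# With no "on" argument, merge joins on all shared columns (here: a and b).
merged = head1.merge(head2, indicator=True, how="outer")

# Count how many rows fall into each category of the _merge column.
counts = merged["_merge"].value_counts()
print(counts)
```

Here the row (2, 4) exists in both frames and is labelled both, while the other two rows each exist on one side only. Ten left_only plus ten right_only rows therefore mean the two heads share no row at all.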
Run with docker containers
Now run the same MWE (located under /tmp/tf/Bug.py) in two different docker containers, both of which use pandas 0.23.4.
Both return:
- 10x both
This means that both heads are identical, i.e. reindex does not have any effect.
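One way to narrow down where the reordering gets lost is to check the two steps separately: whether np.random.permutation actually shuffled the index labels, and whether reindex returned the rows in the requested order. The sketch below uses a small synthetic frame (an assumption, to avoid the download); the names df, new_order and the two flags are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (assumption, avoids the download).
df = pd.DataFrame({"value": np.arange(100) * 10})

# Same two steps as in the MWE.
new_order = np.random.permutation(df.index)
reordered = df.reindex(new_order)

# True if the "permutation" came back in the original order,
# in which case reindex has nothing to reorder.
permutation_is_identity = np.array_equal(new_order, df.index.values)

# True if reindex returned the rows exactly in the order it was given.
reindex_followed_order = np.array_equal(reordered.index.values, new_order)

print("permutation is identity:", permutation_is_identity)
print("reindex followed new_order:", reindex_followed_order)
```

If the first flag is True while the second is also True, reindex did what it was asked and the problem lies in the permutation step rather than in reindex itself.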
Python docker container (python 3.6.6)
docker run --rm -it -v /tmp/tf/:/tmp/ python:3.6.6 /bin/bash -c "pip install pandas && python /tmp/Bug.py"
tensorflow docker container (tensorflow 1.11.0)
docker run --rm -it -v /tmp/tf/:/tmp/ tensorflow/tensorflow:1.11.0-py3 python /tmp/Bug.py
TLDR
The following code does not have any effect in pandas 0.23.4:
california_housing_dataframe_reordered = california_housing_dataframe.reindex(newOrder)
About this issue
- State: closed
- Created 6 years ago
- Comments: 16 (8 by maintainers)
This looks to be the root cause of the issue: https://github.com/numpy/numpy/issues/11975, which should be fixed in the next numpy release (1.15.3).
The workaround of using df.index.values, as suggested in the issue, appears to work:
My code:
and the result from the tensorflow container:
But it should be the following for the reindex, right?
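Separately from the question above, the df.index.values workaround mentioned earlier in this thread can be sketched as follows. This is a minimal illustration on a synthetic frame (an assumption), not the code posted in the comments: the permutation is computed on the plain NumPy array behind the index instead of on the pandas Index object.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; "value" is 10 times the row label.
df = pd.DataFrame({"value": np.arange(100) * 10})

# Permute the underlying NumPy array rather than the pandas Index object.
new_order = np.random.permutation(df.index.values)
reordered = df.reindex(new_order)

# Each row keeps its own data; only the row order changes.
print(reordered.head(10))
```

In the MWE this would correspond to passing california_housing_dataframe.index.values to np.random.permutation and leaving the reindex call unchanged.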