pandas: BUG: specifying fill_value in pandas.DataFrame.shift() messes with index of empty dataframes
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import pandas as pd

# Create an empty DataFrame and move columns "a" and "b" into a MultiIndex.
empty_df = pd.DataFrame(columns=["a", "b", "c"])
multi_index_empty_df = empty_df.set_index(keys=["a", "b"])

# The index names should be the same with or without fill_value.
# On pandas 1.0.5 this assertion fails.
grouped = multi_index_empty_df.groupby(["a", "b"])
assert grouped.shift(1).index.names == grouped.shift(1, fill_value=0).index.names
Problem description
I discovered this problem when a unit test failed on a function that performed a groupby and shift on an empty dataframe.
Specifying fill_value in pandas.DataFrame.shift() should not alter the index that was set on a dataframe. That does not hold for empty dataframes with a multi-index, as the example above shows.
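Until an upgrade is possible, one way to sidestep the behaviour described above is to skip the grouped shift entirely when the frame is empty. The following is only a sketch of that idea, not something taken from pandas itself: `shift_within_index_groups` is a hypothetical helper name, and it assumes the grouping keys are index levels (as in the example), so that an empty result has the same index and columns as the input.

```python
import pandas as pd


def shift_within_index_groups(df, levels, periods=1, fill_value=None):
    """Hypothetical helper: group by index levels and shift, guarding the empty case."""
    if df.empty:
        # Nothing to shift; return a copy so the original index is kept as-is.
        return df.copy()
    grouped = df.groupby(level=levels)
    if fill_value is None:
        return grouped.shift(periods)
    return grouped.shift(periods, fill_value=fill_value)


# Usage with the empty frame from the report.
empty = pd.DataFrame(columns=["a", "b", "c"]).set_index(keys=["a", "b"])
result = shift_within_index_groups(empty, ["a", "b"], fill_value=0)
print(result.index.names)
```

The guard relies on the assumption that shifting an empty frame is a no-op, so non-empty frames still go through the regular groupby and shift path.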
Expected Output
When executing:
multi_index_empty_df.groupby(["a","b"]).shift(1).index.names
I get the output:
FrozenList(['a', 'b'])
But when executing:
multi_index_empty_df.groupby(["a","b"]).shift(1, fill_value=0).index.names
I should get the same output, but instead I get:
FrozenList([None])
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.1.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.5
numpy : 1.19.0
pytz : 2018.7
dateutil : 2.7.5
pip : 18.1
setuptools : 40.6.3
Cython : 0.29.2
pytest : 4.0.2
hypothesis : None
sphinx : 1.8.2
blosc : None
feather : None
xlsxwriter : 1.1.2
lxml.etree : 4.2.5
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : 4.6.3
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.2.5
matplotlib : 3.2.2
numexpr : 2.6.8
odfpy : None
openpyxl : 2.5.12
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 4.0.2
pyxlsb : None
s3fs : None
scipy : 1.5.1
sqlalchemy : 1.2.15
tables : 3.4.4
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.1.2
numba : 0.41.0
About this issue
- State: closed
- Created 3 years ago
- Comments: 17 (7 by maintainers)
I did not create a unit test. I just tested in a notebook and, since the problem did not occur there, didn't go further. I will create a unit test, submit it, and link it here.
Thanks for the report @brunocous. This works on master and on the latest pandas release (1.2.4), but could probably use a test.
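A regression test along those lines might look roughly like the sketch below. It is not taken from the pandas test suite: the function names are hypothetical, and it only checks the index names discussed in this report. On versions where the behaviour is fixed (1.2.4 and master, per the comment above) both assertions should pass; on 1.0.5 the report indicates the second one fails.

```python
import pandas as pd


def make_empty_multiindex_frame():
    """Build the empty frame from the report: MultiIndex on a and b, one column c."""
    empty_df = pd.DataFrame(columns=["a", "b", "c"])
    return empty_df.set_index(keys=["a", "b"])


def test_groupby_shift_fill_value_keeps_index_names_on_empty_frame():
    # Hypothetical test name; not taken from the pandas test suite.
    df = make_empty_multiindex_frame()
    grouped = df.groupby(["a", "b"])

    without_fill = grouped.shift(1)
    with_fill = grouped.shift(1, fill_value=0)

    # Both results should keep the MultiIndex names set on the input.
    assert list(without_fill.index.names) == ["a", "b"]
    assert list(with_fill.index.names) == ["a", "b"]
```

Written as a plain function with bare assertions, it can be collected by pytest or called directly from a notebook.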