pandas: DataFrame.copy(deep=True) is not a deep copy of the index

Code Sample, a copy-pastable example if possible

import pandas as pd

df1 = pd.DataFrame(index=['a', 'b'], columns=['foo', 'muu'])
df1.index.name = "foo"
print(df1)

# create deep copy of df1 and change a value in the index
df2 = df1.copy(deep=True)
df2.index.name = "bar"
df2.index.values[0] = 'c'  # changes both df1 and df2

print(df1)
print(df2)

Problem description

DataFrame.copy(deep=True) is not a deep copy of the index.

Maybe deep should be set to True at this line?

https://github.com/pandas-dev/pandas/blob/a00154dcfe5057cb3fd86653172e74b6893e337d/pandas/core/indexes/base.py#L787
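One way to check whether a copy still shares its index buffer with the original is NumPy's np.shares_memory. The snippet below is only a diagnostic sketch: the helper name index_shares_memory is hypothetical (not a pandas function), and an integer index is assumed so that .values is a plain ndarray. What copy(deep=True) reports depends on the installed pandas version, so no output is shown.

```python
import numpy as np
import pandas as pd


def index_shares_memory(left, right):
    """Hypothetical helper: True if the two frames' index buffers overlap in memory."""
    return np.shares_memory(left.index.values, right.index.values)


df1 = pd.DataFrame({"foo": [1.0, 2.0]}, index=[10, 20])
df2 = df1.copy(deep=True)

# Whether this prints True or False depends on the installed pandas version.
print(index_shares_memory(df1, df2))

# An explicit deep copy of the Index allocates a new buffer for the labels.
df2.index = df1.index.copy(deep=True)
print(index_shares_memory(df1, df2))
```

Assigning df1.index.copy(deep=True) to the copy is a possible workaround when the labels really must be independent.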

Expected Output

     foo  muu
foo          
a    NaN  NaN
b    NaN  NaN
     foo  muu
foo          
c    NaN  NaN
b    NaN  NaN
     foo  muu
bar          
c    NaN  NaN
b    NaN  NaN

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.1
scipy: 0.19.1
pyarrow: 0.8.0
xarray: 0.9.6
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 0.9.8
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.5.0

About this issue

  • State: open
  • Created 6 years ago
  • Reactions: 4
  • Comments: 20 (12 by maintainers)

Most upvoted comments

IMO, copy(deep=True) should completely sever all connections between the original and the copied object. Compare the official Python docs (https://docs.python.org/3/library/copy.html):

"A deep copy constructs a new compound object and then, recursively, inserts copies into it of the objects found in the original."

So, IMO, deep=True should come to mean what deep='all' does currently (and the latter can then be removed).
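For reference, the standard-library behaviour described in the Python docs quoted above can be seen with a nested list. This is a small illustration of copy.copy versus copy.deepcopy, not pandas code:

```python
import copy

original = [[1, 2], [3, 4]]

shallow = copy.copy(original)      # new outer list, same inner lists
deep = copy.deepcopy(original)     # new outer list and new inner lists

original[0][0] = 99

print(shallow[0][0])  # 99: the inner list is still shared
print(deep[0][0])     # 1: the deep copy is fully independent
```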

Re:

"Indexes are immutable. Changing its underlying data is going to cause all sorts of problems."

This is not a valid argument IMO - it’s up to me as a user (consenting adults and all…) what I do with my objects, including the indexes, and if I make a deep copy, it’s a justified expectation (I would even argue: a built-in expectation of the word “deep”) that this will not mess with the original.

Plus, if I’m already deep-copying the much larger values of a DF, not copying the index only saves a comparatively irrelevant amount of memory.
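A rough way to compare the two sizes is DataFrame.memory_usage, which reports the index separately under the "Index" label. The frame shape below is an arbitrary assumption chosen for illustration, not a measurement from this issue:

```python
import numpy as np
import pandas as pd

# Illustrative sizes only: 1,000 rows by 50 float64 columns with an int64 index.
n_rows, n_cols = 1_000, 50
df = pd.DataFrame(
    np.zeros((n_rows, n_cols)),
    index=np.arange(n_rows, dtype="int64"),
)

usage = df.memory_usage(index=True)        # bytes per column, plus an "Index" entry
index_bytes = usage["Index"]
values_bytes = usage.drop("Index").sum()

print(index_bytes, values_bytes)
```

For a wide frame like this one, the index accounts for a small fraction of the total, which is the point made above.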

OK. I think the documentation of copy is unclear then: "Make a deep copy, including a copy of the data and the indices."

I also ran into this today and discovered that, even though the id of the index was different on the copy, modifying the cp.index.to_numpy() values corrupted the original.
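A defensive pattern for this case, sketched under the assumption that Index.to_numpy(copy=True) is available (pandas 0.24 or later), is to request an explicit copy of the labels, edit that array, and assign a new Index rather than writing into the existing one:

```python
import pandas as pd

df = pd.DataFrame({"foo": [1.0, 2.0]}, index=[10, 20])
cp = df.copy(deep=True)

# copy=True asks for a fresh array rather than a possible view of the index data.
labels = cp.index.to_numpy(copy=True)
labels[0] = 99  # mutates only the private array

# Assign a new Index instead of editing the existing one in place.
cp.index = pd.Index(labels, name=cp.index.name)

print(list(df.index))
print(list(cp.index))
```

This avoids relying on whether the copy's index shares a buffer with the original.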

I am totally in line with @DanielGoldfarb's point 1:

"I believe all of the few languages with which I am familiar (including python), adhere to the convention that “deep” means a complete severance of all connections between the original and the copied object. Imho it’s therefore difficult to justify that deep=True should be anything other than deep='all'."

A fix for this could be composed of the following elements:

  • Have deep=True behave (and be documented) as intuition suggests, that is, with absolutely nothing shared between the copy and the original. The deep='all' alias can stay around, but if it is not yet official, it might be better to drop it now.

  • Accept a new deep='values' where only the values are deep-copied. This is the same behaviour as today’s deep=True. Make this the default to preserve legacy compatibility and speed.

  • Optionally accept a new deep='index' where only the index is deep-copied. I do not really know why this would be needed; it is just for symmetry of the API.

Would this be OK for everyone?
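The three modes proposed above could be prototyped outside pandas roughly as follows. This is a sketch only: copy_frame is a hypothetical helper, the mode names come from the proposal rather than from any released pandas API, and copying the column labels along with the row index under deep='index' is an assumption.

```python
import numpy as np
import pandas as pd


def copy_frame(df, deep="values"):
    """Hypothetical helper sketching the proposed modes; not a pandas API.

    deep=False    -> shallow copy
    deep="values" -> copy the values only (the proposed default)
    deep="index"  -> copy the row and column labels only
    deep=True     -> copy both values and labels
    """
    if deep is True:
        copy_values, copy_labels = True, True
    elif deep is False:
        copy_values, copy_labels = False, False
    elif deep == "values":
        copy_values, copy_labels = True, False
    elif deep == "index":
        copy_values, copy_labels = False, True
    else:
        raise ValueError("deep must be True, False, 'values' or 'index'")

    out = df.copy(deep=copy_values)
    if copy_labels:
        # Index.copy(deep=True) allocates new label buffers.
        out.index = df.index.copy(deep=True)
        out.columns = df.columns.copy(deep=True)
    return out


df1 = pd.DataFrame({"foo": [1.0, 2.0]}, index=[10, 20])
full = copy_frame(df1, deep=True)

# The fully deep copy owns its own index buffer.
print(np.shares_memory(full.index.values, df1.index.values))
```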

The example looks to work on master. It could use a test.

In [38]: df1 = pd.DataFrame(index=['a', 'b'], columns=['foo', 'muu'])
    ...: df1.index.name = "foo"
    ...: print(df1)
    ...:
    ...: # create deep copy of df1 and change a value in the index
    ...: df2 = df1.copy(deep=True)
    ...: df2.index.name = "bar"
    ...: df2.index.values[0] = 'c'  # changes both df1 and df2
    ...:
    ...: print(df1)
    ...: print(df2)
     foo  muu
foo
a    NaN  NaN
b    NaN  NaN
     foo  muu
foo
c    NaN  NaN
b    NaN  NaN
     foo  muu
bar
c    NaN  NaN
b    NaN  NaN

In [39]: pd.__version__
Out[39]: '1.1.0.dev0+1216.gd4d58f960'