pandas: BUG: Issue with Pandas df.str.split with expand

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example


>>> import pandas as pd
>>> df=pd.DataFrame({'a':['Kickoff_2'],'b':['bla'],'c':[1],'x':[None],'y':[None],'z':[None]})
>>> df[['x','y','z']] = df['a'].str.split('_',expand=True,n=2)
...
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key

>>> df_long=pd.DataFrame({'a':['Kickoff_2','Kickoff','Kickoff_pass_jump'],'b':['bla','bla','bla'],'c':[1,2,3],'x':[None,None,None],'y':[None,None,None],'z':[None,None,None]})
>>> df_long[['x','y','z']] = df_long['a'].str.split('_',expand=True,n=2)
>>> df_long
                   a    b  c        x     y     z
0          Kickoff_2  bla  1  Kickoff     2  None
1            Kickoff  bla  2  Kickoff  None  None
2  Kickoff_pass_jump  bla  3  Kickoff  pass  jump

Problem description

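str.split() with expand=True only produces the requested number of columns if at least one row in the df splits into that many parts. In the first example no row contains two underscores, so the result has only two columns and the assignment to three columns fails. It would be nice if the missing columns in the first example were filled with NaNs instead.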

Expected Output

About this issue

  • State: open
  • Created 4 years ago
  • Comments: 20 (11 by maintainers)

Most upvoted comments

Just tried your suggested workaround. It results in an error (using Pandas 1.1.1 on the computer I am at right now): TypeError: ffill() got an unexpected keyword argument "columns"

I think for now, the cleanest workaround would be to append a row to the df that contains a string with enough separators in the relevant column, and then delete that row again afterwards. I will try that out tomorrow.
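For reference, a minimal sketch of that temporary-row workaround, assuming the separator is "_" and n=2 as in the example above. The sentinel string and the variable names are made up for illustration; pd.concat is used because DataFrame.append is no longer available in recent pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"a": ["Kickoff_2"], "b": ["bla"], "c": [1]})

sep = "_"
n = 2  # at most n splits, so n + 1 columns are wanted

# Append a sentinel string that contains exactly n separators, so that
# the split is guaranteed to yield n + 1 parts for at least one row.
with_sentinel = pd.concat([df["a"], pd.Series([sep * n])], ignore_index=True)

# Split, then drop the sentinel row (the last one) again.
parts = with_sentinel.str.split(sep, expand=True, n=n).iloc[:-1]

# Restore the original index so the assignment aligns row by row.
parts.index = df.index

df[["x", "y", "z"]] = parts
print(df)
```

The missing third part ends up as a missing value in column z, which is the behaviour the issue asks for.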

The point of the example in this bug report is that if you want to split the strings in a column at an underscore and expand them into three columns, it only works if there is at least one row whose string contains at least two underscores. So if you have a string like “part01_part02_part03” in at least one row, Pandas can expand the column into three new ones; if no row has two underscores, only fewer, it does not work. Why this behavior can be interpreted as a documentation issue is a bit difficult to understand. I have tried fixing the code, but what Pandas does with the keyword “expand” is even harder to understand.

It would be super nice to have the possibility to enforce the number of return values.

or maybe another DataFrame method that allows padding rows and columns something similar to np.pad https://numpy.org/doc/stable/reference/generated/numpy.pad.html

this would have more utility, and avoid a breaking API change on df.str.split(expand=True, n=…)

There is already a DataFrame.pad method whose semantics are different enough from NumPy's that DataFrame.pad should probably be deprecated. It is only an alias for DataFrame.fillna(method='ffill') after all.

Another future possibility is that, using __array_function__, we may be able to pad a DataFrame to a prescribed number of columns in such a manner using the NumPy API.

of course, this doesn’t address the return value of df.str.split directly, but the same would be achieved with an extra step.
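As a rough illustration of that extra step, here is a sketch that pads the split result with np.pad today, by going through a plain object array rather than any DataFrame padding method (which, as far as this thread goes, does not exist). The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series(["Kickoff_2"])
n = 2  # at most n splits, so n + 1 columns are wanted

parts = s.str.split("_", expand=True, n=n)

# Number of columns still missing to reach n + 1.
missing = (n + 1) - parts.shape[1]

# Pad only on the right-hand side of the column axis; rows are untouched.
values = np.pad(
    parts.to_numpy(dtype=object),
    pad_width=((0, 0), (0, missing)),
    mode="constant",
    constant_values=np.nan,
)

padded = pd.DataFrame(values, index=parts.index)
print(padded)
```

Going through NumPy drops the column labels, so the result gets default integer column labels again.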

Using the expand and n arguments of str.split() sounds like it should always return something of length n+1. It is also pretty confusing that in this example the split only works when the last row of the example df is present. If expand together with n filled up with NaN values, everything would work, independent of what follows in the df.
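Until something like that exists, the n+1 behaviour can be approximated with reindex. The helper below is a hypothetical sketch, not part of pandas; split_fixed is a made-up name.

```python
import pandas as pd


def split_fixed(series, sep, n):
    """Hypothetical helper: split into exactly n + 1 columns, padding with NaN."""
    parts = series.str.split(sep, expand=True, n=n)
    # reindex adds any missing columns (filled with NaN), so the result
    # always has the column labels 0..n regardless of the data.
    return parts.reindex(columns=range(n + 1))


df = pd.DataFrame({"a": ["Kickoff_2"], "b": ["bla"], "c": [1]})
df[["x", "y", "z"]] = split_fixed(df["a"], "_", n=2)
print(df)
```

With this, the first example from the report no longer raises, and column z is filled with a missing value.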