pandas: BUG: concat unwantedly sorts DataFrame column names if they differ
When concat’ing DataFrames, the column names get alphanumerically sorted if there are any differences between them. If they’re identical across DataFrames, they don’t get sorted. This sort is undocumented and unwanted. Certainly the default behavior should be no-sort. EDIT: the standard order as in SQL would be: columns from df1 (same order as in df1), columns (uniquely) from df2 (less the common columns) (same order as in df2). Example:
df4a = DataFrame(columns=['C','B','D','A'], data=np.random.randn(3,4))
df4b = DataFrame(columns=['C','B','D','A'], data=np.random.randn(3,4))
df5 = DataFrame(columns=['C','B','E','D','A'], data=np.random.randn(3,5))
print "Cols unsorted:", concat([df4a,df4b])
# Cols unsorted: C B D A
print "Cols sorted", concat([df4a,df5])
# Cols sorted A B C D E
``'
About this issue
- Original URL
- State: closed
- Created 11 years ago
- Reactions: 9
- Comments: 36 (11 by maintainers)
Links to this issue
Commits related to this issue
- Stop concat from attempting from sorting result columns Preserve column order upon concatenation to obey least astonishment principle. Allow old behavior to be enabled by adding a boolean switch to ... — committed to brycepg/pandas by brycepg 6 years ago
- Stop concat from attempting to sort mismatched columns by default Preserve column order upon concatenation to obey least astonishment principle. Allow old behavior to be enabled by adding a boolean ... — committed to brycepg/pandas by brycepg 6 years ago
- Stop concat from attempting to sort mismatched columns by default Preserve column order upon concatenation to obey least astonishment principle. Allow old behavior to be enabled by adding a boolean ... — committed to brycepg/pandas by brycepg 6 years ago
- Stop concat from attempting to sort mismatched columns by default Preserve column order upon concatenation to obey least astonishment principle. Allow old behavior to be enabled by adding a boolean ... — committed to brycepg/pandas by brycepg 6 years ago
- Stop concat from attempting to sort mismatched columns by default Preserve column order upon concatenation to obey least astonishment principle. Allow old behavior to be enabled by adding a boolean ... — committed to brycepg/pandas by brycepg 6 years ago
- Stop concat from attempting to sort mismatched columns by default Preserve column order upon concatenation to obey least astonishment principle. Allow old behavior to be enabled by adding a boolean ... — committed to brycepg/pandas by brycepg 6 years ago
- Stop concat from attempting to sort mismatched columns by default Preserve column order upon concatenation to obey least astonishment principle. Allow old behavior to be enabled by adding a boolean ... — committed to brycepg/pandas by brycepg 6 years ago
- Stop concat from attempting to sort mismatched columns by default Preserve column order upon concatenation to obey least astonishment principle. Allow old behavior to be enabled by adding a boolean ... — committed to brycepg/pandas by brycepg 6 years ago
- Stop concat from attempting to sort mismatched columns by default Preserve column order upon concatenation to obey least astonishment principle. Allow old behavior to be enabled by adding a boolean ... — committed to TomAugspurger/pandas by brycepg 6 years ago
- Stop concat from attempting to sort mismatched columns by default (#20613) * Stop concat from attempting to sort mismatched columns by default Preserve column order upon concatenation to obey lea... — committed to pandas-dev/pandas by brycepg 6 years ago
This behavior is indeed quite unexpected and I also stumbled over it.
Naively one would expect that the order of columns is preserved. Instead the columns are sorted:
This can be corrected by reindexing with the original columns as follows:
Still it seems counter-intuitive that this automatic sorting takes place and cannot be disabled as far as I know.
Just stumbled upon this same issue when I was concatenating
DataFrames. It’s a little bit annoying if you don’t know about this issue, but actually there is a quick remedy:say
dfsis a list ofDataFramesyou want to concatenate, you can just take the the original column order and feed it back in:Could there be a parameter when creating dataFrame about ordering columns? Like order=False. Thanks a lot
just ran into this while creating a dataframe from a dictionary. Totally surprised me, was counterintuitive and defeated my whole purpose…
column names should be for clarity and the location of columns near each other is an organizational choice of the user to maintain coherency
I encountered this (with 0.13.1) from an edge case not mentioned: combining dataframes each containing unique columns. A naive re-assignment of column names didn’t work:
The columns still get sorted.
Using lists intermediately resolved things, though:Edit: I spoke too soon…
Follow-up: fwiw, column order can be preserved with chained
.joincalls on singular objects:This works for me:
cols = list(df1)+list(df2)df1 = pd.concat([df1, df2])df1 = df1.loc[:, cols]Hey guys, 2 things. 1) Welcome to pandas! I suggest just using more python native types like dictionaries. Stop trying to turn python (or any language) into SQL. 2) This is not technically a bug. It is just an unwanted effect of the code. You can easily overcome it outside the context of the package, and that is what I would think is the correct answer unless someone here takes it upon themselves.
I can’t see why preserving column order (as much as possible) isn’t the default behaviour of concat().
My workaround uses
unique_everseenfrom the Itertools Recipes.I believe
appendcauses the same behavior, FYILooking at this briefly I think this stems from Index.intersection, whose docstring states:
Not sure in which cases they appear/are sorted, but the case when the columns are equal (in your first one) is special cased to return the same result…
Any updates on the status of this thread? I am currently using version 0.22.0 and there still seems to be no proper solution. Procrastination seems to be quite an issue here…
I would also like to note that similar behaviour can be found when concatenating columns, i.e.
axis=1, but only when passing the dataframes in a dictionary:+1 for this feature, just ran across the same deal myself.
@summerela Get the column index and then re-index your new dataframe using the original column index
Conversely, if you need to re-index a smaller subset of columns, you could use this function I made. It can operate with relative indices as well. For example, if you wanted to move a column to the end of a dataframe, but you aren’t sure how many columns may remain after prior processing steps in your script (maybe you’re dropping zero-variance columns, for instance), you could pass a relative index position to
new_indices-->new_indices = [-1]and it will take care of the rest.Edit: Usage would look like the following: