pandas: BUG: concat unwantedly sorts DataFrame column names if they differ

When concatenating DataFrames, the column names get sorted alphanumerically if they differ between the frames; if they are identical across DataFrames, the original order is kept. This sort is undocumented and unwanted, and the default behavior should surely be no sort. EDIT: the standard order, as in SQL, would be: the columns of df1 (in df1's order), followed by the columns unique to df2, i.e. less the common columns (in df2's order). Example:

import numpy as np
from pandas import DataFrame, concat

df4a = DataFrame(columns=['C','B','D','A'], data=np.random.randn(3,4))
df4b = DataFrame(columns=['C','B','D','A'], data=np.random.randn(3,4))
df5  = DataFrame(columns=['C','B','E','D','A'], data=np.random.randn(3,5))

print("Cols unsorted:", concat([df4a,df4b]))
# Cols unsorted:           C         B         D         A

print("Cols sorted:", concat([df4a,df5]))
# Cols sorted:           A         B         C         D         E
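
To illustrate the order described in the EDIT above, here is a minimal sketch (not part of the original report) that builds the SQL-like column order from df4a and df5 defined above and reindexes the concatenated result; expected_cols and result are illustrative names:

# df4a's columns first (in df4a's order), then the columns that
# appear only in df5 (in df5's order)
expected_cols = list(df4a.columns) + [c for c in df5.columns if c not in df4a.columns]
result = concat([df4a, df5]).reindex(columns=expected_cols)
print(result.columns.tolist())   # ['C', 'B', 'D', 'A', 'E']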

About this issue

  • State: closed
  • Created 11 years ago
  • Reactions: 9
  • Comments: 36 (11 by maintainers)

Most upvoted comments

This behavior is indeed quite unexpected and I also stumbled over it.

>>> import pandas as pd
>>> df = pd.DataFrame()

>>> df['b'] = [1,2,3]
>>> df['c'] = [1,2,3]
>>> df['a'] = [1,2,3]
>>> print(df)
   b  c  a
0  1  1  1
1  2  2  2
2  3  3  3

[3 rows x 3 columns]
>>> df2 = pd.DataFrame({'a':[4,5]})
>>> df3 = pd.concat([df, df2])

Naively one would expect that the order of columns is preserved. Instead the columns are sorted:

>>> print(df3)
   a   b   c
0  1   1   1
1  2   2   2
2  3   3   3
0  4 NaN NaN
1  5 NaN NaN

[5 rows x 3 columns]

This can be corrected by reindexing with the original columns as follows:

>>> df4 = df3.reindex_axis(df.columns, axis=1)
>>> print(df4)
    b   c  a
0   1   1  1
1   2   2  2
2   3   3  3
0 NaN NaN  4
1 NaN NaN  5

[5 rows x 3 columns]

Still, it seems counter-intuitive that this automatic sorting takes place and, as far as I know, cannot be disabled.

Just stumbled upon this same issue when I was concatenating DataFrames. It’s a little bit annoying if you don’t know about this issue, but actually there is a quick remedy:

Say dfs is a list of DataFrames you want to concatenate; you can just take the original column order and feed it back in:

df = pd.concat(dfs, axis=0)
df = df[dfs[0].columns]

Could there be a parameter for ordering columns when creating a DataFrame, like order=False? Thanks a lot

Just ran into this while creating a dataframe from a dictionary. It totally surprised me, was counterintuitive, and defeated my whole purpose…

Column names should be for clarity, and placing columns near each other is an organizational choice the user makes to maintain coherency.

I encountered this (with 0.13.1) from an edge case not mentioned: combining dataframes each containing unique columns. A naive re-assignment of column names didn’t work:

dat = pd.concat([out_dust, in_dust, in_air, out_air])
dat.columns = [out_dust.columns + in_dust.columns + in_air.columns + out_air.columns]

The columns still get sorted. Using plain lists as an intermediate step seemed to resolve things, though.

Edit: I spoke too soon…


Follow-up: FWIW, column order can be preserved by chaining .join calls on single objects:

df1.join([df2, df3]) # sorts columns
df1.join(df2).join(df3) # column order retained

This works for me:

cols = list(df1) + list(df2)
df1 = pd.concat([df1, df2])
df1 = df1.loc[:, cols]

Hey guys, two things. 1) Welcome to pandas! I suggest just using more Python-native types like dictionaries; stop trying to turn Python (or any language) into SQL. 2) This is not technically a bug, just an unwanted effect of the code. You can easily work around it outside the package, and that is what I would consider the correct answer unless someone here takes it upon themselves to change it.

I can’t see why preserving column order (as much as possible) isn’t the default behaviour of concat().

My workaround uses unique_everseen from the Itertools Recipes.

columns = list(unique_everseen(column for df in dfs for column in df.columns))
df = pd.concat(dfs)[columns]
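
For reference, unique_everseen is not part of itertools itself; a minimal version of the recipe (simplified here, without the key argument) looks roughly like this:

def unique_everseen(iterable):
    # yield unique elements in first-seen order, remembering everything seen
    seen = set()
    for element in iterable:
        if element not in seen:
            seen.add(element)
            yield element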

I believe append causes the same behavior, FYI

Looking at this briefly I think this stems from Index.intersection, whose docstring states:

Form the intersection of two Index objects. Sortedness of the result is not guaranteed

Not sure in which cases the columns end up sorted, but the case where the columns are equal (as in your first example) is special-cased to return the same result…
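
As a rough illustration of that special-casing (a sketch, assuming Index.union is what combines the differing column sets during an outer concat; behavior as observed in the pandas versions discussed in this thread):

>>> idx1 = pd.Index(['C', 'B', 'D', 'A'])
>>> idx1.union(pd.Index(['C', 'B', 'D', 'A']))       # equal indexes: order preserved
Index(['C', 'B', 'D', 'A'], dtype='object')
>>> idx1.union(pd.Index(['C', 'B', 'E', 'D', 'A']))  # differing indexes: result is sorted
Index(['A', 'B', 'C', 'D', 'E'], dtype='object')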

Any updates on the status of this thread? I am currently using version 0.22.0 and there still seems to be no proper solution. Procrastination seems to be quite an issue here…

I would also like to note that similar behaviour can be found when concatenating columns, i.e. axis=1, but only when passing the dataframes in a dictionary:

>>> df4a = DataFrame(columns=['C','B','D','A'], data=np.random.randn(3,4))
>>> df4b = DataFrame(columns=['C','B','D','A'], data=np.random.randn(3,4))
>>> df5  = DataFrame(columns=['C','B','E','D','A'], data=np.random.randn(3, 5))

>>> pd.concat([df4a, df5], axis=1).columns
Index(['C', 'B', 'D', 'A', 'C', 'B', 'E', 'D', 'A'], dtype='object')
>>> pd.concat({'df4a': df4a, 'df4b': df4b}, axis=1).columns.levels
FrozenList([['df4a', 'df4b'], ['C', 'B', 'D', 'A']])
>>> pd.concat({'df4a': df4a, 'df5': df5}, axis=1).columns.levels
FrozenList([['df4a', 'df5'], ['A', 'B', 'C', 'D', 'E']])

+1 for this feature, just ran across the same deal myself.

@summerela Get the column index and then re-index your new dataframe using the original column index

# assuming you have two dataframes, `df_train` & `df_test` (with the same columns) 
# that you want to concatenate

# get the columns from one of them
all_columns = df_train.columns

# concatenate them
df_concat = pd.concat([df_train,
                       df_test])

# finally, re-index the new dataframe using the original column index
df_concat = df_concat.loc[:, all_columns]

Conversely, if you need to re-index a smaller subset of columns, you could use this function I made. It can operate with relative indices as well. For example, if you wanted to move a column to the end of a dataframe but aren't sure how many columns will remain after prior processing steps in your script (maybe you're dropping zero-variance columns, for instance), you could pass a relative index position such as new_indices=[-1] and it will take care of the rest.

import numpy as np
import pandas as pd

def reindex_columns(dframe=None, columns=None, new_indices=None):
    """
    Reorders the columns of a dataframe as specified by
    `reorder_indices`. Values of `columns` should align with their
    respective values in `new_indices`.

    `dframe`: pandas dataframe.

    `columns`: list,pandas.core.index.Index, or numpy array; columns to
    reindex.

    `reorder_indices`: list of integers or numpy array; indices
    corresponding to where each column should be inserted during
    re-indexing.
    """
    print("Re-indexing columns.")
    try:
        df = dframe.copy()

        # ensure parameters are of correct type and length
        assert isinstance(columns, (pd.Index,
                                    list,
                                    np.ndarray)),\
        "`columns` must be of type `pandas.Index`, `list`, or `numpy.ndarray`"

        assert isinstance(new_indices,
                          list),\
        "`new_indices` must be of type `list`"

        assert len(columns) == len(new_indices),\
        "Length of `columns` and `new_indices` must be equal"

        # check for negative values in `new_indices`
        if any(idx < 0 for idx in new_indices):

            # get a list of the negative values
            negatives = [value for value
                         in new_indices
                         if value < 0]

            # find the index location for each negative value in
            # `new_indices`
            negative_idx_locations = [new_indices.index(negative)
                                      for negative in negatives]

            # zip the lists
            negative_zipped = list(zip(negative_idx_locations,
                                       negatives))

            # replace the negatives in `new_indices` with their
            # absolute position in the index
            for idx, negative in negative_zipped:
                new_indices[idx] = df.columns.get_loc(df.columns[
                                                          negative])

        # re-order the index now
        # get all columns
        all_columns = df.columns

        # drop the columns that need to be re-indexed
        all_columns = all_columns.drop(columns)

        # now re-insert them at the specified locations
        zipped_columns = list(zip(new_indices,
                                  columns))

        for idx, column in zipped_columns:
            all_columns = all_columns.insert(idx,
                                             column)
        # re-index the dataframe
        df = df.loc[:, all_columns]

        print("Successfully re-indexed dataframe.")

    except Exception as e:
        print(e)
        print("Could not re-index columns. Something went wrong.")

    return df

Edit: Usage would look like the following:

# move 'Column_1' to the end, move 'Column_2' to the beginning
df = reindex_columns(dframe=df,
                     columns=['Column_1', 'Column_2'],
                     new_indices=[-1, 0])