Keeping pandas dataframe column names when using Pipeline with OneHotEncoder

In this post, I will show how to create a simple custom scikit-learn transformer that makes it easy to combine OneHotEncoder with pandas dataframes.

Recently, I started working more with scikit-learn again to organize both my preprocessing flows and the flows for estimator training and testing. In my previous jobs I also used these pipelines, but I always ran into the same major frustration when combining:

scikit-learn Pipeline
Pandas Dataframe
scikit-learn OneHotEncoder

The frustration is that after applying a pipeline containing a OneHotEncoder to a pandas dataframe, I lose all of the column/feature names. Of course, it is possible to restore them afterwards using the `get_feature_names` functionality of the fitted encoder inside the pipeline, but that always felt like patching things up after the fact.

So for example, the following code:

import pandas as pd

cat_column = ['AA', "AA", "AB", "AA", "AA", "AC", "AC"]
df = pd.DataFrame({"categorical_column":cat_column})

print(df)

will give me a simple dataframe with just one categorical column:

  categorical_column
0                 AA
1                 AA
2                 AB
3                 AA
4                 AA
5                 AC
6                 AC

Now if I want to convert this to one-hot encoded data, I have multiple options. One option is the get_dummies function provided by pandas. The main problem I have with it is that it offers no standardized way of dealing with unseen values, columns that go missing when moving from training to test, etc.
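
As a minimal sketch of that problem (the train and test frames here are made up for illustration), calling get_dummies separately on two sets with different values yields two different sets of columns:

```python
import pandas as pd

# Hypothetical train/test data: the test set contains an unseen value ("AD")
# and lacks two values that were present during training ("AB" and "AC").
train = pd.DataFrame({"categorical_column": ["AA", "AB", "AC"]})
test = pd.DataFrame({"categorical_column": ["AA", "AD"]})

train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

# The two results no longer share the same columns
print(list(train_dummies.columns))
print(list(test_dummies.columns))
```

Any estimator trained on the first set of columns would have to be fed a manually realigned version of the second.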

Another way is to make use of the OneHotEncoder preprocessing transformer that is provided by scikit-learn. With the following code, you can easily get an array that is the one hot encoded representation:

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)

result = ohe.fit_transform(df)

print(ohe.categories_)
print(result)

Here we create a OneHotEncoder that returns a dense array (since I am going to work with dataframes later on, there is no need for a sparse matrix). We then fit the encoder on the dataframe created earlier and immediately transform the data with the fitted encoder. Note that scikit-learn 1.2 and later rename the sparse argument to sparse_output.

The first print statement will give us the categories found by the OneHotEncoder:

[array(['AA', 'AB', 'AC'], dtype=object)]

As you can see, three separate values were found. The second print statement gives us the following output:

[[1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 0. 1.]]

This output has three columns (one for each of the categories found by the OneHotEncoder) and seven rows (one for each row of the original dataframe). As you can see, all information about the column names has vanished from this output; it is only available in the fitted ohe transformer.

So if you now want to convert this back to a dataframe, you can do this with the following Python statement (categories_ holds one array per input column, so we take the first one):

print(pd.DataFrame(result, columns=ohe.categories_[0]))

which will give you the following output:

    AA   AB   AC
0  1.0  0.0  0.0
1  1.0  0.0  0.0
2  0.0  1.0  0.0
3  1.0  0.0  0.0
4  1.0  0.0  0.0
5  0.0  0.0  1.0
6  0.0  0.0  1.0

The problem is that this works easily for a single column, but with multiple columns you have to do a bit more coding to prefix each new column with the name of its original column (otherwise, what happens if two categorical columns both contain the value AA?).
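
To illustrate the extra coding involved, here is a minimal sketch that builds prefixed names by hand from categories_. The column names col_a and col_b and the name[value] format are just choices for this example, and the default sparse output is converted with toarray() so that the snippet does not depend on the name of the sparse argument:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Two hypothetical categorical columns that share the value "AA"
df = pd.DataFrame({"col_a": ["AA", "AB", "AA"], "col_b": ["AA", "AC", "AC"]})

ohe = OneHotEncoder()
encoded = ohe.fit_transform(df).toarray()

# categories_ holds one array per input column, in column order
names = [
    f"{col}[{cat}]"
    for col, cats in zip(df.columns, ohe.categories_)
    for cat in cats
]

result = pd.DataFrame(encoded, columns=names, index=df.index)
print(result)
```

This works, but it has to be repeated (and kept in sync with options such as drop) every time the encoder is used.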

Also, converting the output of the OneHotEncoder right after it has run is easy as long as you work with the encoder directly. However, if the encoder is part of an arbitrarily complex pipeline, it becomes trickier.
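
A small sketch of the pipeline case, using a deliberately trivial one-step pipeline (the step name onehot is arbitrary): the pipeline output carries no column labels, so the names have to be dug out of the fitted step afterwards.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"categorical_column": ["AA", "AB", "AC"]})

pipeline = Pipeline([("onehot", OneHotEncoder())])
result = pipeline.fit_transform(df)

# By default the result is a sparse matrix without any column labels
print(type(result))
print(hasattr(result, "columns"))

# The category information only lives inside the fitted step
fitted_ohe = pipeline.named_steps["onehot"]
print(fitted_ohe.categories_)
```

With more steps, or with encoders nested inside other transformers, locating the right fitted step becomes correspondingly more work.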

Since I have run into this problem more than once, did not find a solution that did exactly what I wanted/needed, and also like to understand how these things work, I decided to go for the next solution: build your own :-)

Implementation of DataFrameOneHotEncoder

To solve this problem, I implemented a new custom scikit-learn transformer, namely DataFrameOneHotEncoder. Its constructor arguments are exactly the same as those of the scikit-learn OneHotEncoder, with the addition of col_overrule_params. This additional argument is a dictionary in which you can provide, per column, the parameters that need to be overruled (e.g. if you want to apply drop="first" for one specific column).
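
To make the argument list concrete, a constructor along the following lines would fit that description. This is only a sketch under assumptions: it includes just the arguments that the fit method reads, the default values are assumed to mirror those of OneHotEncoder, and None as the default for col_overrule_params is an assumption as well (a fit method would then have to treat None as an empty dictionary).

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class DataFrameOneHotEncoder(BaseEstimator, TransformerMixin):
    """Skeleton only: constructor sketch, without fit and transform."""

    def __init__(
        self,
        categories="auto",
        drop=None,
        dtype=np.float64,
        handle_unknown="error",
        col_overrule_params=None,  # assumed default, e.g. {"col": {"drop": "first"}}
    ):
        # scikit-learn convention: store constructor arguments unchanged
        self.categories = categories
        self.drop = drop
        self.dtype = dtype
        self.handle_unknown = handle_unknown
        self.col_overrule_params = col_overrule_params


encoder = DataFrameOneHotEncoder(
    col_overrule_params={"categorical_column_2": {"drop": "first"}}
)
print(encoder.get_params())
```

Deriving from BaseEstimator is what provides get_params and set_params, so the transformer can be cloned and used inside a Pipeline.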

The fit method of this transformer is implemented as follows:

def fit(self, X, y=None):
    """Fit a separate OneHotEncoder for each of the columns in the dataframe

    Args:
        X: dataframe
        y: None, ignored. This parameter exists only for compatibility with
            Pipeline

    Returns:
        self

    Raises:
        TypeError if X is not of type DataFrame
    """
    if not isinstance(X, pd.DataFrame):
        raise TypeError(f"X should be of type dataframe, not {type(X)}")

    self.onehotencoders_ = []
    self.column_names_ = []

    for c in X.columns:
        # Construct the OHE parameters using the arguments
        ohe_params = {
            "categories": self.categories,
            "drop": self.drop,
            "sparse": False,
            "dtype": self.dtype,
            "handle_unknown": self.handle_unknown,
        }
        # and update it with potential overrule parameters for the current column
        ohe_params.update(self.col_overrule_params.get(c, {}))

        # Regardless of how we got the parameters, make sure we always set the
        # sparsity to False
        ohe_params["sparse"] = False

        # Now create, fit, and store the onehotencoder for current column c
        ohe = OneHotEncoder(**ohe_params)
        self.onehotencoders_.append(ohe.fit(X.loc[:, [c]]))

        # Get the feature names and replace each x0_ with empty and after that
        # surround the categorical value with [] and prefix it with the original
        # column name
        feature_names = ohe.get_feature_names()
        feature_names = [x.replace("x0_", "") for x in feature_names]
        feature_names = [f"{c}[{x}]" for x in feature_names]

        self.column_names_.append(feature_names)

    return self

As you can see, during the fit, we loop over all of the columns in the provided dataframe (after ensuring the X argument is actually a dataframe). For each column, we create a new sklearn OneHotEncoder using the arguments we received and fit this encoder to our current column. After that, we obtain the feature names and replace them with the desired format.
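
The renaming step can be looked at in isolation. The helper below is a hypothetical, slightly stricter variation on the replace call in fit: it assumes the raw names look like x0_AA and strips the prefix only at the start of each name, so a category value that itself contains x0_ stays intact.

```python
def format_feature_names(column, raw_names, prefix="x0_"):
    """Turn raw encoder names such as 'x0_AA' into 'column[AA]'."""
    formatted = []
    for name in raw_names:
        # Strip the prefix only at the start of the name
        if name.startswith(prefix):
            name = name[len(prefix):]
        formatted.append(f"{column}[{name}]")
    return formatted


print(format_feature_names("categorical_column", ["x0_AA", "x0_AB"]))
# ['categorical_column[AA]', 'categorical_column[AB]']
```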

The transform method is the following:

def transform(self, X):
    """Transform X using the one-hot-encoding per column

    Args:
        X: Dataframe that is to be one hot encoded

    Returns:
        Dataframe with onehotencoded data

    Raises:
        NotFittedError if the transformer is not yet fitted
        TypeError if X is not of type DataFrame
    """
    if not isinstance(X, pd.DataFrame):
        raise TypeError(f"X should be of type dataframe, not {type(X)}")

    if not hasattr(self, "onehotencoders_"):
        raise NotFittedError(f"{type(self).__name__} is not fitted")

    all_df = []

    for i, c in enumerate(X.columns):
        ohe = self.onehotencoders_[i]

        transformed_col = ohe.transform(X.loc[:, [c]])

        # Keep the index of X so the result lines up with the input rows
        df_col = pd.DataFrame(
            transformed_col, columns=self.column_names_[i], index=X.index
        )
        all_df.append(df_col)

    return pd.concat(all_df, axis=1)
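
Note that transform matches encoders to columns purely by position, so it relies on X having the same columns, in the same order, as during fit. One possible safeguard, sketched here as a hypothetical standalone helper that is not part of the code above, is to compare the columns explicitly:

```python
import pandas as pd


def check_columns_match(fitted_columns, X):
    """Raise if X does not have exactly the columns seen during fit, in order."""
    if list(X.columns) != list(fitted_columns):
        raise ValueError(
            f"Expected columns {list(fitted_columns)}, got {list(X.columns)}"
        )


train = pd.DataFrame({"a": ["AA"], "b": ["AB"]})
reordered = train[["b", "a"]]

check_columns_match(train.columns, train)  # passes silently

try:
    check_columns_match(train.columns, reordered)
except ValueError as err:
    print(err)
```

To use something like this inside the transformer, fit would have to store the column names it saw.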

(The full code can be found on GitHub)

A small example for this DataFrameOneHotEncoder, where I apply it to a dataframe similar to the one before (but this time with two copies of the categorical column), is the following:

# We want to use default parameters on all columns, except for
# the column categorical_column_2, where we want to drop the first value

cat_column = ['AA', "AB", "AA", "AA", "AB", "AB"]
df = pd.DataFrame({"categorical_column":cat_column, "categorical_column_2": cat_column})

df_ohe = DataFrameOneHotEncoder(col_overrule_params={"categorical_column_2":{"drop":"first"}})


print(df_ohe.fit_transform(df))

will give the following output:

   categorical_column[AA]  categorical_column[AB]  categorical_column_2[AB]
0                     1.0                     0.0                       0.0
1                     0.0                     1.0                       1.0
2                     1.0                     0.0                       0.0
3                     1.0                     0.0                       0.0
4                     0.0                     1.0                       1.0
5                     0.0                     1.0                       1.0

As you can see, both the categorical columns have been converted to one hot encoded columns, where each new column name is the combination of the original column followed by the categorical value surrounded with [].

Furthermore, you can also see that for the categorical_column_2, the encoder dropped the first value AA, as requested by the parameters when we created the DataFrameOneHotEncoder.

There are probably a million other ways this can be implemented. For me it was a nice way to start understanding more about the inner workings of the sklearn transformers. As stated before, the full source code for the DataFrameOneHotEncoder can be found on GitHub. I am always interested in hearing whether this helped anybody else. And if you have a better way, I am always interested in hearing that too!

Guido Diepen
