Representing Categorical Data with Target Encoding

04 Mar 2019 - Tags: prediction

Most machine learning algorithms require the input data to be a numeric matrix, where each row is a sample and each column is a feature. This makes sense for continuous features, where a larger number obviously corresponds to a larger value (features such as voltage, purchase amount, or number of clicks). How to represent categorical features is less obvious. Categorical features (such as state, merchant ID, domain name, or phone number) don’t have an intrinsic ordering, and so most of the time we can’t just represent them with random numbers. Who’s to say that Colorado is “greater than” Minnesota? Or DHL “less than” FedEx? To represent categorical data, we need to find a way to encode the categories numerically.

There are quite a few ways to encode categorical data. We can simply assign each category an integer randomly (called label encoding). Alternatively, we can create a new feature for each possible category, and set the feature to be 1 for each sample having that category, and otherwise set it to be 0 (called one-hot encoding). If we’re using neural networks, we could let our network learn the embeddings of categories in a high-dimensional space (called entity embedding, or in neural NLP models often just “embedding”).

However, these methods all have drawbacks. Label encoding doesn’t work well at all with non-ordinal categorical features. One-hot encoding leads to a humongous number of added features when your data contains a large number of categories. Entity embedding can only be used with neural network models (or at least with models which are trained using stochastic gradient descent).

A different encoding method which we’ll try in this post is called target encoding (also known as “mean encoding”, and really should probably be called “mean target encoding”). With target encoding, each category is replaced with the mean target value for samples having that category. The “target value” is the y-variable, or the value our model is trying to predict. This allows us to encode an arbitrary number of categories without increasing the dimensionality of our data!

Of course, there are drawbacks to target encoding as well. Target encoding introduces noise into the encoding of the categorical variables (noise which comes from the noise in the target variable itself). Also, naively applying target encoding can allow data leakage, leading to overfitting and poor predictive performance. To fix that problem, we’ll have to construct target encoders which prevent data leakage. And even with those leak-proof target encoders, there are situations where one would be better off using one-hot or other encoding methods. One-hot can be better in situations with few categories, or with data where there are strong interaction effects.

In this post we’ll evaluate different encoding schemes, build a cross-fold target encoder to mitigate the drawbacks of the naive target encoder, and determine how the performance of predictive models change based on the type of category encoding used, the number of categories in the dataset, and the presence of interaction effects.

Outline

Data
Baseline
Label Encoding
One-hot Encoding
Target Encoding
Cross-Fold Target Encoding
Leave-one-out Target Encoding
Effect of the Learning Algorithm
Dependence on the Number of Categories
Effect of Category Imbalance
Effect of Interactions
Suggestions

First let’s import the packages we’ll be using.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold 
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import BayesianRidge
from xgboost import XGBRegressor

np.random.seed(12345)

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

Data

To evaluate the effectiveness of different encoding algorithms, we’ll want to be able to generate data with different numbers of samples, features, and categories. Let’s make a function to generate categorical datasets, which allows us to set these different aspects of the data. The categories have a direct effect on the target variable which we’ll try to predict.

def make_categorical_regression(n_samples=100,
                                n_features=10,
                                n_informative=10,
                                n_categories=10,
                                imbalance=0.0,
                                noise=1.0,
                                n_cont_features=0,
                                cont_weight=0.1,
                                interactions=0.0):
    """Generate a regression problem with categorical features.
  
    Parameters
    ----------
    n_samples : int > 0
        Number of samples to generate
        Default = 100
    n_features : int > 0
        Number of categorical features to generate
        Default = 10
    n_informative : int >= 0
        Number of features to carry information about the target.
        Default = 10
    n_categories : int > 0
        Number of categories per feature.  Default = 10
    imbalance : float > 0
        How much imbalance there is in the number of occurrences of
        each category.  Larger values yield a higher concentration
        of samples in only a few categories.  An imbalance of 0 
        yields the same number of samples in each category.
        Default = 0.0
    noise : float > 0
        Noise to add to target.  Default = 1.0
    n_cont_features : int >= 0
        Number of continuous (non-categorical) features.
        Default = 0
    cont_weight : float > 0
        Weight of the continuous variables' effect.
        Default = 0.1
    interactions : float >= 0 and <= 1
        Proportion of the variance due to interaction effects.
        Note that this only adds interaction effects between the 
        categorical features, not the continuous features.
        Default = 0.0
        
    Returns
    -------
    X : pandas DataFrame
        Features.  Of shape (n_samples, n_features+n_cont_features)
    y : pandas Series of shape (n_samples,)
        Target variable.
    """
    
    
    def beta_binomial(n, a, b):
        """Beta-binomial probability mass function.
        
        Parameters
        ----------
        n : int
            Number of trials
        a : float > 0
            Alpha parameter
        b : float > 0
            Beta parameter
            
        Returns
        -------
        ndarray of size (n,)
            Probability mass function.
        """
        from scipy.special import beta
        from scipy.misc import comb
        k = np.arange(n+1)
        return comb(n, k)*beta(k+a, n-k+b)/beta(a, b)


    # Check inputs
    if not isinstance(n_samples, int):
        raise TypeError('n_samples must be an int')
    if n_samples < 1:
        raise ValueError('n_samples must be one or greater')
    if not isinstance(n_features, int):
        raise TypeError('n_features must be an int')
    if n_features < 1:
        raise ValueError('n_features must be one or greater')
    if not isinstance(n_informative, int):
        raise TypeError('n_informative must be an int')
    if n_informative < 0:
        raise ValueError('n_informative must be non-negative')
    if not isinstance(n_categories, int):
        raise TypeError('n_categories must be an int')
    if n_categories < 1:
        raise ValueError('n_categories must be one or greater')
    if not isinstance(imbalance, float):
        raise TypeError('imbalance must be a float')
    if imbalance < 0:
        raise ValueError('imbalance must be non-negative')
    if not isinstance(noise, float):
        raise TypeError('noise must be a float')
    if noise < 0:
        raise ValueError('noise must be positive')
    if not isinstance(n_cont_features, int):
        raise TypeError('n_cont_features must be an int')
    if n_cont_features < 0:
        raise ValueError('n_cont_features must be non-negative')
    if not isinstance(cont_weight, float):
        raise TypeError('cont_weight must be a float')
    if cont_weight < 0:
        raise ValueError('cont_weight must be non-negative')
    if not isinstance(interactions, float):
        raise TypeError('interactions must be a float')
    if interactions < 0:
        raise ValueError('interactions must be non-negative')
        
    # Generate random categorical data (using category probs drawn
    # from a beta-binomial dist w/ alpha=1, beta=imbalance+1)
    cat_probs = beta_binomial(n_categories-1, 1.0, imbalance+1)
    categories = np.empty((n_samples, n_features), dtype='uint64')
    for iC in range(n_features):
        categories[:,iC] = np.random.choice(np.arange(n_categories),
                                            size=n_samples,
                                            p=cat_probs)
        
    # Generate random values for each category
    cat_vals = np.random.randn(n_categories, n_features)
    
    # Set non-informative columns' effect to 0
    cat_vals[:,:(n_features-n_informative)] = 0
    
    # Compute target variable from categories and their values
    y = np.zeros(n_samples)
    for iC in range(n_features):
        y += (1.0-interactions) * cat_vals[categories[:,iC], iC]
      
    # Add interaction effects
    if interactions > 0:
        for iC1 in range(n_informative):
            for iC2 in range(iC1+1, n_informative):
                int_vals = np.random.randn(n_categories,
                                           n_categories)
                y += interactions * int_vals[categories[:,iC1],
                                             categories[:,iC2]]
    
    # Add noise
    y += noise*np.random.randn(n_samples)
    
    # Generate dataframe from categories
    cat_strs = [''.join([chr(ord(c)+49) for c in str(n)]) 
                for n in range(n_categories)]
    X = pd.DataFrame()
    for iC in range(n_features):
        col_str = 'categorical_'+str(iC)
        X[col_str] = [cat_strs[i] for i in categories[:,iC]]
        
    # Add continuous features
    for iC in range(n_cont_features):
        col_str = 'continuous_'+str(iC)
        X[col_str] = cont_weight*np.random.randn(n_samples)
        y += np.random.randn()*X[col_str]
                    
    # Generate series from target
    y = pd.Series(data=y, index=X.index)
    
    # Return features and target
    return X, y

Now, we can easily generate data to test our encoders on:

# Generate categorical data and target
X, y = make_categorical_regression(n_samples=2000,
                                   n_features=10,
                                   n_categories=100,
                                   n_informative=1,
                                   imbalance=2.0)

# Split into test and training data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5)

The ten features in the dataset we generated are all categorical:

X_train.sample(10)

	categorical_0	categorical_1	categorical_2	categorical_3	categorical_4	categorical_5	categorical_6	categorical_7	categorical_8	categorical_9
792	cf	c	d	a	ed	ca	dj	g	b	bj
276	di	b	bd	fg	d	j	e	a	hc	h
1016	ei	di	cj	he	hb	gh	b	bh	df	c
1372	ca	c	be	ce	cg	bf	de	fe	ba	fd
1860	db	dh	ba	bh	di	bh	db	bi	gf	bi
1431	h	ce	ea	i	eb	g	da	da	fc	e
328	j	db	df	fa	fe	g	c	h	da	bg
1708	cd	ci	f	be	e	fb	dc	bi	ec	da
1567	ei	cj	ch	bc	bb	f	ch	bi	c	he
1027	bi	bh	bf	ba	dc	da	g	cc	bi	ee

Using the pandas package, these are stored as the “object” datatype:

X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 523 to 583
Data columns (total 10 columns):
categorical_0    1000 non-null object
categorical_1    1000 non-null object
categorical_2    1000 non-null object
categorical_3    1000 non-null object
categorical_4    1000 non-null object
categorical_5    1000 non-null object
categorical_6    1000 non-null object
categorical_7    1000 non-null object
categorical_8    1000 non-null object
categorical_9    1000 non-null object
dtypes: object(10)
memory usage: 85.9+ KB

While all the features are categorical, the target variable is continuous:

y_train.hist(bins=20)
plt.xlabel('Target value')
plt.ylabel('Number of samples')
plt.show()

svg

Now the question is: which encoding scheme best allows us to glean the most information from the categorical features, leading to the best predictions of the target variable?

Baseline

For comparison, how well would we do if we just predicted the mean target value for all samples? We’ll use the mean absolute error (MAE) as our performance metric.

mean_absolute_error(y_train, 
                    np.full(y_train.shape[0], y_train.mean()))

1.139564825988808

So our predictive models should definitely be shooting for a mean absolute error of less than that! But, we added random noise with a standard deviation of 1, so even if our model is perfect, the best MAE we can expect is:

mean_absolute_error(np.random.randn(10000), 
                    np.zeros(10000))

0.7995403442995148

Label Encoding

The simplest categorical encoding method is label encoding, where each category is simply replaced with a unique integer. However, there is no intrinsic relationship between the categories and the numbers being used to replace them. In the diagram below, category A is replaced with 0, and B with 1 - but there is no reason to think that category A is somehow greater than category B.

We’ll create a scikit-learn-compatible transformer class with which to label encode our data. Note that we could instead just use scikit-learn’s LabelEncoder - although their version is a little wasteful in that it doesn’t choose a data type efficiently.

class LabelEncoder(BaseEstimator, TransformerMixin):
    """Label encoder.
    
    Replaces categorical column(s) with integer labels
    for each unique category in original column.

    """
    
    def __init__(self, cols=None):
        """Label encoder.
        
        Parameters
        ----------
        cols : list of str
            Columns to label encode.  Default is to label 
            encode all categorical columns in the DataFrame.
        """
        if isinstance(cols, str):
            self.cols = [cols]
        else:
            self.cols = cols
        
        
    def fit(self, X, y):
        """Fit label encoder to X and y
        
        Parameters
        ----------
        X : pandas DataFrame, shape [n_samples, n_columns]
            DataFrame containing columns to label encode
        y : pandas Series, shape = [n_samples]
            Target values.
            
        Returns
        -------
        self : encoder
            Returns self.
        """
        
        # Encode all categorical cols by default
        if self.cols is None:
            self.cols = [c for c in X if str(X[c].dtype)=='object']

        # Check columns are in X
        for col in self.cols:
            if col not in X:
                raise ValueError('Column \''+col+'\' not in X')

        # Create the map from objects to integers for each column
        self.maps = dict() #dict to store map for each column
        for col in self.cols:
            self.maps[col] = dict(zip(
                X[col].values, 
                X[col].astype('category').cat.codes.values
            ))
                        
        # Return fit object
        return self

        
    def transform(self, X, y=None):
        """Perform the label encoding transformation.
        
        Parameters
        ----------
        X : pandas DataFrame, shape [n_samples, n_columns]
            DataFrame containing columns to label encode
            
        Returns
        -------
        pandas DataFrame
            Input DataFrame with transformed columns
        """
        Xo = X.copy()
        for col, tmap in self.maps.items():
          
            # Map the column
            Xo[col] = Xo[col].map(tmap)
            
            # Convert to appropriate datatype
            max_val = max(tmap.values())
            if Xo[col].isnull().any(): #nulls, so use float!
                if max_val < 8388608:
                    dtype = 'float32'
                else:
                    dtype = 'float64'
            else:
                if max_val < 256:
                    dtype = 'uint8'
                elif max_val < 65536:
                    dtype = 'uint16'
                elif max_val < 4294967296:
                    dtype = 'uint32'
                else:
                    dtype = 'uint64'
            Xo[col] = Xo[col].astype(dtype)
            
        # Return encoded dataframe
        return Xo
            
            
    def fit_transform(self, X, y=None):
        """Fit and transform the data via label encoding.
        
        Parameters
        ----------
        X : pandas DataFrame, shape [n_samples, n_columns]
            DataFrame containing columns to label encode
        y : pandas Series, shape = [n_samples]
            Target values

        Returns
        -------
        pandas DataFrame
            Input DataFrame with transformed columns
        """
        return self.fit(X, y).transform(X, y)

Now we can convert the categories to integers:

# Label encode the categorical data
le = LabelEncoder()
X_label_encoded = le.fit_transform(X_train, y_train)
X_label_encoded.sample(10)

	categorical_0	categorical_1	categorical_2	categorical_3	categorical_4	categorical_5	categorical_6	categorical_7	categorical_8	categorical_9
884	13	40	83	5	82	32	44	6	23	55
1098	15	66	34	70	56	36	69	3	86	48
853	56	6	12	30	27	25	56	10	54	6
1667	17	6	77	8	22	65	22	33	5	22
1136	56	5	8	34	0	31	48	80	50	56
362	0	15	47	50	42	7	9	3	81	34
1977	21	19	22	56	56	13	1	43	13	34
1784	35	66	37	66	4	1	76	8	1	10
127	82	42	11	63	12	39	58	76	59	67
1489	2	66	35	1	4	1	45	24	23	24

But again, these integers aren’t related to the categories in any meaningful way - aside from the fact that each unique integer corresponds to a unique category.

We can create a processing pipeline that label-encodes the data, and then uses a Bayesian ridge regression to predict the target variable, and compute the cross-validated mean absolute error of that model.

# Regression model
model_le = Pipeline([
    ('label-encoder', LabelEncoder()),
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer(strategy='mean')),
    ('regressor', BayesianRidge())
])

# Cross-validated MAE
mae_scorer = make_scorer(mean_absolute_error)
scores = cross_val_score(model_le, X_train, y_train, 
                         cv=3, scoring=mae_scorer)
print('Cross-validated MAE: %0.3f +/- %0.3f'
      % (scores.mean(), scores.std()))

Cross-validated MAE: 1.132 +/- 0.022

That’s not much better than just predicting the mean!

The error is similarly poor on validation data.

# MAE on test data
model_le.fit(X_train, y_train)
y_pred = model_le.predict(X_test)
test_mae = mean_absolute_error(y_test, y_pred)
print('Validation MAE: %0.3f' % test_mae)

Validation MAE: 1.176

One-hot Encoding

One-hot encoding, sometimes called “dummy coding”, encodes the categorical information a little more intelligently. Instead of assigning random integers to categories, a new feature is created for each category. For each sample, the new feature is 1 if the sample’s category matches the new feature, otherwise the value is 0. This allows us to encode the categorical information numerically, without loss of information, but ends up adding a lot of columns when the original categorical feature has many unique categories.

Like before, we’ll create an sklearn transformer class to perform one-hot encoding. And again we could have used sklearn’s built-in OneHotEncoder class.

class OneHotEncoder(BaseEstimator, TransformerMixin):
    """One-hot encoder.
    
    Replaces categorical column(s) with binary columns 
    for each unique value in original column.

    """
    
    def __init__(self, cols=None, reduce_df=False):
        """One-hot encoder.
        
        Parameters
        ----------
        cols : list of str
            Columns to one-hot encode.  Default is to one-hot 
            encode all categorical columns in the DataFrame.
        reduce_df : bool
            Whether to use reduced degrees of freedom for encoding
            (that is, add N-1 one-hot columns for a column with N 
            categories). E.g. for a column with categories A, B, 
            and C: When reduce_df is True, A=[1, 0], B=[0, 1],
            and C=[0, 0].  When reduce_df is False, A=[1, 0, 0], 
            B=[0, 1, 0], and C=[0, 0, 1]
            Default = False
        """
        if isinstance(cols, str):
            self.cols = [cols]
        else:
            self.cols = cols
        self.reduce_df = reduce_df
        
        
    def fit(self, X, y):
        """Fit one-hot encoder to X and y
        
        Parameters
        ----------
        X : pandas DataFrame, shape [n_samples, n_columns]
            DataFrame containing columns to encode
        y : pandas Series, shape = [n_samples]
            Target values.
            
        Returns
        -------
        self : encoder
            Returns self.
        """
        
        # Encode all categorical cols by default
        if self.cols is None:
            self.cols = [c for c in X 
                         if str(X[c].dtype)=='object']

        # Check columns are in X
        for col in self.cols:
            if col not in X:
                raise ValueError('Column \''+col+'\' not in X')

        # Store each unique value
        self.maps = dict() #dict to store map for each column
        for col in self.cols:
            self.maps[col] = []
            uniques = X[col].unique()
            for unique in uniques:
                self.maps[col].append(unique)
            if self.reduce_df:
                del self.maps[col][-1]
        
        # Return fit object
        return self

        
    def transform(self, X, y=None):
        """Perform the one-hot encoding transformation.
        
        Parameters
        ----------
        X : pandas DataFrame, shape [n_samples, n_columns]
            DataFrame containing columns to one-hot encode
            
        Returns
        -------
        pandas DataFrame
            Input DataFrame with transformed columns
        """
        Xo = X.copy()
        for col, vals in self.maps.items():
            for val in vals:
                new_col = col+'_'+str(val)
                Xo[new_col] = (Xo[col]==val).astype('uint8')
            del Xo[col]
        return Xo
            
            
    def fit_transform(self, X, y=None):
        """Fit and transform the data via one-hot encoding.
        
        Parameters
        ----------
        X : pandas DataFrame, shape [n_samples, n_columns]
            DataFrame containing columns to one-hot encode
        y : pandas Series, shape = [n_samples]
            Target values

        Returns
        -------
        pandas DataFrame
            Input DataFrame with transformed columns
        """
        return self.fit(X, y).transform(X, y)

Now, instead of replacing categories with integer labels, we’ve create a new column for each category in each original column. The value in a given column is 1 when the original category matches, otherwise the value is 0. The values in the dataframe below are mostly 0s because the data we generated has so many categories.

# One-hot-encode the categorical data
ohe = OneHotEncoder()
X_one_hot = ohe.fit_transform(X_train, y_train)
X_one_hot.sample(10)

	categorical_0_h	...
294	0	...
900	1	...
39	0	...
443	0	...
1268	0	...
280	0	...
382	0	...
624	0	...
430	0	...
836	0	...

10 rows × 853 columns

Note that although we’ve now encoded the categorical data in a meaningful way, our data matrix is huge!

# Compare sizes
print('Original size:', X_train.shape)
print('One-hot encoded size:', X_one_hot.shape)

Original size: (1000, 10)
One-hot encoded size: (1000, 853)

We can fit the same model with the one-hot encoded data as we fit to the label-encoded data, and compute the cross-validated error.

# Regression model
model_oh = Pipeline([
    ('encoder', OneHotEncoder()),
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer(strategy='mean')),
    ('regressor', BayesianRidge())
])

# Cross-validated MAE
scores = cross_val_score(model_oh, X_train, y_train, 
                         cv=3, scoring=mae_scorer)
print('Cross-validated MAE: %0.3f +/- %0.3f'
      % (scores.mean(), scores.std()))

Cross-validated MAE: 1.039 +/- 0.028

Unlike with label encoding, when using one-hot encoding our predictions are definitely better than just guessing the mean - but not by a whole lot! Performance on the validation dataset is about the same:

# MAE on test data
model_oh.fit(X_train, y_train)
y_pred = model_oh.predict(X_test)
test_mae = mean_absolute_error(y_test, y_pred)
print('Validation MAE: %0.3f' % test_mae)

Validation MAE: 1.029

Target Encoding

The problem with one-hot encoding is that it greatly increases the dimensionality of the training data (by adding a new feature for each unique category in the original dataset). This often leads to poorer model performance due to the curse of dimensionality - i.e., all else being equal, it is harder for machine learning algorithms to learn from data which has more dimensions.

Target encoding allows us to retain actual useful information about the categories (like one-hot encoding, but unlike label encoding), while keeping the dimensionality of our data the same as the unencoded data (like label encoding, but unlike one-hot encoding). To target encode data, for each feature, we simply replace each category with the mean target value for samples which have that category.

Let’s create a transformer class which performs this target encoding.

class TargetEncoder(BaseEstimator, TransformerMixin):
    """Target encoder.
    
    Replaces categorical column(s) with the mean target value for
    each category.

    """
    
    def __init__(self, cols=None):
        """Target encoder
        
        Parameters
        ----------
        cols : list of str
            Columns to target encode.  Default is to target 
            encode all categorical columns in the DataFrame.
        """
        if isinstance(cols, str):
            self.cols = [cols]
        else:
            self.cols = cols
        
        
    def fit(self, X, y):
        """Fit target encoder to X and y
        
        Parameters
        ----------
        X : pandas DataFrame, shape [n_samples, n_columns]
            DataFrame containing columns to encode
        y : pandas Series, shape = [n_samples]
            Target values.
            
        Returns
        -------
        self : encoder
            Returns self.
        """
        
        # Encode all categorical cols by default
        if self.cols is None:
            self.cols = [col for col in X 
                         if str(X[col].dtype)=='object']

        # Check columns are in X
        for col in self.cols:
            if col not in X:
                raise ValueError('Column \''+col+'\' not in X')

        # Encode each element of each column
        self.maps = dict() #dict to store map for each column
        for col in self.cols:
            tmap = dict()
            uniques = X[col].unique()
            for unique in uniques:
                tmap[unique] = y[X[col]==unique].mean()
            self.maps[col] = tmap
            
        return self

        
    def transform(self, X, y=None):
        """Perform the target encoding transformation.
        
        Parameters
        ----------
        X : pandas DataFrame, shape [n_samples, n_columns]
            DataFrame containing columns to encode
            
        Returns
        -------
        pandas DataFrame
            Input DataFrame with transformed columns
        """
        Xo = X.copy()
        for col, tmap in self.maps.items():
            vals = np.full(X.shape[0], np.nan)
            for val, mean_target in tmap.items():
                vals[X[col]==val] = mean_target
            Xo[col] = vals
        return Xo
            
            
    def fit_transform(self, X, y=None):
        """Fit and transform the data via target encoding.
        
        Parameters
        ----------
        X : pandas DataFrame, shape [n_samples, n_columns]
            DataFrame containing columns to encode
        y : pandas Series, shape = [n_samples]
            Target values (required!).

        Returns
        -------
        pandas DataFrame
            Input DataFrame with transformed columns
        """
        return self.fit(X, y).transform(X, y)

Now, instead of creating a bazillion columns (like with one-hot encoding), we can simply replace each category with the mean target value for that category. This allows us to represent the categorical information in the same dimensionality, while retaining some information about the categories. By target-encoding the features matrix, we get a matrix of the same size, but filled with continuous values instead of categories:

# Target encode the categorical data
te = TargetEncoder()
X_target_encoded = te.fit_transform(X_train, y_train)
X_target_encoded.sample(10)

	categorical_0	categorical_1	categorical_2	categorical_3	categorical_4	categorical_5	categorical_6	categorical_7	categorical_8	categorical_9
711	-0.030636	0.192812	0.269273	0.131628	0.319769	0.190861	0.142159	0.393587	-0.766454	-0.312227
475	0.365423	-0.605778	-0.258930	0.038321	-0.283131	0.046135	0.316052	-0.120822	-0.425927	-0.163454
273	0.110462	1.093313	0.309760	0.474308	0.090909	0.003120	1.558923	0.244971	-0.387846	-0.327537
9	-0.848020	0.300673	0.125095	-0.650361	-0.252932	0.293856	-0.197504	0.050085	-0.587633	-0.413439
275	0.126068	0.180776	-0.143977	-0.131238	0.090909	-0.760367	0.326620	-0.037488	-0.121713	-0.244310
336	0.543412	-0.045947	0.180144	0.279675	-0.532591	0.338287	0.071977	0.113531	0.527567	1.290724
1462	-0.065061	-0.605778	0.172445	-0.268622	-0.283131	-0.270112	-0.197504	0.068821	0.371461	0.966579
1436	0.186988	0.564015	0.135396	0.474308	0.338006	0.479294	0.063384	0.342170	-0.054090	-0.163454
1611	0.350326	-0.188197	-0.537682	-0.391143	0.212399	-1.811086	0.204642	-0.622682	-0.425927	-0.489371
166	0.480637	0.975462	0.135396	-0.131238	0.323870	0.279502	-0.533179	0.050085	0.016449	1.410449

Note that the size of our target-encoded matrix is the same size as the original (unlike the huge one-hot transformed matrix):

# Compare sizes
print('Original size:', X_train.shape)
print('Target encoded size:', X_target_encoded.shape)

Original size: (1000, 10)
Target encoded size: (1000, 10)

Also, each column has exactly as many unique continuous values as it did categories. This is because we’ve simply replaced the category with the mean target value for that category.

# Compare category counts
print('Original:')
print(X_train.nunique())
print('\nTarget encoded:')
print(X_target_encoded.nunique())

Original:
categorical_0    84
categorical_1    81
categorical_2    85
categorical_3    88
categorical_4    84
categorical_5    86
categorical_6    88
categorical_7    88
categorical_8    90
categorical_9    79
dtype: int64
 
Target encoded:
categorical_0    84
categorical_1    81
categorical_2    85
categorical_3    88
categorical_4    84
categorical_5    86
categorical_6    88
categorical_7    88
categorical_8    90
categorical_9    79
dtype: int64

If we fit the same model as before, but now after target-encoding the categories, the error of our model is far lower!

# Regression model
model_te = Pipeline([
    ('encoder', TargetEncoder()),
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer(strategy='mean')),
    ('regressor', BayesianRidge())
])

# Cross-validated MAE
scores = cross_val_score(model_te, X_train, y_train, 
                         cv=3, scoring=mae_scorer)
print('Cross-validated MAE: %0.3f +/- %0.3f'
      % (scores.mean(), scores.std()))

Cross-validated MAE: 0.940 +/- 0.030

The performance on the test data is about the same, but slightly better, because we’ve given it more samples on which to train.

# MAE on test data
model_te.fit(X_train, y_train)
y_pred = model_te.predict(X_test)
test_mae = mean_absolute_error(y_test, y_pred)
print('Validation MAE: %0.3f' % test_mae)

Validation MAE: 0.933

While the error is lower using target encoding than with one-hot encoding, in naively target-encoding our categories, we’ve introduced a data leak from the target variable for one sample into the features for that same sample!

In the diagram above, notice how the i-th sample’s target value is used in the computation of the mean target value for the i-th sample’s category, and then the i-th sample’s category is replaced with that mean. Leaking the target variable into our predictors like that causes our learning algorithm to over-depend on the target-encoded features, which results in the algorithm overfitting on the data. Although we gain predictive power by keeping the dimensionality of our training data reasonable, we loose a lot of that gain by allowing our model to overfit to the target-encoded columns!

Cross-Fold Target Encoding

To clamp down on the data leakage, we need to ensure that we’re not using the using the target value from a given sample to compute its target-encoded values. However, we can still use other samples in the training data to compute the mean target values for this sample’s category.

There are a few different ways we can do this. We could compute the per-category target means in a cross-fold fashion, or by leaving the current sample out (leave-one-out).

First we’ll try cross-fold target encoding, where we’ll split the data up into \( N \) folds, and compute the means for each category in the \( i \)-th fold using data in all the other folds. The diagram below illustrates an example using 2 folds.

Let’s create a transformer class to perform the cross-fold target encoding. There are a few things we need to watch out for now which we didn’t have to worry about with the naive target encoder. First, we may end up with NaNs (empty values) even when there were categories in the original dataframe. This will happen for a category that appears in one fold, but when there are no examples of that category in the other folds. Also, we can’t perform cross-fold encoding on our test data, because we don’t have any target values for which to compute the category means! So, we have to use the category means from the training data in that case.

class TargetEncoderCV(TargetEncoder):
    """Cross-fold target encoder.
    """
    
    def __init__(self, n_splits=3, shuffle=True, cols=None):
        """Cross-fold target encoding for categorical features.
        
        Parameters
        ----------
        n_splits : int
            Number of cross-fold splits. Default = 3.
        shuffle : bool
            Whether to shuffle the data when splitting into folds.
        cols : list of str
            Columns to target encode.
        """
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.cols = cols
        

    def fit(self, X, y):
        """Fit cross-fold target encoder to X and y
        
        Parameters
        ----------
        X : pandas DataFrame, shape [n_samples, n_columns]
            DataFrame containing columns to encode
        y : pandas Series, shape = [n_samples]
            Target values.
            
        Returns
        -------
        self : encoder
            Returns self.
        """
        self._target_encoder = TargetEncoder(cols=self.cols)
        self._target_encoder.fit(X, y)
        return self

    
    def transform(self, X, y=None):
        """Perform the target encoding transformation.

        Uses cross-fold target encoding for the training fold,
        and uses normal target encoding for the test fold.

        Parameters
        ----------
        X : pandas DataFrame, shape [n_samples, n_columns]
            DataFrame containing columns to encode

        Returns
        -------
        pandas DataFrame
            Input DataFrame with transformed columns
        """

        # Use target encoding from fit() if this is test data
        if y is None:
            return self._target_encoder.transform(X)

        # Compute means for each fold
        self._train_ix = []
        self._test_ix = []
        self._fit_tes = []
        kf = KFold(n_splits=self.n_splits, shuffle=self.shuffle)
        for train_ix, test_ix in kf.split(X):
            self._train_ix.append(train_ix)
            self._test_ix.append(test_ix)
            te = TargetEncoder(cols=self.cols)
            if isinstance(X, pd.DataFrame):
                self._fit_tes.append(te.fit(X.iloc[train_ix,:],
                                            y.iloc[train_ix]))
            elif isinstance(X, np.ndarray):
                self._fit_tes.append(te.fit(X[train_ix,:],
                                            y[train_ix]))
            else:
                raise TypeError('X must be DataFrame or ndarray')

        # Apply means across folds
        Xo = X.copy()
        for ix in range(len(self._test_ix)):
            test_ix = self._test_ix[ix]
            if isinstance(X, pd.DataFrame):
                Xo.iloc[test_ix,:] = \
                    self._fit_tes[ix].transform(X.iloc[test_ix,:])
            elif isinstance(X, np.ndarray):
                Xo[test_ix,:] = \
                    self._fit_tes[ix].transform(X[test_ix,:])
            else:
                raise TypeError('X must be DataFrame or ndarray')
        return Xo

            
    def fit_transform(self, X, y=None):
        """Fit and transform the data via target encoding.
        
        Parameters
        ----------
        X : pandas DataFrame, shape [n_samples, n_columns]
            DataFrame containing columns to encode
        y : pandas Series, shape = [n_samples]
            Target values (required!).

        Returns
        -------
        pandas DataFrame
            Input DataFrame with transformed columns
        """
        return self.fit(X, y).transform(X, y)

With this encoder, we can convert the categories into continuous values, just like we did with the naive target encoding.

# Cross-fold Target encode the categorical data
te = TargetEncoderCV()
X_target_encoded_cv = te.fit_transform(X_train, y_train)
X_target_encoded_cv.sample(10)

	categorical_0	categorical_1	categorical_2	categorical_3	categorical_4	categorical_5	categorical_6	categorical_7	categorical_8	categorical_9
236	0.233017	0.266851	0.620411	-0.0917691	0.0238002	0.205387	-0.182844	0.209843	-0.201101	1.95302
103	0.0419825	-0.133423	-0.152355	0.768532	-0.105238	-0.0010254	-0.0768051	-0.164632	-0.108223	-1.16817
502	0.335987	0.686906	0.00831367	0.780618	0.253249	-0.588782	-0.0104415	-0.139042	0.258339	2.08561
870	0.251286	-0.221214	-0.21522	-0.528595	-0.320334	0.484078	0.593479	0.131563	0.152882	-0.333682
357	-0.105762	-0.108175	-0.2422	-1.08681	-0.0790136	-0.367782	0.287205	0.542695	0.064133	-0.670692
1372	0.169969	0.366924	0.399639	-0.0954622	0.0220233	-0.588782	-0.529951	0.233605	-0.260713	-0.130225
620	0.372039	0.110516	-0.259249	-0.0814691	0.294292	0.705151	0.300228	0.227451	0.185972	1.53523
1147	-0.294882	0.477974	0.531971	0.210054	-0.171589	-0.106227	0.0837924	-0.201896	-0.595051	0.659421
1650	-0.882803	0.647945	0.177125	-0.190479	0.644579	0.208487	0.657135	0.227451	-0.701029	-0.00746989
68	NaN	0.831874	-0.113836	-0.190479	-0.475449	-1.90497	-0.991536	0.649424	-0.326514	-0.2205

Like with normal target encoding, our transformed matrix is the same shape as the original:

# Compare sizes
print('Original size:', X_train.shape)
print('Target encoded size:', X_target_encoded_cv.shape)

Original size: (1000, 10)
Target encoded size: (1000, 10)

However, now we have more unique continuous values in each column than we did categories, because we’ve target-encoded the categories separately for each fold (since we used 3 folds, there are about 3 times as many unique values).

# Compare category counts
print('Original:')
print(X_train.nunique())
print('\nTarget encoded:')
print(X_target_encoded_cv.nunique())

Original:
categorical_0    84
categorical_1    81
categorical_2    85
categorical_3    88
categorical_4    84
categorical_5    86
categorical_6    88
categorical_7    88
categorical_8    90
categorical_9    79
dtype: int64
 
Target encoded:
categorical_0    214
categorical_1    203
categorical_2    201
categorical_3    203
categorical_4    208
categorical_5    207
categorical_6    207
categorical_7    205
categorical_8    213
categorical_9    200
dtype: int64

We can fit the same model as before, but now using cross-fold target encoding.

# Regression model
model_te_cv = Pipeline([
    ('encoder', TargetEncoderCV()),
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer(strategy='mean')),
    ('regressor', BayesianRidge())
])

# Cross-validated MAE
scores = cross_val_score(model_te_cv, X_train, y_train, 
                         cv=3, scoring=mae_scorer)
print('Cross-validated MAE: %0.3f +/- %0.3f'
      % (scores.mean(), scores.std()))

Cross-validated MAE: 0.835 +/- 0.044

Now our model’s error is very low - pretty close to the lower bound of around 0.8! And the cross-validated performance matches the performance on the validation data.

# MAE on test data
model_te_cv.fit(X_train, y_train)
y_pred = model_te_cv.predict(X_test)
test_mae = mean_absolute_error(y_test, y_pred)
print('Validation MAE: %0.3f' % test_mae)

Validation MAE: 0.839

Leave-one-out Target Encoding

We could also prevent the target data leakage by using a leave-one-out scheme. With this method, we compute the per-category means as with the naive target encoder, but we don’t include the current sample in that computation.

This may seem like it will take much longer than the cross-fold method, but it actually ends up being faster, because we can compute the mean without the effect of each sample in an efficient way. Normally the mean is computed with:

\[v = \frac{1}{N_C} \sum_{j \in C} y_j\]

where \( v \) is the target-encoded value for all samples having category \( C \), \( N_C \) is the number of samples having category \( C \), and \( j \in C \) indicates all the samples which have category \( C \).

With leave-one-out target encoding, we can first compute the count of samples having category \( C \) ( \( N_C \) ), and then separately compute the sum of the target values of those categories:

\[S_C = \sum_{j \in C} y_j\]

Then, the mean target value for samples having category \( C \), excluding the effect of sample \( i \), can be computed with

\[v_i = \frac{S_C - y_i}{N_C-1}\]

Let’s build a transformer class which performs the leave-one-out target encoding using that trick.

class TargetEncoderLOO(TargetEncoder):
    """Leave-one-out target encoder.
    """
    
    def __init__(self, cols=None):
        """Leave-one-out target encoding for categorical features.
        
        Parameters
        ----------
        cols : list of str
            Columns to target encode.
        """
        self.cols = cols
        

    def fit(self, X, y):
        """Fit leave-one-out target encoder to X and y
        
        Parameters
        ----------
        X : pandas DataFrame, shape [n_samples, n_columns]
            DataFrame containing columns to target encode
        y : pandas Series, shape = [n_samples]
            Target values.
            
        Returns
        -------
        self : encoder
            Returns self.
        """
        
        # Encode all categorical cols by default
        if self.cols is None:
            self.cols = [col for col in X
                         if str(X[col].dtype)=='object']

        # Check columns are in X
        for col in self.cols:
            if col not in X:
                raise ValueError('Column \''+col+'\' not in X')

        # Encode each element of each column
        self.sum_count = dict()
        for col in self.cols:
            self.sum_count[col] = dict()
            uniques = X[col].unique()
            for unique in uniques:
                ix = X[col]==unique
                self.sum_count[col][unique] = \
                    (y[ix].sum(),ix.sum())
            
        # Return the fit object
        return self

    
    def transform(self, X, y=None):
        """Perform the target encoding transformation.

        Uses leave-one-out target encoding for the training fold,
        and uses normal target encoding for the test fold.

        Parameters
        ----------
        X : pandas DataFrame, shape [n_samples, n_columns]
            DataFrame containing columns to encode

        Returns
        -------
        pandas DataFrame
            Input DataFrame with transformed columns
        """
        
        # Create output dataframe
        Xo = X.copy()

        # Use normal target encoding if this is test data
        if y is None:
            for col in self.sum_count:
                vals = np.full(X.shape[0], np.nan)
                for cat, sum_count in self.sum_count[col].items():
                    vals[X[col]==cat] = sum_count[0]/sum_count[1]
                Xo[col] = vals

        # LOO target encode each column
        else:
            for col in self.sum_count:
                vals = np.full(X.shape[0], np.nan)
                for cat, sum_count in self.sum_count[col].items():
                    ix = X[col]==cat
                    vals[ix] = (sum_count[0]-y[ix])/(sum_count[1]-1)
                Xo[col] = vals
            
        # Return encoded DataFrame
        return Xo
      
            
    def fit_transform(self, X, y=None):
        """Fit and transform the data via target encoding.
        
        Parameters
        ----------
        X : pandas DataFrame, shape [n_samples, n_columns]
            DataFrame containing columns to encode
        y : pandas Series, shape = [n_samples]
            Target values (required!).

        Returns
        -------
        pandas DataFrame
            Input DataFrame with transformed columns
        """
        return self.fit(X, y).transform(X, y)

Using the leave-one-out target encoder, we can target-encode the data like before:

# Cross-fold Target encode the categorical data
te = TargetEncoderLOO()
X_target_encoded_loo = te.fit_transform(X_train, y_train)
X_target_encoded_loo.sample(10)

	categorical_0	categorical_1	categorical_2	categorical_3	categorical_4	categorical_5	categorical_6	categorical_7	categorical_8	categorical_9
915	-0.012727	0.314121	-0.233903	0.175900	0.132491	-1.140074	0.069677	0.256593	-0.430652	-0.103827
1647	0.215025	0.394832	-0.449700	0.600029	-0.201291	0.260293	0.191454	-2.622372	0.165401	NaN
82	-0.025525	-0.097241	0.306800	0.302967	0.211843	-0.075201	0.487799	-0.062378	0.137296	-0.608199
328	-0.022190	0.183942	0.237844	-0.622328	-0.305189	0.444699	0.145842	0.451886	0.397333	0.897512
369	0.612622	-0.219190	0.430788	0.595137	-0.299720	-0.210595	-0.073953	-0.153809	0.342509	0.827628
1488	0.158465	-0.104834	0.117011	1.055839	0.071813	-0.166491	-0.598097	0.394334	-0.909585	-0.650951
800	0.287101	-0.072489	0.402740	-0.044492	0.157021	0.542464	0.601372	0.215462	0.184972	-2.275490
735	0.050074	0.252479	-0.164370	-0.143356	0.126941	0.417575	0.044603	-0.167643	NaN	-0.427960
1361	0.572234	0.578534	-0.286133	0.142657	-0.537536	-0.144131	1.009878	0.019303	-0.524030	0.880222
1355	-0.188658	-0.107702	0.052277	-0.121903	0.130991	-0.264774	0.082149	0.031939	0.136631	0.335117

The transformed matrix is stil the same size as the original:

# Compare sizes
print('Original size:', X_train.shape)
print('Target encoded size:', X_target_encoded_loo.shape)

Original size: (1000, 10)
Target encoded size: (1000, 10)

But now there are nearly as many unique values in each column as there are samples:

# Compare category counts
print('Original:')
print(X_train.nunique())
print('\nLeave-one-out target encoded:')
print(X_target_encoded_loo.nunique())

Original:
categorical_0    84
categorical_1    81
categorical_2    85
categorical_3    88
categorical_4    84
categorical_5    86
categorical_6    88
categorical_7    88
categorical_8    90
categorical_9    79
dtype: int64
 
Leave-one-out target encoded:
categorical_0    993
categorical_1    994
categorical_2    992
categorical_3    987
categorical_4    990
categorical_5    990
categorical_6    990
categorical_7    991
categorical_8    992
categorical_9    996
dtype: int64

Also, there are less empty values in the leave-one-out target encoded dataframe than there were in the cross-fold target encoded dataframe. This is because with leave-one-out target encoding, a value will only be null if it is the only category of that type (or if the original feature value was null).

# Compare null counts
print('Original null count:')
print(X_train.isnull().sum())
print('\nCross-fold target encoded null count:')
print(X_target_encoded_cv.isnull().sum())
print('\nLeave-one-out target encoded null count:')
print(X_target_encoded_loo.isnull().sum())

Original null count:
categorical_0    0
categorical_1    0
categorical_2    0
categorical_3    0
categorical_4    0
categorical_5    0
categorical_6    0
categorical_7    0
categorical_8    0
categorical_9    0
dtype: int64
 
Cross-fold target encoded null count:
categorical_0     9
categorical_1    12
categorical_2    22
categorical_3    21
categorical_4    12
categorical_5    15
categorical_6    19
categorical_7    20
categorical_8    23
categorical_9    15
dtype: int64
 
Leave-one-out target encoded null count:
categorical_0     7
categorical_1     6
categorical_2     8
categorical_3    13
categorical_4    10
categorical_5    10
categorical_6    10
categorical_7     9
categorical_8     8
categorical_9     4
dtype: int64

But more importantly, how well can our model predict the target variable when trained on the leave-one-out target encoded data?

# Regression model
model_te_loo = Pipeline([
    ('encoder', TargetEncoderLOO()),
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer(strategy='mean')),
    ('regressor', BayesianRidge())
])

# Cross-validated MAE
scores = cross_val_score(model_te_loo, X_train, y_train, 
                         cv=3, scoring=mae_scorer)
print('Cross-validated MAE: %0.3f +/- %0.3f'
      % (scores.mean(), scores.std()))

Cross-validated MAE: 0.833 +/- 0.038

# MAE on test data
model_te_loo.fit(X_train, y_train)
y_pred = model_te_loo.predict(X_test)
test_mae = mean_absolute_error(y_test, y_pred)
print('Validation MAE: %0.3f' % test_mae)

Validation MAE: 0.838

The leave-one-out target encoder performs slightly better than the cross-fold target encoder, because we’ve given it more samples with which to compute the per-category means (\( N-1 \), instead of \( N-N/K \), where K is the number of folds). While the increase in performance was very small, the leave-one-out target encoder is faster, due to the effecient way we computed the leave-one-out means (instead of having to compute means for each fold).

%%time
Xo = TargetEncoderCV().fit_transform(X_train, y_train)

CPU times: user 6.73 s, sys: 118 ms, total: 6.85 s
Wall time: 6.76 s

%%time
Xo = TargetEncoderLOO().fit_transform(X_train, y_train)

CPU times: user 4.25 s, sys: 25 ms, total: 4.27 s
Wall time: 4.27 s

Effect of the Learning Algorithm

The increase in predictive performance one gets from target encoding depends on the machine learning algorithm which is using it. As we’ve seen, target encoding is great for linear models (throughout this post we were using a Bayesian ridge regression, a variant on a linear regression which optimizes the regularization parameter). However, target encoding doesn’t help as much for tree-based boosting algorithms like XGBoost, CatBoost, or LightGBM, which tend to handle categorical data pretty well as-is.

Fitting the Bayesian ridge regression to the data, we see a huge increase in performance after target encoding (relative to one-hot encoding).

# Bayesian ridge w/ one-hot encoding
model_brr = Pipeline([
    ('encoder', OneHotEncoder()),
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer(strategy='mean')),
    ('regressor', BayesianRidge())
])

# Cross-validated MAE
scores = cross_val_score(model_brr, X_train, y_train, 
                         cv=3, scoring=mae_scorer)
print('MAE w/ Bayesian Ridge + one-hot encoding: %0.3f +/- %0.3f' 
      % (scores.mean(), scores.std()))

MAE w/ Bayesian Ridge + one-hot encoding: 1.039 +/- 0.028

# Bayesian ridge w/ target-encoding
model_brr = Pipeline([
    ('encoder', TargetEncoderLOO()),
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer(strategy='mean')),
    ('regressor', BayesianRidge())
])

# Cross-validated MAE
scores = cross_val_score(model_brr, X_train, y_train, 
                         cv=3, scoring=mae_scorer)
print('MAE w/ Bayesian Ridge + target encoding: %0.3f +/- %0.3f' 
      % (scores.mean(), scores.std()))

MAE w/ Bayesian Ridge + target encoding: 0.833 +/- 0.038

However, using XGBoost, there is only a modest perfomance increase (if any at all).

# Regression model
model_xgb = Pipeline([
    ('encoder', OneHotEncoder()),
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer(strategy='mean')),
    ('regressor', XGBRegressor())
])

# Cross-validated MAE
scores = cross_val_score(model_xgb, X_train, y_train, 
                         cv=3, scoring=mae_scorer)
print('MAE w/ XGBoost + one-hot encoding: %0.3f +/- %0.3f'
      % (scores.mean(), scores.std()))

MAE w/ XGBoost + one-hot encoding: 0.869 +/- 0.040

# Regression model
model_xgb = Pipeline([
    ('encoder', TargetEncoderLOO()),
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer(strategy='mean')),
    ('regressor', XGBRegressor())
])

# Cross-validated MAE
scores = cross_val_score(model_xgb, X_train, y_train, 
                         cv=3, scoring=mae_scorer)
print('MAE w/ XGBoost + target encoding: %0.3f +/- %0.3f'
      % (scores.mean(), scores.std()))

MAE w/ XGBoost + target encoding: 0.864 +/- 0.052

Dependence on the Number of Categories

There is also an effect of the number of categories on the performance of a model trained on target-encoded data. Target encoding works well with categorical data that contains a large number of categories. However, if you have data with only a few categories, you’re probably better off using one-hot encoding.

For example, let’s generate two datasets: one which has a large number of categories in each column, and another which has only a few categories in each column.

# Categorical data w/ many categories
X_many, y_many = make_categorical_regression(
    n_samples=1000, 
    n_features=10, 
    n_categories=100,
    n_informative=1,
    imbalance=2.0)

# Categorical data w/ few categories
X_few, y_few = make_categorical_regression(
    n_samples=1000, 
    n_features=10, 
    n_categories=5,
    n_informative=1,
    imbalance=2.0)

Then we’ll construct two separate models: one which uses target-encoding, and another which uses one-hot encoding.

# Regression model w/ target encoding
model_te = Pipeline([
    ('encoder', TargetEncoderLOO()),
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer(strategy='mean')),
    ('regressor', BayesianRidge())
])

# Regression model w/ one-hot encoding
model_oh = Pipeline([
    ('encoder', OneHotEncoder()),
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer(strategy='mean')),
    ('regressor', BayesianRidge())
])

On the dataset with many categories per column, target-encoding outperforms one-hot encoding by a good margin.

print('Many categories:')

# Target encoding w/ many categories
scores = cross_val_score(model_te, X_many, y_many, 
                         cv=3, scoring=mae_scorer)
print('MAE w/ target encoding: %0.3f +/- %0.3f' 
      % (scores.mean(), scores.std()))

# One-hot encoding w/ many categories
scores = cross_val_score(model_oh, X_many, y_many, 
                         cv=3, scoring=mae_scorer)
print('MAE w/ one-hot encoding: %0.3f +/- %0.3f' 
      % (scores.mean(), scores.std()))

Many categories:
MAE w/ target encoding: 0.820 +/- 0.029
MAE w/ one-hot encoding: 1.049 +/- 0.045

On the other hand, with the dataset containing only a few categories per column, the performance of the one-hot encoded model is nearly indistinguishable from the performance of the model which uses target encoding.

print('Few categories:')

# Target encoding w/ few categories
scores = cross_val_score(model_te, X_few, y_few, 
                         cv=3, scoring=mae_scorer)
print('MAE w/ target encoding: %0.3f +/- %0.3f' 
      % (scores.mean(), scores.std()))

# One-hot encoding w/ few categories
scores = cross_val_score(model_oh, X_few, y_few, 
                         cv=3, scoring=mae_scorer)
print('MAE w/ one-hot encoding: %0.3f +/- %0.3f' 
      % (scores.mean(), scores.std()))

Few categories:
MAE w/ target encoding: 0.815 +/- 0.030
MAE w/ one-hot encoding: 0.830 +/- 0.025

Effect of Category Imbalance

I would have expected target encoding to perform better than one-hot encoding when the categories were extremely unbalanced (most samples have one of only a few categories), and one-hot encoding to outperform target encoding in the case of balanced categories (categories appear about the same number of times thoughout the dataset). However, it appears that category imbalance effects both one-hot and target encoding similarly.

Let’s generate two datasets, one of which has balanced categories, and another which has highly imbalanced categories in each column.

# Categorical data w/ many categories
X_bal, y_bal = make_categorical_regression(
    n_samples=1000, 
    n_features=10, 
    n_categories=100,
    n_informative=1,
    imbalance=0.0)

# Categorical data w/ few categories
X_imbal, y_imbal = make_categorical_regression(
    n_samples=1000, 
    n_features=10, 
    n_categories=100,
    n_informative=1,
    imbalance=2.0)

Fitting the models from the previous section (one of which uses target encoding and the other uses one-hot encoding), we see that how imbalanced the data is doesn’t have a huge effect on the perfomance of the model which uses target encoding.

print('Target encoding:')

# Target encoding w/ imbalanced categories
scores = cross_val_score(model_te, X_imbal, y_imbal, 
                         cv=5, scoring=mae_scorer)
print('MAE w/ imbalanced categories: %0.3f +/- %0.3f' 
      % (scores.mean(), scores.std()))

# Target encoding w/ balanced categories
scores = cross_val_score(model_te, X_bal, y_bal, 
                         cv=5, scoring=mae_scorer)
print('MAE w/ balanced categories: %0.3f +/- %0.3f' 
      % (scores.mean(), scores.std()))

Target encoding:
MAE w/ imbalanced categories: 0.873 +/- 0.054
MAE w/ balanced categories: 0.845 +/- 0.041

Nor does it appear to have a big effect on the performance of the model which uses one-hot encoding.

print('One-hot encoding:')

# One-hot encoding w/ imbalanced categories
scores = cross_val_score(model_oh, X_imbal, y_imbal, 
                         cv=5, scoring=mae_scorer)
print('MAE w/ imbalanced categories: %0.3f +/- %0.3f' 
      % (scores.mean(), scores.std()))

# One-hot encoding w/ balanced categories
scores = cross_val_score(model_oh, X_bal, y_bal, 
                         cv=5, scoring=mae_scorer)
print('MAE w/ balanced categories: %0.3f +/- %0.3f'
      % (scores.mean(), scores.std()))

One-hot encoding:
MAE w/ imbalanced categories: 1.030 +/- 0.024
MAE w/ balanced categories: 0.993 +/- 0.029

I’ve tried various combinations of predictive models, levels of imbalance, and numbers of categories, and the level of imbalance doesn’t seem to have a very systematic effect. I suspect this is because for both target encoding and one-hot encoding, with balanced categories we have more information about all categories on average (because examples with each category are more evenly distributed). On the other hand, we have less information about the most common categories - because those categories are no more “common” than any other in a balanced dataset. Therefore, the level of uncertainty for those categories ends up actually being higher for balanced datasets. Those two effects appear to cancel out, and the predictive performance of our models don’t change.

Effect of Interactions

So far, target encoding has performed as well or better than other types of encoding. However, there’s one situation where target encoding doesn’t do so well: in the face of strong interaction effects.

An interaction effect is when the effect of one feature on the target variable depends on the value of a second feature. For example, suppose we have one categorical feature with categories A and B, and a second categorical feature with categories C and D. With no interaction effect, the effect of the first and second feature would be additive, and the effect of A and B on the target variable is independent of C and D. An example of this is the money spent as a function of items purchased. If a customer purchases both items 1 and 2, they will be charged the same as if they had purchased either item independently:

plt.bar(np.arange(4), [0, 2, 3, 5])
plt.ylabel('Cost')
plt.xticks(np.arange(4), 
           ['No purchases', 
            'Purchased only item 1', 
            'Purchased only item 2', 
            'Purchased both 1 + 2'])

svg

On the other hand, if there is an interaction effect, the effect on the target variable will not be simply the sum of the two features’ effects. For example, just adding sugar or stirring coffee may not have a huge effect on the sweetness of the coffee. But if one adds sugar and stirs, there is a large effect on the sweetness of the coffee.

plt.bar(np.arange(4), [1, 1, 3, 10])
plt.ylabel('Coffee sweetness')
plt.xticks(np.arange(4), 
           ['Nothing', 
            'Stir', 
            'Sugar', 
            'Sugar + stir'])

svg

Target encoding simply fills in each category with the mean target value for samples having that category. Because target encoding does this for each column individually, it’s fundamentally unable to handle interactions between columns! That said, one-hot encoding doesn’t intrinsically handle interaction effects either - it depends on the learning algorithm being used. Linear models (like the Bayesian ridge regression we’ve been using) can’t pull out interaction effects unless we explicitly encode them (by adding a column for each possible interaction). Nonlinear learning algorithms, such as decision tree-based models, SVMs, and neural networks, are able to detect interaction effects in the data as-is.

To see how well interaction effects are captured by models trained on target-encoded or one-hot-encoded data, we’ll create two categorical datasets: one which has no interaction effects, and one whose variance is completely explained by interaction effects (and noise).

# Categorical data w/ no interaction effects
X_no_int, y_no_int = make_categorical_regression(
    n_samples=1000, 
    n_features=10,
    n_categories=100,
    n_informative=2,
    interactions=0.0)

# Categorical data w/ interaction effects
X_inter, y_inter = make_categorical_regression(
    n_samples=1000, 
    n_features=10,
    n_categories=100,
    n_informative=2,
    interactions=1.0)

To capture interaction effects, we’ll have to use a model which can handle interactions, such as a tree-based method like XGBoost (a linear regression can’t capture interactions unless they are explicitly encoded).

# Regression model w/ target encoding
model_te = Pipeline([
    ('encoder', TargetEncoderLOO()),
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer(strategy='mean')),
    ('regressor', XGBRegressor())
])

# Regression model w/ one-hot encoding
model_oh = Pipeline([
    ('encoder', OneHotEncoder()),
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer(strategy='mean')),
    ('regressor', XGBRegressor())
])

As we’ve seen before, without interaction effects the target encoder performs better than the one-hot encoder.

print('No interaction effects:')

# Target encoding w/ no interaction effects
scores = cross_val_score(model_te, X_no_int, y_no_int, 
                         cv=5, scoring=mae_scorer)
print('MAE w/ target encoding: %0.3f +/- %0.3f'
      % (scores.mean(), scores.std()))

# One-hot encoding w/ no interaction effects
scores = cross_val_score(model_oh, X_no_int, y_no_int, 
                         cv=5, scoring=mae_scorer)
print('MAE w/ one-hot encoding: %0.3f +/- %0.3f' 
      % (scores.mean(), scores.std()))

No interaction effects:
MAE w/ target encoding: 1.013 +/- 0.033
MAE w/ one-hot encoding: 1.155 +/- 0.029

However, when most of the variance can be explained by interaction effects, the model trained on one-hot encoded data performs better (or at least it’s unlikely that the target-encoded model has better performance).

print('With interaction effects:')

# Target encoding w/ interaction effects
scores = cross_val_score(model_te, X_inter, y_inter, 
                         cv=5, scoring=mae_scorer)
print('MAE w/ target encoding: %0.3f +/- %0.3f' 
      % (scores.mean(), scores.std()))

# One-hot encoding w/ interaction effects
scores = cross_val_score(model_oh, X_inter, y_inter, 
                         cv=5, scoring=mae_scorer)
print('MAE w/ one-hot encoding: %0.3f +/- %0.3f' 
      % (scores.mean(), scores.std()))

With interaction effects:
MAE w/ target encoding: 1.222 +/- 0.035
MAE w/ one-hot encoding: 1.189 +/- 0.009

Suggestions

Target encoding categorical variables is a great way to represent categorical data in a numerical format that machine learning algorithms can handle, without jacking up the dimensionality of your training data. However, make sure to use cross-fold or leave-one-out target encoding to prevent data leakage! Also keep in mind the number of categories, what machine learning algorithm you’re using, and whether you suspect there may be strong interaction effects in your data. With only a few categories, or in the presence of interaction effects, you’re probably better off just using one-hot encoding and a boosting algorithm like XGBoost/CatBoost/LightGBM. On the other hand, if your data contains many columns with many categories, it might be best to use target encoding!

Brendan Hasz