Categorical Feature Encoding in SAS (Bayesian Encoders) — By Suraj Saini

suraj saini (Amar) · Published in Analytics Vidhya · 3 min read · Feb 19, 2021

What is Bayesian Encoding?

Bayesian Encoding is a type of encoding that takes into account intra-category variation and the target mean when encoding categorical variables. It is a form of target-based encoding and comes with several advantages. For example, it takes little effort to apply: each category maps to a single numeric value, so the encoded feature stays in one column instead of expanding into one column per category as one-hot encoding does.

In this blog post, we talk about the different Bayesian encoding techniques and how they work.

1. Target/Mean Encoding

Target or Mean Encoding is one of the most commonly used encoding techniques in Kaggle competitions.

Target encoding replaces each level of the categorical variable with the mean of the target variable for that level, computed on the training dataset.

Hence, we have to specify the target variable in the SAS Mean Encoding Macro, as shown in the code below.

Check out this link for more information about categorical variable encoding.

SAS Macro for Target/Mean Encoding

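%macro mean_encoding(dataset, var, target);
  proc sql;
    /* Mean of the target per category, rounded to the nearest 0.1 */
    create table mean_table as
      select &var as gr, round(mean(&target), 0.1) as mean_encode
      from &dataset
      group by &var;
    /* Attach the encoded value to every row; the result is written to NEW */
    create table new as
      select d.*, m.mean_encode
      from &dataset as d
      left join mean_table as m
        on d.&var = m.gr;
  quit;
%mend mean_encoding;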
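
As a rough illustration of the idea (a sketch, not the macro itself), the following computes unrounded per-category target means on a small made-up table and joins them back to the rows. The data and the names toy, city, bought, toy_means and toy_encoded are all hypothetical.

```sas
/* Hypothetical toy data: city is the categorical variable, bought is a 0/1 target */
data work.toy;
    input city $ bought;
    datalines;
A 1
A 0
A 1
A 1
B 0
B 1
C 0
C 0
;
run;

proc sql;
    /* Mean of the target within each category (no rounding in this sketch) */
    create table work.toy_means as
        select city, mean(bought) as mean_encode
        from work.toy
        group by city;

    /* Attach the per-category mean to every original row */
    create table work.toy_encoded as
        select t.*, m.mean_encode
        from work.toy as t
        left join work.toy_means as m
            on t.city = m.city;
quit;

proc print data=work.toy_encoded noobs;
run;
```

With these eight rows, category A is encoded as 0.75, B as 0.5, and C as 0.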

2. Weight of Evidence Encoding

“Weight of Evidence (WoE) is a measure of the ‘strength’ of a grouping technique that is used to separate good and bad. This method was developed primarily to build a predictive model to evaluate the risk of loan default in the credit and financial industry.

WoE will be 0 if P(Goods) / P(Bads) = 1, that is, if the outcome is random for that group. If P(Bads) > P(Goods), the odds ratio will be < 1 and the WoE will be < 0. If, on the other hand, P(Goods) > P(Bads) in a group, then WoE > 0.

WoE is well suited for Logistic Regression because the logit transformation is simply the log of the odds, i.e. ln(P(Goods)/P(Bads)). Therefore, by using WoE-coded predictors in Logistic Regression, the predictors are all prepared and coded to the same scale, and the parameters in the linear logistic regression equation can be directly compared.” The original definition is linked here.
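
Reading P(Goods) and P(Bads) as the within-category shares of target=1 and target=0 (an assumption, chosen because it matches how the macro below computes them), the quantity for a category c can be written as:

```latex
\mathrm{WoE}_c = \ln\!\left(\frac{P(\text{Goods} \mid c)}{P(\text{Bads} \mid c)}\right),
\qquad
P(\text{Bads} \mid c) = 1 - P(\text{Goods} \mid c)
```

For example, a category with P(Goods) = 0.75 gives ln(0.75 / 0.25) = ln 3 ≈ 1.10, a category with P(Goods) = 0.5 gives ln 1 = 0, and a category with P(Goods) = 0.2 gives ln(0.2 / 0.8) = ln 0.25 ≈ -1.39.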

SAS Macro for Weight of Evidence Encoding

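%macro woe_encoding(dataset, var, target);
  proc sql noprint;
    /* P(target=1) per category, rounded to the nearest 0.1 */
    create table stats as
      select &var as gr, round(mean(&target), 0.1) as mean_encode
      from &dataset
      group by &var;
  quit;
  data stats;
    set stats;
    good_prob = mean_encode;
    bad_prob = 1 - mean_encode;
    /* Replace exact zeros: bad_prob=0 breaks the division, good_prob=0 makes log() missing */
    if good_prob = 0 then good_prob = 0.0001;
    if bad_prob = 0 then bad_prob = 0.0001;
    me_by_bp = good_prob / bad_prob;
    woe_encode = log(me_by_bp);
  run;
  proc sql noprint;
    /* Attach the WoE value to every row; the result is written to NEW */
    create table new as
      select d.*, s.woe_encode
      from &dataset as d
      left join stats as s
        on d.&var = s.gr;
  quit;
%mend woe_encoding;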

3. Probability Ratio Encoding

Probability Ratio Encoding is similar to Weight of Evidence; the only difference is that the ratio of good to bad probability is used directly, without taking its logarithm. For each label, we calculate the mean of the target, that is, the probability of target=1 ( P(1) ), and also the probability of target=0 ( P(0) ). Then we calculate the ratio P(1)/P(0) and replace the labels with that ratio.

To avoid division by zero when a category has no rows with target=0, we substitute a small value for P(0); the macro below uses 0.0001. Check out this link for more information.

SAS Macro for Probability Ratio Encoding

%macro probability_encoding(dataset, var, target);
  proc sql noprint;
    /* P(target=1) per category, rounded to the nearest 0.1 */
    create table stats as
      select &var as gr, round(mean(&target), 0.1) as mean_encode
      from &dataset
      group by &var;
  quit;
  data stats;
    set stats;
    bad_prob = 1 - mean_encode;
    /* Avoid division by zero when a category has no target=0 rows */
    if bad_prob = 0 then bad_prob = 0.0001;
    prob_encode = mean_encode / bad_prob;
  run;
  proc sql noprint;
    /* Attach the ratio to every row; the result is written to NEW */
    create table new as
      select d.*, s.prob_encode
      from &dataset as d
      left join stats as s
        on d.&var = s.gr;
  quit;
%mend probability_encoding;
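
To make the zero guard concrete, here is a minimal sketch that applies the same ratio step to a few made-up per-category probabilities. The table ratio_demo and the values of p1 are hypothetical; only the 0.0001 substitution is taken from the macro above.

```sas
/* Hypothetical per-category P(1) values, standing in for the summary table */
data work.ratio_demo;
    input gr $ p1;
    p0 = 1 - p1;
    /* Same guard as the macro: replace an exact zero before dividing */
    if p0 = 0 then p0 = 0.0001;
    prob_encode = p1 / p0;
    datalines;
A 0.8
B 0.5
C 1.0
;
run;

proc print data=work.ratio_demo noobs;
run;
```

Here A and B get ratios of about 4 and 1. Category C has no target=0 rows, so the substituted denominator pushes its ratio to roughly 10000. Such categories end up with very large encoded values, which is worth keeping in mind when the feature is fed to a scale-sensitive model.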

Wrapping Up

Categorical Feature Encoding is an important part of preparing data for machine learning models. Each method suits different circumstances, so it is important to know the different techniques that fall under the Bayesian category.

If you want to see how the code runs in a SAS environment, you can find all the SAS macro definitions on my GitHub page here.

Originally published at https://seleritysas.com on February 19, 2021.

suraj saini (Amar)
Analytics Vidhya

SAS Certified Programming Specialist, passionate about Machine Learning, Feature Engineering and Data Science.