SAS Custom Macros That Make Feature Engineering Easy for Data Scientists, Data Engineers and Machine Learning Specialists

suraj saini (Amar)
Dec 16, 2020

1. One-Hot Encoding

There are many definitions of one-hot encoding on the internet. In general terms, it is the process of converting a categorical variable into a set of binary (1/0) variables, one per category level. These variables can be fed to machine learning, deep learning, and statistical algorithms that cannot use categorical values directly.

A. hot_encode SAS Custom Macro Explanation

%hot_encode (Dataset_name, Variable_name)

1. Dataset_name = The dataset, qualified with its SAS library name, for instance sashelp.cars.

2. Variable_name = Name of the categorical variable that you want to one-hot encode.

B. What does hot_encode () do behind the scenes?

It creates a new dataset, “encoded_data”, in the work library, containing the new binary encoded variables that you can use to train your ML models. Remember that tables in the work library are temporary: either save this table to a permanent library or merge the new variables into your existing dataset, whichever you prefer.
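The behaviour described above can be sketched as follows. This is a minimal illustration, not the author's macro: the name hot_encode_sketch, the indicator names (<variable>_1, <variable>_2, ...), and the labels are assumptions, and it only handles character variables whose values contain no quotes, ampersands, or percent signs.

```sas
/* Hypothetical sketch, not the author's macro. Assumes a character
   variable whose values contain no quotes, ampersands, or percent signs. */
%macro hot_encode_sketch(dataset_name, variable_name);
    %local i n_levels;

    /* Store each distinct non-missing level in its own macro variable. */
    proc sql noprint;
        select distinct &variable_name
            into :level1 - :level999
            from &dataset_name
            where not missing(&variable_name);
    quit;
    %let n_levels = &sqlobs;

    /* Add one 0/1 indicator per level: <variable>_1, <variable>_2, ... */
    data work.encoded_data;
        set &dataset_name;
        %do i = 1 %to &n_levels;
            /* A comparison evaluates to 1 when true and 0 when false. */
            &variable_name._&i = (&variable_name = "&&level&i");
            label &variable_name._&i = "&variable_name is &&level&i";
        %end;
    run;
%mend hot_encode_sketch;

/* Origin has three levels in sashelp.cars, so three indicators are added. */
%hot_encode_sketch(sashelp.cars, Origin);
```

Numbered indicator names are used here because category values are not always valid SAS variable names; the labels record which level each indicator represents.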

C. Example of hot_encode ()

D. SAS Macro Definition Code for One Hot Encoding

2. Outlier Detection Method

There is a plethora of methods and algorithms for finding outliers and extreme values in a dataset. The custom SAS macro that I built runs a normality test and then decides whether to use the standard deviation method or the percentile method to find the extreme values.

If a variable is normally distributed, the macro uses the standard deviation method to find outliers; otherwise, it uses the percentile method.

A. Outliers SAS Custom Macro Explanation

%Outliers (Dataset_name, Variable_name)

1. Dataset_name = The dataset, qualified with its SAS library name, for instance sashelp.cars.

2. Variable_name= Name of a variable in which you want to find outliers.

B. What does Outliers () do behind the scenes?

It creates a new dataset, “Outliers”, in the work library, containing only the observations that are considered extreme values. It runs a normality test and stores the test statistics in the ‘Test’ table, which you can find in the work library right after the macro finishes. If your variable is normally distributed, every observation more than 3 standard deviations above or below the mean is treated as an outlier.

If your variable is not normally distributed, the macro uses the percentile method instead: the outliers are all observations above the 99th percentile or below the 1st percentile. A ‘Range’ table is also created in the work library if you want to see the values of the mean and the percentiles. You can change these thresholds in the SAS macro definition to suit your requirements.
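The decision logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the author's macro: the name outliers_sketch, the choice of the Shapiro-Wilk test, the 0.05 significance level, and the underscore-prefixed statistic names are all assumptions. SAS only reports Shapiro-Wilk for samples of up to 2000 observations, so on larger datasets this sketch falls back to the percentile method.

```sas
/* Hypothetical sketch, not the author's macro. */
%macro outliers_sketch(dataset_name, variable_name);
    %local is_normal;
    %let is_normal = 0;

    /* Run the normality tests and save the ODS table as work.test. */
    ods output TestsForNormality=work.test;
    proc univariate data=&dataset_name normal;
        var &variable_name;
    run;

    /* Assumption: treat the variable as normal when the Shapiro-Wilk
       p-value exceeds 0.05. */
    data _null_;
        set work.test;
        where Test = 'Shapiro-Wilk';
        call symputx('is_normal', (pValue > 0.05));
    run;

    /* Save the mean, standard deviation, and 1st/99th percentiles. */
    proc means data=&dataset_name noprint;
        var &variable_name;
        output out=work.range mean=_mean std=_std p1=_p1 p99=_p99;
    run;

    /* Keep only the observations flagged as outliers. */
    data work.outliers;
        if _n_ = 1 then set work.range(keep=_mean _std _p1 _p99);
        set &dataset_name;
        /* Missing values sort below every number, so exclude them first. */
        if missing(&variable_name) then delete;
        %if &is_normal = 1 %then %do;
            if abs(&variable_name - _mean) > 3 * _std;
        %end;
        %else %do;
            if &variable_name < _p1 or &variable_name > _p99;
        %end;
        drop _mean _std _p1 _p99;
    run;
%mend outliers_sketch;

%outliers_sketch(sashelp.cars, MSRP);
```

To use different thresholds, change the multiplier 3 or the p1 and p99 statistics requested from PROC MEANS.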

C. Examples of Outliers ()

D. SAS Macro Definition Code for Outlier Detection

3. Lag Features

Lag features are normally derived from a time-series dataset. A lag feature holds the value of a variable from an earlier time period. “Lag features are the classical way that time series forecasting problems are transformed into supervised learning problems.” https://machinelearningmastery.com/basic-feature-engineering-time-series-data-python/

A. Lag_n SAS Custom Macro Explanation

%Lag_n (Dataset_name, Variable_name, start_time_period, end_time_period)

1. Dataset_name = The dataset, qualified with its SAS library name, for instance sashelp.cars.

2. Variable_name = Name of the variable from which you want to derive lag features.

3. Start_time_period = A non-negative integer (0, 1, 2, 3, etc.) giving the first lag to derive. For example, a value of 2 means the lag features start at lag_2.

4. End_time_period = A non-negative integer (0, 1, 2, 3, etc.) giving the last lag to derive. For example, a value of 3 means the lag features stop at lag_3.

B. What does Lag_n () do behind the scenes?

It creates a new temporary dataset named “WithLag” containing n lag variables. You control n through the lower and upper limits of the time period, which you specify in the start_time_period and end_time_period arguments of the Lag_n macro respectively.
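One way to sketch this behaviour is with the SAS LAGn functions. This is an illustration, not the author's macro: the name lag_n_sketch and the output variable names (<variable>_lag_<k>) are assumptions, and the input is assumed to be already sorted in time order, with no BY-group handling.

```sas
/* Hypothetical sketch, not the author's macro. Assumes the input dataset
   is already sorted in time order. */
%macro lag_n_sketch(dataset_name, variable_name, start_time_period, end_time_period);
    %local k;

    data work.withlag;
        set &dataset_name;
        %do k = &start_time_period %to &end_time_period;
            %if &k = 0 %then %do;
                /* Lag 0 is simply the current value. */
                &variable_name._lag_0 = &variable_name;
            %end;
            %else %do;
                /* LAGk returns the value from k observations earlier. */
                &variable_name._lag_&k = lag&k(&variable_name);
            %end;
        %end;
    run;
%mend lag_n_sketch;

/* sashelp.air holds a monthly series in the variable AIR;
   this call adds air_lag_1, air_lag_2, and air_lag_3. */
%lag_n_sketch(sashelp.air, air, 1, 3);
```

The first k observations of each lag_k variable are missing because no earlier value exists for them.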

C. Examples of Lag_n ()

D. SAS Macro Definition Code for Lag Features

4. Describe Table

The describe_table SAS custom macro gives you considerably more information than the Python pandas describe() function https://www.tutorialspoint.com/python_pandas/python_pandas_descriptive_statistics.htm#:~:text=The%20describe()%20function%20computes,pertaining%20to%20the%20DataFrame%20columns.&text=This%20function%20gives%20the%20mean,given%20summary%20about%20numeric%20columns. It reports the information about a dataset that a Data Analyst or Data Scientist typically needs to know before working with it.

A. describe_table () SAS Custom Macro Explanation

%describe_table (Dataset_name)

1. Dataset_name = The dataset, qualified with its SAS library name, for instance sashelp.cars.

B. What does describe_table () do behind the scenes?

First, it creates a report with the metadata of every variable, such as type, format, length, and label. Second, it produces a summary statistics table for all the numerical features. Finally, it creates frequency tables and frequency plots for all the categorical features.
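A report like the one described above could be assembled from three standard procedures. This is a sketch under assumptions, not the author's macro: the name describe_table_sketch and the choice of statistics are illustrative, and character variables are treated as the categorical features.

```sas
/* Hypothetical sketch, not the author's macro. Assumes that character
   variables are the categorical features. */
%macro describe_table_sketch(dataset_name);
    /* Variable names, types, lengths, formats, and labels. */
    proc contents data=&dataset_name varnum;
    run;

    /* Summary statistics for every numeric variable. */
    proc means data=&dataset_name n nmiss mean std min p25 median p75 max;
        var _numeric_;
    run;

    /* Frequency tables and frequency plots for every character variable. */
    ods graphics on;
    proc freq data=&dataset_name;
        tables _character_ / plots=freqplot;
    run;
%mend describe_table_sketch;

%describe_table_sketch(sashelp.cars);
```

Character variables with many distinct values produce very long tables and plots, so it may be worth restricting the TABLES statement to selected variables on wide datasets.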

C. Examples of describe_table ()

D. SAS Custom Macro Definition Code for Table Description

The original post is available at the link below:

https://seleritysas.com/blog/2020/12/10/sas-custom-macros-that-make-feature-engineering-easy-for-data-scientists-data-engineers-and-machine-learning-specialists/

suraj saini (Amar)

SAS Certified Programming Specialist, passionate about Machine Learning, Feature Engineering and Data Science.