Post 7 | ML | Data Preprocessing – Part 6

Hello, everyone. Welcome to the last tutorial in the Data Preprocessing portion of the Machine Learning tutorials. In the previous tutorial, we saw how to create the TRAINING and TEST data sets for model building. In this tutorial, we are going to see why and how to perform Feature Scaling.

Let us begin, then. To refresh your memory, the following infographic shows our progress so far.

Machine Learning: Data Preprocessing

This is one of the most important topics in the data preprocessing section, as it puts variables measured on very different scales on a comparable footing.

We will begin by explaining what Feature Scaling is, then why we do it, and finally how to do it in both Python and R.

  • WHAT IS FEATURE SCALING?

In simple words, Feature Scaling means bringing two or more variables into the same range. It is highly recommended to perform Feature Scaling as one of the steps in Data Preprocessing.

I know of the following two ways to perform Feature Scaling.

  • Standardization

We can perform Feature Scaling by converting the input variable into a standardized variable, that is, one with a mean of 0 and a standard deviation of 1. This is called Standardization. The formula for doing this is as follows.

new_var = ( old_var - mean(old_var) ) / stddev(old_var)

where,
new_var -> the new variable to be created
old_var -> the original variable to be standardized
mean(old_var) -> mean of the original variable
stddev(old_var) -> standard deviation of the original variable
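
The formula above can be sketched in plain Python as follows. This is only an illustration: the function name standardize and the sample ages are made up for the example, and the sketch assumes the population standard deviation (dividing by n); dividing by n - 1 is an equally common convention.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return (x - mean) / stddev for each x, using the population stddev."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant variable has no spread, so the formula is undefined.
        raise ValueError("cannot standardize a constant variable")
    return [(x - mu) / sigma for x in values]

ages = [25, 35, 45, 55]          # hypothetical AGE values
scaled_ages = standardize(ages)  # mean becomes 0, standard deviation becomes 1
print(scaled_ages)
```

After standardization, the values are expressed as "number of standard deviations away from the mean", which is why variables with very different units become comparable.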

  • Normalization

This is the other method of performing Feature Scaling. Instead of using the mean and the standard deviation of the variable, we use its minimum and maximum values, so every scaled value falls between 0 and 1. By doing this, we narrow the gaps between multiple variables and bring them onto the same scale.

The formula for Normalization is as follows.

new_var = ( old_var - min(old_var) ) / ( max(old_var) - min(old_var) )

where,
new_var -> the new variable to be created
old_var -> the original variable to be normalized
min(old_var) -> the minimum value of the original variable
max(old_var) -> the maximum value of the original variable
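
As a rough sketch, the Normalization formula can be written in plain Python like this. The function name normalize and the salary figures are assumptions made up for the example.

```python
def normalize(values):
    """Return (x - min) / (max - min) for each x, so results lie in [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable would cause a division by zero.
        raise ValueError("cannot normalize a constant variable")
    return [(x - lo) / (hi - lo) for x in values]

salaries = [40000, 55000, 70000, 100000]  # hypothetical SALARY values
scaled_salaries = normalize(salaries)
print(scaled_salaries)  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value always maps to 0 and the largest to 1; everything else lands proportionally in between.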

Now, if these two concepts are clear, let us see why we actually perform Feature Scaling.

  • WHY DO WE PERFORM FEATURE SCALING?

As we all know, Machine Learning models work on numerical information. The variables which are part of these models are nothing but numbers. Variables containing string or character information have to be converted into a numeric format, either by us during preprocessing or by the library, before the model can perform its prediction task.

The problem is that if the scale of two variables is not the same, then the variable with the larger values will have more impact on the model than the other one, which might hamper the prediction task. This matters most for models that rely on distances between observations.

For example, in the current scenario, the two variables in question are SALARY and AGE. The range and scale of these two variables differ a lot. Since SALARY has much larger values, it will have more impact on the model than the AGE variable. But that should not be the case: domain knowledge suggests that both SALARY and AGE are equally important in model building.
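
One way to see this effect is to compute the Euclidean distance between observations, which several kinds of models rely on. The sketch below is only an illustration: the three people, the value ranges, and the helper names are all made up, and min-max scaling is used just because it is easy to follow.

```python
from math import dist

# Hypothetical (AGE, SALARY) pairs, made up for illustration.
person_a = (25, 50000)
person_b = (60, 50500)   # very different age, similar salary
person_c = (26, 58000)   # similar age, different salary

# On the raw values, the SALARY difference dominates the distance,
# so the 35-year age gap between a and b barely counts.
raw_ab = dist(person_a, person_b)
raw_ac = dist(person_a, person_c)

# Assumed value ranges, used only for this example.
AGE_RANGE = (20, 70)
SALARY_RANGE = (30000, 100000)

def min_max(value, lo, hi):
    """Scale one value into [0, 1] given the range of its variable."""
    return (value - lo) / (hi - lo)

def scale_person(person):
    age, salary = person
    return (min_max(age, *AGE_RANGE), min_max(salary, *SALARY_RANGE))

# After scaling, both variables contribute to the distance.
scaled_ab = dist(scale_person(person_a), scale_person(person_b))
scaled_ac = dist(scale_person(person_a), scale_person(person_c))

print(raw_ab < raw_ac)        # True: raw data says a is closer to b
print(scaled_ab > scaled_ac)  # True: scaled data says a is closer to c
```

On the raw data, the person who is 35 years older looks like the nearest neighbour simply because the salaries are close. Once both variables are on the same scale, the age gap is no longer drowned out.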

The following input data snippet shows you the difference in the scale between SALARY and AGE variables.

Input CSV file

Now that we have first-hand knowledge of the What and Why of Feature Scaling, let us answer the How part of the question.

We will perform Feature Scaling in both Python and R.

  • FEATURE SCALING USING PYTHON

For performing Feature Scaling in Python, we use the StandardScaler class from the preprocessing module of the sklearn library. We can import this class with the following command.

from sklearn.preprocessing import StandardScaler

After the import, we have to create an object of StandardScaler; that object is then used to perform the Feature Scaling operations. We can use the following command to create this object.

sc_X = StandardScaler()

The newly created object sc_X is used for performing the Feature Scaling. We will perform this operation on X_train and X_test variables. But, before doing that, let us take a look at these variables.

Train and Test dataset

As you can see from the above screenshot, the scale of AGE and SALARY variables is significantly different. Now, let us try to scale these variables.

We use the following commands to perform this Feature Scaling.

X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

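You might be wondering why the two commands in the above code snippet differ. The fit_transform call on X_train does two things: it FITS the sc_X object, that is, it computes the mean and standard deviation of each column of the training data, and then it TRANSFORMS the training data with those values. Since sc_X is already fitted at that point, we only call transform on X_test, which scales the test data with the same training-set mean and standard deviation. We do not fit again on the test set, so that both sets are scaled in exactly the same way and no information from the test data leaks into the preprocessing.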
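
To make the fit/transform split concrete, here is a minimal sketch for a single column. SimpleStandardScaler is a hypothetical stand-in written for this example, not sklearn's actual implementation, and the age values are made up. It only mirrors the idea that fit learns the mean and standard deviation, while transform reuses them.

```python
from statistics import mean, pstdev

class SimpleStandardScaler:
    """Hypothetical single-column stand-in for the fit/transform idea."""

    def fit(self, values):
        # Learn the scaling parameters from the data passed in.
        self.mean_ = mean(values)
        self.scale_ = pstdev(values)
        return self

    def transform(self, values):
        # Apply the parameters learned in fit(); nothing is re-estimated here.
        return [(x - self.mean_) / self.scale_ for x in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)

train_ages = [20, 30, 40, 50]   # hypothetical training values
test_ages = [35, 60]            # hypothetical test values

scaler = SimpleStandardScaler()
train_scaled = scaler.fit_transform(train_ages)  # fit on training data only
test_scaled = scaler.transform(test_ages)        # reuse training mean/stddev

print(scaler.mean_)    # mean learned from the training values
print(test_scaled)
```

The test value 35 equals the training mean, so it is scaled to 0, even though it is not the mean of the test values. That is the point of calling only transform on the test set.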

If the above explanation is understood, please have a look at the following screenshot which shows the values of X_train and X_test variables after performing the Feature Scaling.

Feature Scaling in Python

As can be seen from the above screenshot, SALARY and AGE variables (column 3 and 4) are converted to the same scale, which is good for Machine Learning models.

This completes the Feature Scaling operation using Python. Let us perform the same thing using R.

  • FEATURE SCALING USING R

Performing Feature Scaling in R is easier than in Python.

Since we are going to scale only the SALARY and AGE variables, we must pass only these variables to the scale() function. If you mistakenly pass a variable with string or character values, the scale() function returns an error saying ‘x’ must be numeric.

We use the following command to perform this Feature Scaling.

training_set[, c(2,3)] = scale(training_set[, c(2,3)])
test_set[, c(2,3)] = scale(test_set[, c(2,3)])

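As you can see, the indices 2 and 3 are for the AGE and SALARY variables, respectively. Note that, unlike the Python code above, these commands scale the test set using its own mean and standard deviation rather than those of the training set.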

The output of the above two commands is as follows.

Feature Scaling in R

As you can see, the variables AGE and SALARY are on the same scale now, as expected.

This completes the Feature Scaling using R.

And with this, I am glad to inform you that we are done with the data preprocessing part of the Machine Learning tutorials.

In the next tutorial, we are going to start off with REGRESSION. In that portion, we are going to focus on the following topics.

  1. Simple Linear Regression
  2. Multiple Linear Regression
  3. Polynomial Regression
  4. Support Vector Regression (SVR)
  5. Decision Tree Regression
  6. Random Forest Regression
  7. Evaluating Regression Models Performance

Hope you guys like the contents. Please stay tuned and follow my blog.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.
