Hello, everyone. Welcome to Part 4 of the Data Preprocessing section of the Machine Learning tutorials. In the last tutorial, we saw how to impute missing data in both Python and R. In this tutorial, we are going to see how to deal with qualitative (categorical) entries in the data.
The following infographic shows our progress in the Data Preprocessing section of the Machine Learning tutorials.
So we are almost there with the Data Preprocessing portion.
Let us dig into how to deal with Categorical Variables when building Machine Learning models.
There are plenty of reasons to convert Categorical Variables into Continuous Variables. The most important one is that most learning algorithms expect numeric input, and numeric features help the Machine Learning models build and run faster. I have worked on tools such as SAS-JMP and SAS-EM. The good/bad thing about these tools is that you get the work done, but you don’t get to know how they actually do it. Therefore, we should follow the traditional approach of building the Machine Learning models from scratch so that we have total control over the outcome of the model.
Having said that, let us start with Categorical Variable handling in both Python and R.
- CATEGORICAL VARIABLES CONVERSION IN PYTHON
For this very important task, Python offers two classes. The first one is “LabelEncoder” and the second is “OneHotEncoder”. Both of these classes are part of the “sklearn.preprocessing” module of scikit-learn.
Let us start with the LabelEncoder class.
You can import this class by running the following command.
from sklearn.preprocessing import LabelEncoder
Once LabelEncoder is imported, you can create an object for it as shown below.
labelencoder_X = LabelEncoder()
Once the labelencoder_X object is created, we need to perform the FIT and TRANSFORM operation on the column which contains the qualitative data. To refresh your memory, our data set looks like this.
As can be seen, we need to transform the COUNTRY and PURCHASED columns and convert these CATEGORICAL VARIABLES into CONTINUOUS VARIABLES. That means the columns at index 0 and 3 need transformation. The following command performs this transformation on the COUNTRY column (index 0) using LabelEncoder.
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
The above approach assigns the values 0, 1, and 2 to the three different countries, which is a valid encoding. However, it causes a problem when building Machine Learning models: when the model sees values such as 0, 1, and 2, it does not understand their context and treats them as ordinary integers. In a model with coefficients, the country encoded as 0 then contributes nothing to the prediction, and the country encoded as 2 is given twice the weight of the country encoded as 1.
Theoretically, such an ordering might make sense for categories that have a natural rank, but if we look at the data, that is hardly the case. Here, we are converting the COUNTRY name into an integer, and every country should have equal importance, which is violated by assigning values such as 0, 1, and 2.
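To make the ordering problem concrete, here is a minimal pure-Python sketch of what label encoding does. It is not scikit-learn's implementation; the helper name label_encode and the sample country list are assumptions made for illustration. Like LabelEncoder, it assigns the integer codes in sorted order of the distinct values.

```python
def label_encode(values):
    """Map each distinct value to an integer code, in sorted order."""
    classes = sorted(set(values))
    code_of = {label: code for code, label in enumerate(classes)}
    return [code_of[v] for v in values], classes


# Hypothetical sample, not the tutorial's actual data set.
countries = ["France", "Spain", "Germany", "Spain", "France"]
codes, classes = label_encode(countries)

print(classes)  # ['France', 'Germany', 'Spain']
print(codes)    # [0, 2, 1, 2, 0]

# The codes imply arithmetic relationships that the data does not have:
# Spain (2) is not "twice" Germany (1), and France (0) is not "nothing".
```

The codes depend only on alphabetical order, so they carry no information about the countries themselves.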
This can be done in a better way, known as DUMMY VARIABLE CREATION.
Here, instead of having only one column with three different values, we have three columns, each holding one of two values, i.e. 0 or 1. A 1 marks the country that applies to a given row, which makes it easy to read off the entry for that particular column.
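The idea of dummy variables can be sketched in a few lines of plain Python. This is only an illustration under assumed sample data; the helper name one_hot is hypothetical and is not part of scikit-learn.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 column per distinct value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical sample, not the tutorial's actual data set.
countries = ["France", "Spain", "Germany"]
categories, rows = one_hot(countries)

print(categories)  # ['France', 'Germany', 'Spain']
for country, row in zip(countries, rows):
    print(country, row)
# France [1, 0, 0]
# Spain [0, 0, 1]
# Germany [0, 1, 0]
```

Each row contains exactly one 1, so no country is ranked above another.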
Let us see how to perform the DUMMY VARIABLE CREATION in Python. We use the OneHotEncoder class from the sklearn.preprocessing module for doing this.
You can import this class in Python using the following command.
from sklearn.preprocessing import OneHotEncoder
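Once the class is imported, we need to create an object of the OneHotEncoder class. You can use the following command to create this object.
onehotencoder = OneHotEncoder(categorical_features = [0])
Here, as you can see, we are passing index 0, i.e. the first (COUNTRY) column, as the categorical feature. Note that the categorical_features parameter was removed in scikit-learn 0.22, so this command only works on older versions; on newer versions, the usual replacement is to wrap OneHotEncoder in a ColumnTransformer that selects column 0.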
Once the object is created, we can perform the FIT and TRANSFORM operation to convert the CATEGORICAL variable into the CONTINUOUS variable. The following command is used for performing this type conversion.
X = onehotencoder.fit_transform(X).toarray()
As you can see in the above command, the variable X, the feature matrix defined in the earlier section of this tutorial with COUNTRY as its first column, is sent to the FIT and TRANSFORM operation, and the sparse result is then converted to an array. This converts the COUNTRY variable from a CATEGORICAL variable to a CONTINUOUS variable.
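To show what the resulting feature matrix looks like, here is a rough pure-Python sketch that one-hot encodes one column and leaves the others untouched, with the dummy columns placed first. It only imitates the layout described above and is not scikit-learn code; the helper name one_hot_column, the two numeric columns, and all values are assumptions.

```python
def one_hot_column(rows, index):
    """Replace the column at `index` with 0/1 dummy columns, placed first."""
    categories = sorted({row[index] for row in rows})
    encoded = []
    for row in rows:
        dummies = [1.0 if row[index] == c else 0.0 for c in categories]
        rest = [v for i, v in enumerate(row) if i != index]
        encoded.append(dummies + rest)
    return categories, encoded


# Hypothetical rows: the country followed by two made-up numeric columns.
X = [
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
]
categories, X_encoded = one_hot_column(X, 0)

print(categories)  # ['France', 'Germany', 'Spain']
for row in X_encoded:
    print(row)
# [1.0, 0.0, 0.0, 44.0, 72000.0]
# [0.0, 0.0, 1.0, 27.0, 48000.0]
# [0.0, 1.0, 0.0, 30.0, 54000.0]
```

The single COUNTRY column becomes three columns, so the matrix grows from three to five columns in this sample.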
Now, let us convert the response variable y in a similar fashion.
As you can see from the data snippet below, the response variable PURCHASED contains only two types of values: No and Yes.
When a column contains only two types of values, a single 0/1 column is enough, so LabelEncoder is recommended over OneHotEncoder. Hence, for converting the response variable PURCHASED, we are going to use the following commands.
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
In the above commands, y is the RESPONSE variable which we have already defined in the initial section of this tutorial. The fit_transform() method is responsible for converting the RESPONSE variable y from the CATEGORICAL type to the CONTINUOUS type.
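For a two-valued response, the encoding reduces to a simple lookup. The sketch below is a plain-Python illustration with assumed sample values; the helper name encode_binary is hypothetical. It also shows how the original labels can be recovered from the codes.

```python
def encode_binary(values):
    """Encode a two-valued column as 0/1, with codes in sorted label order."""
    classes = sorted(set(values))
    if len(classes) != 2:
        raise ValueError("expected exactly two distinct values")
    return [classes.index(v) for v in values], classes


# Hypothetical sample, not the tutorial's actual data set.
y = ["No", "Yes", "No", "No", "Yes"]
y_encoded, classes = encode_binary(y)

print(classes)    # ['No', 'Yes']
print(y_encoded)  # [0, 1, 0, 0, 1]

# The codes index back into `classes`, so the labels are recoverable.
decoded = [classes[code] for code in y_encoded]
print(decoded == y)  # True
```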
The following screenshot gives you an idea about the above code execution results.
As you can see from the above screenshot, both the COUNTRY and the PURCHASED variables were converted to CONTINUOUS variables with values of either 0 or 1.
This completes the categorical variable conversion in Python.
Now, let us see how the same process can be done using the R programming language.
- CATEGORICAL VARIABLES CONVERSION IN R
In R, surprisingly, it is very easy to perform this variable type conversion. Both the predictor and the response variables are converted using the factor() function. The difference from the Python approach above is that in R, we specify which STRING value should be converted to which INTEGER value. For example, we state whether the country “France” should be converted to 1, 2, 3, and so on.
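For readers following the Python track, the explicit value-to-integer mapping described above can be sketched as a plain dictionary lookup. This is only an analogy and not a port of R's factor(); the helper name recode and the sample list are assumptions.

```python
def recode(values, levels, labels):
    """Map each value in `levels` to the label at the same position."""
    if len(levels) != len(labels):
        raise ValueError("levels and labels must have the same length")
    mapping = dict(zip(levels, labels))
    # Values not listed in `levels` become None, much as factor() yields NA.
    return [mapping.get(v) for v in values]


# Hypothetical sample, not the tutorial's actual data set.
countries = ["France", "Spain", "Germany", "Spain"]
print(recode(countries, ["France", "Spain", "Germany"], [1, 2, 3]))
# [1, 2, 3, 2]
```

Unlike the sorted order used earlier, the order of the codes here is entirely the caller's choice.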
If you remember, our input data (before the missing data imputation) looks like this in RStudio.
Here, the variables Country and Purchased are converted from the CATEGORICAL STRING format to the INTEGER format with the help of the following two lines.
dataset$Country = factor(x = dataset$Country, levels = c('France', 'Spain', 'Germany'), labels = c(1, 2, 3))
dataset$Purchased = factor(x = dataset$Purchased, levels = c('No', 'Yes'), labels = c(0, 1))
Executing the above two commands gives us the following output.
This completes the Categorical variable conversion using R.
I hope the content and screenshots help you understand these Machine Learning Data Preprocessing concepts.