Hello, everyone. Thanks for coming back for the third part of the Data Preprocessing section of the Machine Learning tutorial series.
In the last tutorial, i.e. Part 2, we saw how to import the downloaded dataset. In this tutorial, we are going to see how to impute the missing data in the input data.
The following infographic might come in handy to refresh your memory.
As you can see, after this tutorial, we are left with the Categorical variable conversion, the creation of train and test dataset, and feature scaling. Let us start with the missing data imputation, then.
It is time for some theory, and then we will see how to implement it in both Python and R.
- BEST PRACTICES: MISSING DATA IMPUTATION
In simple words, imputing missing data means replacing it with a substitute value. Generally, the following are the best practices when it comes to missing data imputation.
- Mean Substitution: replacing missing data with the mean of the column
- Median Substitution: replacing missing data with the median of the column
- Regression Substitution
- Multiple Imputation Methods
Apart from the above four, there are a number of other methods that can be used for imputing the missing data.
For this tutorial, we are going to use the first method listed above: we will replace the missing data with the mean (average) of the column, which is the simplest of these methods to apply. Note that this method applies only to numeric columns, not to columns containing text.
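To make the difference between the first two methods concrete, here is a minimal sketch in plain Python. The helper name impute and the salary values are hypothetical and chosen only for illustration; missing entries are represented by None.

```python
from statistics import mean, median


def impute(column, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in column]


# Hypothetical column with one missing entry and one unusually large value.
salaries = [40.0, 50.0, None, 60.0, 250.0]

mean_filled = impute(salaries, "mean")      # gap filled with the column mean
median_filled = impute(salaries, "median")  # gap filled with the column median

print(mean_filled)
print(median_filled)
```

With a large value in the column, the mean is pulled upward while the median is not, which is one reason median substitution is sometimes preferred.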
Let us see how to perform it in both Python and R.
- MISSING DATA IMPUTATION IN PYTHON
The process of Missing Data Imputation in Python is quite involved and we need to go through a lot of steps to achieve it. So, let’s jump into it then.
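LIBRARY: We are going to use the Imputer class present in the preprocessing package, which is a part of the sklearn library in Python. Note that this class exists only in older scikit-learn releases; it was deprecated in version 0.20 and removed in 0.22 in favour of SimpleImputer from sklearn.impute. For a version that still ships it, the following command imports the Imputer class from the sklearn library.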
from sklearn.preprocessing import Imputer
The above command makes sure that the Imputer class is imported and from this command onward, you can create the object of the Imputer class.
In Python, as in many object-oriented programming languages, an object is created by calling the class. The general form is as follows.
object_name = Class_name()
In our case, we are going to create the object of the Imputer class by using the following command.
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
Now, it seems like a lot of information to process. Therefore, let me explain the above command.
We are passing the following three options while creating an object of the Imputer class.
- missing_values: This option specifies which value should be treated as missing. In our case, the placeholder string 'NaN' tells the Imputer to treat NaN (not-a-number) entries as missing, which will be visible from the screenshot to come.
- strategy: We are passing ‘mean’ value for this option. It takes three values in Python which are as follows.
  - mean: This is the default value. The mean of the column will replace the missing value (NaN in our case).
  - median: The median of the column will replace the missing value.
  - most_frequent: The most frequent value will replace the missing value for the respective column.
- axis: We are passing 0 for this option. It takes two values, either 0 or 1, where 0 means the imputation is done along columns (each missing value is replaced using the mean of its column), whereas 1 means it is done along rows.
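The effect of the axis option can be sketched with NumPy alone. This is not the Imputer class itself, only an illustration of column-wise versus row-wise mean filling; the matrix values are hypothetical.

```python
import numpy as np

# Hypothetical 3x2 numeric matrix with a single missing entry.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, 30.0]])

col_means = np.nanmean(X, axis=0)  # one mean per column, NaN entries ignored
row_means = np.nanmean(X, axis=1)  # one mean per row, NaN entries ignored

# axis=0 behaviour: fill each gap with the mean of its column.
by_column = np.where(np.isnan(X), col_means, X)

# axis=1 behaviour: fill each gap with the mean of its row.
by_row = np.where(np.isnan(X), row_means[:, None], X)

print(by_column)
print(by_row)
```

For a feature matrix where each column is a different variable (age, salary, and so on), column-wise filling is usually the sensible choice, since a row mean would mix unrelated quantities.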
Once we create an object of the Imputer() class, we need to fit this imputer object to the feature matrix variable X, which we created in the last tutorial. We can use the following command to fit the imputer object to X.
imputer = imputer.fit(X[:, 1:3])
As you can see from the above command, the fit() method performs this fitting operation, and X[:, 1:3] is the only argument we pass to it. As already explained, Python uses 0-based indexing, and the upper bound of a slice is excluded, so 1:3 selects columns 1 and 2. The missing data imputation is therefore done for those two columns.
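The slicing rule can be checked on a small array. The matrix below is hypothetical and only serves to show which columns 1:3 selects.

```python
import numpy as np

# Hypothetical matrix with four columns, indexed 0 to 3.
X = np.arange(12).reshape(3, 4)

# All rows, columns 1 and 2; the upper bound 3 is excluded.
selected = X[:, 1:3]

print(selected.shape)
print(selected)
```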
Once the FIT operation is performed, the last activity we have to perform is the TRANSFORMATION and we use the transform() method for doing that.
Please use the following command to perform the TRANSFORMATION.
X[:, 1:3] = imputer.transform(X[:, 1:3])
The input to transform() method is similar to that of the fit() method. The transformed output is stored back into the X variable.
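For readers on a recent scikit-learn release, where the Imputer class is no longer available, the same fit and transform steps can be sketched with SimpleImputer from sklearn.impute. SimpleImputer has no axis option and always works column-wise. The age and salary values below are hypothetical and are not the tutorial's dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns (age, salary) with one gap in each.
X_num = np.array([[44.0, 72000.0],
                  [np.nan, 48000.0],
                  [30.0, np.nan]])

# Treat np.nan as missing and replace it with the column mean.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit() learns the column means; transform() fills the gaps with them.
imputer = imputer.fit(X_num)
X_filled = imputer.transform(X_num)

print(imputer.statistics_)  # the learned per-column means
print(X_filled)
```

In the tutorial's setting, the result would be written back into the numeric columns of X in the same way as shown above, i.e. by assigning to X[:, 1:3].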
The following screenshot shows this entire process of Missing Data Imputation.
From the above screenshot, you can see that the missing value ("NaN") in the Salary column was replaced by the Salary mean, approximately 63777.78.
Finally, this completes the MISSING DATA IMPUTATION in PYTHON.
After Python, it is time to perform the same process in R.
- MISSING DATA IMPUTATION IN R
Trust me, I have been saying for some time now that, so far, performing Machine Learning operations has been easier in R than in Python. The same goes for this tutorial as well.
A simple ifelse() call and a function called is.na() are used for performing this imputation.
The function is.na() checks each entry of the column and returns TRUE where the value is missing (NA) and FALSE where it isn't.
Therefore, in the ifelse() statement, wherever the condition is TRUE, we impute the mean of the column, and wherever it is FALSE, the original value is kept.
We are going to use the following R commands to perform the missing data imputation for Age and Salary columns in the input dataset variable.
dataset$Age = ifelse(is.na(dataset$Age) , mean(dataset$Age, na.rm=TRUE), dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary) , mean(dataset$Salary, na.rm=TRUE), dataset$Salary)
A few comments about the above code snippet:
- The Age and Salary columns in the dataset variable can be accessed as dataset$Age and dataset$Salary.
- The condition is.na(dataset$Age) returns TRUE for each entry where the Age is missing (NA) and FALSE where it isn't.
- The condition is.na(dataset$Salary) returns TRUE for each entry where the Salary is missing (NA) and FALSE where it isn't.
- While calculating the mean of the Age and Salary columns, the na.rm=TRUE option tells R to remove the missing entries first. If those entries are not removed, the mean of each column will itself be NA, which is not what we want.
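For comparison, the same ifelse() and is.na() logic can be sketched in Python with NumPy, where np.where plays the role of ifelse(), np.isnan the role of is.na(), and np.nanmean the role of mean(..., na.rm=TRUE). The age values are hypothetical.

```python
import numpy as np

# Hypothetical age column with one missing entry.
age = np.array([40.0, 30.0, np.nan, 35.0])

# Without skipping the missing entry, the mean itself is missing,
# just as mean() without na.rm=TRUE returns NA in R.
plain_mean = np.mean(age)

# Where the entry is missing, use the mean of the observed values;
# otherwise keep the original value.
age_filled = np.where(np.isnan(age), np.nanmean(age), age)

print(plain_mean)
print(age_filled)
```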
Once we run the above commands, you can use the View() command to view the dataset variable, i.e. View(dataset).
The execution of the above commands looks like the following.
As you can see from the above screenshot, the missing value in the Age column was replaced by 38.77 (for Spain) and for the Salary column by 63777.78 (for Germany).
This concludes the Missing Data Imputation in R.
Finally, we come to the end of this tutorial. I hope the code snippets, explanations, and screenshots help you understand the concepts related to this topic.