Hello, everyone. Thanks for joining me in this 5th tutorial of the Data Preprocessing part of the Machine Learning tutorials.
In the last tutorial, we saw how to convert the CATEGORICAL VARIABLES from the STRING format to an INTEGER format. In this tutorial, we are going a step further and are going to split the original data set into two data sets. These data sets are called the TRAINING DATASET and the TEST DATASET.
The following infographic shows the progress we have made so far.
As you can see, we are left with the last two tutorials in the Data Preprocessing for Machine Learning after which the fun stuff is going to start.
Before creating the TEST and TRAIN data sets from the input data, we should understand why we are creating them.
As this blog is about Machine Learning, our final aim is to create a machine that learns. In this process, we are going to build models based on Machine Learning algorithms. These models are built on a data set called the Training data set. Once built, a model is evaluated against a different data set, called the Test data set.
This process of model building is iterative, and it takes some time to attain a certain accuracy. Until a certain accuracy threshold is reached on the test data set, we keep rebuilding the model, tweaking one or more variables each time.
This importance is why the topic deserves an independent section such as this one in the Data Preprocessing part of the Machine Learning tutorials.
Let us start with the creation of Train and Test datasets using both Python and R.
- CREATING TRAIN AND TEST DATASETS IN PYTHON
As seen in the last tutorial, the variables X and y contain the independent variables and the response variable respectively. To refresh your memory, the following is the data contained in the X and y variables.
Now that we are clear with X and y variables, let us see how to divide these records into TRAIN and TEST data sets.
For doing this, we can use the train_test_split function from the sklearn.model_selection module (in older versions of scikit-learn it lived in sklearn.cross_validation, which has since been removed). This import operation is done with the help of the following command.
from sklearn.model_selection import train_test_split
Once the import operation is done, creating TRAIN and TEST data sets is very easy in Python. It can be done with the following command.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
As can be seen from the above command, we need to pass the four variable names, i.e. X_train, X_test, y_train, and y_test, on the left-hand side, because the train_test_split() function returns four objects in exactly this order: the training and test portions of X, followed by the training and test portions of y.
We need to pass the X and y variables along with the proportion for the TRAIN and TEST data sets. Here, we are passing test_size=0.2, which indicates that out of the total number of records, 80% should go to the TRAINING data set whereas the remaining 20% should go to the TEST data set. This proportion is up to you, but business logic also needs to be considered while determining it. If this value is not passed, scikit-learn uses a default test size of 0.25.
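To make the behaviour above concrete, here is a minimal, self-contained sketch. The tiny arrays below are made up for illustration, and the random_state argument (not used in the tutorial's command) simply makes the shuffle reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# A toy data set of 10 records: 2 feature columns and a binary response.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# test_size=0.2 sends 20% of the records (2 of 10) to the TEST set.
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape)  # (8, 2) -> 8 training records
print(X_test.shape)   # (2, 2) -> 2 test records
```

Note that train_test_split shuffles the records before splitting, so the 8 training rows are a random sample, not simply the first 8 rows.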
If the explanation is clear, let us see how the variables X_train, X_test, y_train, and y_test look.
As can be seen from the above screenshot, a total of 8 records (80%) are part of the TRAIN data set, whereas the remaining 2 (20%) are part of the TEST data set.
This completes the TRAIN and TEST dataset creation in Python. Let us look at the same process in R.
- CREATING TRAIN AND TEST DATASETS IN R
In R, we can use the library called “caTools” for performing this operation. Before using the “caTools” library, we need to install it.
For installing the caTools library, we have to use the following command.
install.packages("caTools")
Once the installation is done, we need to load this library into the R environment. For doing this, we use the following command.
library(caTools)
The above command makes sure that the library is properly loaded into the R session so that we can perform the desired activity, i.e. creating the TRAIN and TEST data sets.
If you remember, we had our data set stored in the variable called “dataset”. To refresh your memory, it looked as follows.
As can be seen from the above screenshot, there are three independent variables and one response variable called “Purchased” in the dataset variable.
The process of creating TRAIN and TEST datasets in R is dependent on the Response variable. Therefore, we will use the “Purchased” variable while creating these TRAIN and TEST data sets. The following command is used for creating this SPLIT.
split = sample.split(y = dataset$Purchased, SplitRatio = 0.8)
In the above command, the function sample.split() expects the variable on which the split operation needs to be performed and the split ratio.
Note the difference between the split ratios in Python and R. In Python, the ratio (test_size) indicated the portion reserved for the TEST data set, whereas in R, SplitRatio indicates the portion that goes to the TRAINING data set. Also, in R, we pass only the response variable, whereas in Python we passed both the independent and response variables; sample.split() uses the response so that the class proportions stay similar in both subsets.
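Splitting on the response variable, as sample.split() does, keeps the class proportions similar in the training and test subsets. scikit-learn offers the same behaviour through the stratify parameter of train_test_split; the short sketch below (with a made-up, imbalanced y) illustrates it:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up, imbalanced response: 8 zeros and 2 ones.
X = np.arange(10).reshape(10, 1)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# stratify=y keeps the 80/20 class proportions in both subsets,
# similar to what sample.split() does on the response in R.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

# Each half of 5 records gets 4 zeros and exactly one '1'.
print(sorted(y_train))
print(sorted(y_test))
```

Without stratify, a small split could easily end up with both '1' records on one side, which would make the evaluation on the test set misleading.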
Now we have the split variable, which contains a TRUE/FALSE flag for each record, marking whether that record belongs to the TRAINING set. This split variable can be used for creating the TRAINING and TEST data sets. We use the following commands to create these data sets.
training_set = subset(dataset , split == TRUE)
test_set = subset(dataset , split == FALSE)
As can be seen from the above code snippet, the subset() function is used for creating the TRAIN and TEST data sets. We pass the data set along with a condition on the split flag. Records where the flag is TRUE go to the TRAINING data set, and records where it is FALSE go to the TEST data set.
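The same flag-based subsetting can be mimicked in Python with a boolean mask. This is only a rough analogue of R's subset() call, shown to illustrate the idea; the record values and the flag vector below are made up:

```python
import numpy as np

# Made-up data set of 10 records.
dataset = np.arange(10) * 10

# One boolean flag per record, like the 'split' vector from sample.split():
# True marks a training record, False a test record.
split = np.array([True, True, False, True, True,
                  True, False, True, True, True])

# Boolean indexing plays the role of R's subset(dataset, split == TRUE);
# the ~ operator inverts the mask, like split == FALSE.
training_set = dataset[split]
test_set = dataset[~split]

print(len(training_set))  # 8 training records
print(len(test_set))      # 2 test records
```

Each record lands in exactly one of the two subsets, which is the property we rely on when evaluating a model on the test set.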
The following screenshot shows both the TRAINING and TEST data set which we created using R. We have used the same commands mentioned above for creating these data sets.
And with this, we can conclude the portion to create the TRAIN and TEST data sets in both Python and R.
I hope you guys liked the content.
In the next tutorial, we are going to conclude the Data Preprocessing part of the Machine Learning tutorials with a focus on Feature Scaling.
Till then, stay tuned! And keep following my blog.