Post 3 | ML | Data Preprocessing – Part 1

In the last tutorial, we saw the installation steps for both R and Python along with their respective IDEs. In this tutorial, we start our actual Machine Learning journey with Data Preprocessing, which is one of the most important parts of Machine Learning.

It should not be overlooked, so I am going to spend some time explaining the concepts of Data Preprocessing.

The following infographic shows the steps we are going to cover in Data Preprocessing.

Machine Learning: Data Preprocessing

As you can see, it is a 7-step process. Based on the effort required, I have divided these seven steps into 6 posts so that each concept is explained clearly and in detail.

In this post, we are going to cover the first two steps of Data Preprocessing in Machine Learning. They are as follows.

  • Get the Dataset
  • Import the libraries

Let us go through each of these steps.


You can get the dataset for this tutorial from my GitHub profile, under the data-preprocessing branch of the Machine Learning repository. The file is named Data.csv and you can download it by clicking here. The Data.csv file looks something like this.

Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes


As you can see from the above Data.csv snippet, the file contains 4 columns and 10 records. The columns are Country, Age, Salary, and Purchased. The last column, Purchased, is a flag with the values Yes and No.

This Data.csv file is going to be very useful for explaining concepts like missing data, categorical variables, and feature scaling.
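If you would like to double-check these counts yourself, here is a minimal sketch that uses only Python's built-in csv module, so it runs before we import any library. The text in DATA_CSV is typed in by hand from the listing above, and the position of the two empty fields is my reading of that listing (40 as an Age, 52000 as a Salary), so please treat it as an assumption rather than a copy of the actual file.

```python
import csv
import io

# The ten records of Data.csv, typed in by hand from the listing in this post.
# Assumption: an empty field (two consecutive commas) marks a missing value.
DATA_CSV = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""

reader = csv.DictReader(io.StringIO(DATA_CSV))
records = list(reader)          # one dict per data row
columns = reader.fieldnames     # column names taken from the header row

# Cells that are empty, reported as (record number, column name).
missing = [
    (number, column)
    for number, row in enumerate(records, start=1)
    for column in columns
    if row[column] == ""
]

# Distinct values of the two text columns.
countries = sorted({row["Country"] for row in records})
flags = sorted({row["Purchased"] for row in records})

print("columns:", columns)
print("records:", len(records))
print("missing cells:", missing)
print("countries:", countries)
print("purchased values:", flags)
```

The two empty cells are what the Missing Data discussion will deal with, and the text columns Country and Purchased are the ones that will come up under Categorical variables.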

To get this dataset onto your system, download the file from the link given above and save it in a location that is easy for you to access. I am using Windows to execute all the commands in this tutorial series, so the steps shown assume Windows.

Once you have saved the dataset on your system, open it in MS Excel. It looks as follows.

Input Data CSV File

This confirms that the data was successfully downloaded and saved on your Windows system.
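If you prefer to confirm the download without Excel, the following is a small sketch that reads the header and counts the data rows with Python's built-in csv module. The function name summarize_csv and the path Data.csv are hypothetical; adjust the path to wherever you saved the file.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (header, number_of_data_rows) for the CSV file at `path`."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))
    if not rows:
        return [], 0
    return rows[0], len(rows) - 1


# Hypothetical location: assumes Data.csv sits in the current working directory.
data_path = Path("Data.csv")
if data_path.exists():
    header, n_rows = summarize_csv(data_path)
    print("header:", header)
    print("data rows:", n_rows)
else:
    print(f"{data_path} not found; adjust the path to where you saved the file")
```

For the file described in this post, you should expect the four column names in the header and 10 data rows.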

Now, it is time to import the required libraries in both Python and R.


When it comes to importing the required libraries, we need to perform this task in both Python and R. We will do it first in Python and then in R.


We are going to import the following three libraries in Python using the Spyder IDE.

  • numpy
  • matplotlib.pyplot
  • pandas

Let me explain the usage of each of the above libraries.

The “numpy” library provides mathematical operations in Python. Since Machine Learning involves mathematical operations from time to time, having numpy at hand saves a lot of time and makes the process smoother.

The “matplotlib.pyplot” library is used to plot charts and graphs in Python, as the name itself indicates. A visual representation of the data sometimes answers questions we had not even thought of asking, so it is highly recommended to have these charts and graphs at your disposal.

The “pandas” library is used for working with datasets in Python. Operations such as importing and managing a dataset are done with the help of this library.

To import these libraries in Python, we use the following commands in the Spyder IDE.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Here np, plt, and pd are aliases for the numpy, matplotlib.pyplot, and pandas libraries respectively.

The following screenshot shows the output of the above commands.
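To give a feel for how these aliases are used once the imports are done, here is a minimal sketch. It assumes all three libraries are installed, and the four sample rows are copied by hand from the first records of Data.csv; loading the real file is the topic of the next tutorial.

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# numpy (np): mathematical operations on arrays.
ages = np.array([44, 27, 30, 38])
mean_age = ages.mean()

# pandas (pd): holding and managing a dataset as a table.
df = pd.DataFrame(
    {
        "Country": ["France", "Spain", "Germany", "Spain"],
        "Age": ages,
        "Salary": [72000, 48000, 54000, 61000],
    }
)

# matplotlib.pyplot (plt): charts and graphs.
fig, ax = plt.subplots()
ax.scatter(df["Age"], df["Salary"])
ax.set_xlabel("Age")
ax.set_ylabel("Salary")
# In Spyder, call plt.show() here to display the chart.
# The figure is closed so the sketch also runs where no display is available.
plt.close(fig)

print("mean age:", mean_age)
print("table shape:", df.shape)
```

The short aliases are only a convention, but almost every tutorial and answer you will find online uses the same three names, so it is worth sticking to them.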

Step 1: Importing Libraries

The above screenshot shows that all three libraries were successfully imported into the Spyder IDE for further processing.
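If one of the imports fails with an error on your machine, the library is most likely not installed. The following sketch, which uses only Python's built-in importlib module, reports which of the three libraries Python can find. The helper name is_installed is hypothetical and is not part of any library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate `module_name`."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised for dotted names such as "matplotlib.pyplot"
        # when the parent package itself is missing.
        return False


required = ["numpy", "matplotlib.pyplot", "pandas"]
status = {name: is_installed(name) for name in required}

for name, available in status.items():
    print(f"{name}: {'available' if available else 'missing'}")
```

Any library reported as missing needs to be installed before the import commands above will work.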

This completes the Library import operation for Python.

Now, let us see how to perform the same in R.


Interestingly, we do not need to import any library in R for now. We will import libraries there whenever they are required, which makes this step simpler than in Python.

So, we can safely say that the library import step is complete for both Python and R.

This completes the tutorial. In the next tutorial, we are going to see how to import the recently downloaded dataset Data.csv into both Python and R.

I hope you are liking the content. Please share it across your network so that it reaches a wider audience.

Please stay tuned for further updates.




Published by milindjagre

I founded my blog four years ago and am currently working as a Data Scientist Analyst at the Ford Motor Company. I graduated from the University of Connecticut pursuing Master of Science in Business Analytics and Project Management. I am working hard and learning a lot of new things in the field of Data Science. I am a strong believer of constant and directional efforts keeping the teamwork at the highest priority. Please reach out to me at for further information. Cheers!
