Post 4 | ML | Data Preprocessing – Part 2

Hello everyone, and thanks for coming back for the next tutorial on the Data Preprocessing step of this Machine Learning series.

Just to refresh your memory: in the last tutorial, i.e. Part 1 of Data Preprocessing, we saw how to download the dataset and import the libraries we need. In this tutorial, we are going to see how to import that downloaded data in both Python and R.

Machine Learning: Data Preprocessing – Part 2

As you can see from the above infographic, we are looking at the Data Import section of Data Preprocessing in Machine Learning.

Let us start then.

  • IMPORTING DATASET IN PYTHON

Before importing the dataset we downloaded in the last tutorial, we need to set the working directory in the Spyder IDE.

You can do this from the File Explorer option, which is available in the top-right window of the Spyder IDE. Save the Python file you are working on in the folder that holds your Data.csv file; that folder then becomes the working directory for the Spyder IDE session.

The following screenshot shows how to do this.

Step 2: File Explorer

As you can see from the above screenshot, “C:\Users\User\Desktop\blog\3 ML Data Preprocessing” is the working directory for this tutorial. This directory contains our Data.csv input file along with the Python file(s).
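If you prefer to check or set the working directory from code rather than from the File Explorer, the standard os module can do it. The following is only a minimal sketch: the folder is a temporary, hypothetical stand-in for your own project folder, and the Data.csv it creates is an empty placeholder, not the tutorial's file.

```python
import os
import tempfile

# Hypothetical folder standing in for the tutorial's working directory.
# On Windows you would pass your own path as a raw string so that the
# backslashes stay literal, e.g. r"C:\path\to\your\folder".
project_dir = tempfile.mkdtemp()

# Create an empty placeholder so the existence check below has a file to find.
with open(os.path.join(project_dir, "Data.csv"), "w"):
    pass

previous_dir = os.getcwd()
os.chdir(project_dir)                   # set the working directory
current_dir = os.getcwd()               # confirm where Python is looking
has_data = os.path.isfile("Data.csv")   # relative paths resolve against it

print(current_dir)
print("Data.csv found:", has_data)

os.chdir(previous_dir)                  # restore the original directory
```

A quick check like this is handy when a relative filename such as Data.csv cannot be found, because it shows which folder Python is actually reading from.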

Once you do this, you can import the downloaded Data.csv file in Python with the help of the pandas library. We already imported pandas in the last tutorial, so we can use it directly to load the file into the Spyder IDE session.

You can use the following command to import this dataset into Python.

dataset = pd.read_csv('Data.csv')

This command reads the data stored in Data.csv into the dataset variable; pd is an alias for the pandas library. The following screenshot shows the execution of the above command.

Step 3: Loading the data

Once the command has run, the dataset variable appears in the Variable Explorer window on the right side of the Spyder IDE, as shown in the above screenshot.

If you double-click the dataset variable, a pop-up window opens and shows the data stored in it.
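The loading step can also be tried without the real file. The sketch below is an illustration under stated assumptions: only the Salary and Purchased column names come from this tutorial, while the other column names and every value are made-up placeholders, and an in-memory text buffer stands in for Data.csv.

```python
import io

import pandas as pd

# Made-up stand-in for Data.csv. Only the Salary and Purchased column
# names come from the tutorial; everything else is a placeholder.
csv_text = """Country,Age,Salary,Purchased
A,30,50000,No
B,40,60000,Yes
C,35,55000,No
"""

# read_csv accepts a path such as 'Data.csv' or any file-like object.
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.head())   # first rows, similar to the Variable Explorer view
print(dataset.shape)    # (number of rows, number of columns)
```

Printing the first rows and the shape is a rough code equivalent of opening the variable in the Variable Explorer.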

For this example, the Salary data is shown in float (decimal) format. Click the Format button to change how the values are displayed and you will see the difference; I changed the format from float to integer.
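Outside the Spyder viewer, you can inspect the column types in code as well. This is a small sketch with placeholder Salary values (not the tutorial's data); it separates changing how numbers are displayed from actually converting them.

```python
import pandas as pd

# Placeholder values, not the tutorial's data.
dataset = pd.DataFrame({"Salary": [50000.0, 60000.0]})

print(dataset.dtypes)   # shows that Salary is stored as a float column

# Display-only change: format each value as text without decimals.
# The underlying column is left untouched.
as_text = dataset["Salary"].map("{:.0f}".format)
print(as_text.tolist())

# Actual conversion: build a new integer-typed Series.
as_int = dataset["Salary"].astype(int)
print(as_int.tolist())
```

The first approach only affects how the numbers look; the second creates new integer data, which is a different operation.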

The next step is to create the matrix of features and the dependent variable.
Python uses zero-based indexing, i.e. indexing starts at 0, so we select the columns from index 0 up to, but not including, the last column.

The following command is used for performing this operation.

X = dataset.iloc[:, :-1].values

In the above command, the first colon (:) selects all the rows of the data, and the second colon followed by -1 (:-1) selects all the columns except the last one.
The .values attribute returns the values in those rows and columns as a NumPy array, which is then stored in the X variable.
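To see the slicing in action, here is a minimal sketch on a tiny placeholder table. Only the Salary and Purchased names come from this tutorial; the other names and all values are assumptions made for illustration.

```python
import pandas as pd

# Placeholder table with the last column as the output variable.
dataset = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Age": [30, 40, 35],
    "Salary": [50000.0, 60000.0, 55000.0],
    "Purchased": ["No", "Yes", "No"],
})

features = dataset.iloc[:, :-1]   # all rows, every column except the last
X = features.values               # the same data as a NumPy array

print(list(features.columns))     # the columns that were kept
print(type(X).__name__, X.shape)  # array type and (rows, columns)
```

The pandas documentation recommends the newer to_numpy() method as the equivalent of .values; either works for this step.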

This stores the matrix of features in X. Now it is time to store the dependent variable, i.e. the output variable, in Y.

For doing this, we use the following command.

Y = dataset.iloc[:, 3].values

The explanation of this command is similar to the last one: all the rows of the column at index 3, i.e. the fourth and last column, Purchased, are stored in the Y variable.
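Hard-coding the index 3 only works while the output is the fourth column. As a hedged illustration on placeholder data (values and the first three column names are made up), the sketch below shows that selecting the last column with -1, or by its name, gives the same result.

```python
import numpy as np
import pandas as pd

# Placeholder table with four columns; Purchased is the last one.
dataset = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Age": [30, 40, 35],
    "Salary": [50000.0, 60000.0, 55000.0],
    "Purchased": ["No", "Yes", "No"],
})

Y_by_position = dataset.iloc[:, 3].values    # fourth column, as in the tutorial
Y_from_end = dataset.iloc[:, -1].values      # last column, whatever its index
Y_by_name = dataset["Purchased"].values      # selected by column name

print(np.array_equal(Y_by_position, Y_from_end))
print(np.array_equal(Y_by_position, Y_by_name))
```

Using -1 or the column name keeps the code working if feature columns are added later.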

The following screenshot shows the execution of the above commands.

Step 4: Extracting Dependent and Independent Variables

As you can see, both the X and Y variables were created successfully from the dataset variable. X has 10 rows and 3 columns, whereas Y is a one-dimensional array of 10 values, one per row.
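The sizes can be confirmed in code too. The sketch below builds a 10-row placeholder table (the feature names and values are hypothetical, chosen only to match the 10-row, 4-column layout described above) and checks the shapes of X and Y.

```python
import pandas as pd

n_rows = 10  # same number of rows as described in the tutorial

# Placeholder table: three hypothetical feature columns plus the output.
dataset = pd.DataFrame({
    "feature_1": range(n_rows),
    "feature_2": range(n_rows),
    "feature_3": range(n_rows),
    "Purchased": ["No", "Yes"] * (n_rows // 2),
})

X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 3].values

print(X.shape)   # two-dimensional: (rows, columns)
print(Y.shape)   # one-dimensional: (rows,)
print(X.shape[0] == Y.shape[0])   # every row has exactly one output value
```

Note that Y comes back as a flat array rather than a single-column table, which is why its shape has only one entry.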

This completes the Data Import process in Python.

Now, let us look at the same process in R.

  • IMPORTING DATASET IN R

Believe me, the same Data Import process is much easier in R than in Python.

The first thing to do is to set the working directory. For that, we use the setwd() function.

Please use the following command to set the working directory.

setwd("C:\\Users\\User\\Desktop\\blog\\3 ML Data Preprocessing")

The execution of the above command looks as follows.

Step 1: Setting Working Directory

Change the path to match your own system. Once the above command has executed, run the following command to import the Data.csv file into a data frame called dataset.

dataset = read.csv("Data.csv")

Notice that we use the read.csv() function to import the CSV file into the dataset data frame. To confirm the import, view the dataset variable with the View() function.

You can use the following command to view the dataset variable.

View(dataset)

The output of the above three commands looks as follows.

Step 2: Importing the dataset

This completes the Data Import process in R.

We can conclude this tutorial here. I hope it helps you understand the basic concepts of Machine Learning.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.
