Post 1 | CCA175 | Introduction

Let’s kick off the new series of blog posts starting today.

In this series, we are going to prepare ourselves for the CCA Spark and Hadoop Developer Exam (CCA175 certification).

Before starting with the actual tutorials, there are a few things about the exam that we need to know, listed below.

Cost: $295 (The cost itself makes it very important that we take this certification exam seriously)
Duration: 2 hours
Passing Criteria: 70% of the total questions asked

The CCA175 exam syllabus is divided into the following four sections.

  1. Data Ingest
  2. Transform, Stage, and Store
  3. Data Analysis
  4. Configuration

The above categories contain specific tasks that we should be familiar with in order to pass the certification with flying colors.

The tasks in each category are as follows.

DATA INGEST
Bringing data into the Hadoop ecosystem

Our focus here is to bring/transfer data into the Hadoop ecosystem. The source system can be within our own environment or can belong to a third party.

This section expects you to have the skills to transfer data between external systems and your cluster, which includes the following (a sample Sqoop command follows the list):

  • Import data from a MySQL database into HDFS using Sqoop
  • Export data to a MySQL database from HDFS using Sqoop
  • Change the delimiter and file format of data during import using Sqoop
  • Ingest real-time and near-real-time streaming data into HDFS
  • Process streaming data as it is loaded onto the cluster
  • Load data into and out of HDFS using the Hadoop File System commands
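
To give a concrete feel for the first three tasks, here is a rough sketch of a Sqoop import (the connection string, credentials, table name, and target directory are placeholders, not values from the exam):

sqoop import \
  --connect jdbc:mysql://dbhost/retail_db \
  --username retail_user \
  --password retail_pass \
  --table orders \
  --target-dir /user/cloudera/orders \
  --fields-terminated-by '\t' \
  --as-avrodatafile

The --fields-terminated-by and --as-avrodatafile options cover changing the delimiter and the file format during import; a sqoop export with --export-dir works analogously in the other direction.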

The second section focuses on transforming the data that we ingested using the skills mentioned above.

TRANSFORM, STAGE, AND STORE
Transforming the ingested data into a business-defined format.

This section covers converting a set of data values in a given format stored in HDFS into new data values or a new data format and writing them back into HDFS. It includes the following tasks (a small PySpark sketch follows the list).

  • Load RDD data from HDFS for use in Spark applications
  • Write the results from an RDD back into HDFS using Spark
  • Read and write files in a variety of file formats
  • Perform standard extract, transform, load (ETL) processes on data
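
As a rough illustration of these tasks (the file paths, field positions, and delimiter are assumptions for the sake of the example), a small PySpark job that loads an RDD from HDFS, transforms it, and writes the result back might look like this:

from pyspark import SparkContext

sc = SparkContext(appName="EtlSketch")

orders = sc.textFile("/user/cloudera/orders")                       # load an RDD from HDFS
parsed = orders.map(lambda line: line.split("\t"))                  # parse tab-delimited records
completed = parsed.filter(lambda fields: fields[3] == "COMPLETE")   # keep only completed orders
completed.map(lambda fields: "\t".join(fields)) \
         .saveAsTextFile("/user/cloudera/orders_complete")          # write results back to HDFS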

The third section focuses on analyzing the data that we transformed in the above section.

DATA ANALYSIS
Analyzing the transformed data.

We are going to use Spark SQL to interact with the metastore programmatically in our applications. The goal of this section is to generate reports by using queries against loaded data. The following tasks are expected to be part of this section; a short Spark SQL sketch follows the list.

  • Use metastore tables as an input source or an output sink for Spark applications
  • Understand the fundamentals of querying datasets in Spark
  • Filter data using Spark
  • Write queries that calculate aggregate statistics
  • Join disparate datasets using Spark
  • Produce ranked or sorted data
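
As a hedged sketch of what such a report query could look like (the table and column names are made up; on older exam clusters you would use sqlContext/HiveContext instead of the Spark 2.x SparkSession shown here):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ReportSketch") \
    .enableHiveSupport() \
    .getOrCreate()

top_customers = spark.sql("""
    SELECT customer_id, SUM(order_total) AS revenue
    FROM orders
    GROUP BY customer_id
    ORDER BY revenue DESC
    LIMIT 10
""")

top_customers.write.saveAsTable("top_customers")   # store the result as a metastore table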

The last section relates to configuration activities, either changing session/system-level settings or changing the format in which we receive the output.

CONFIGURATION
Preparing to work with different cluster components.

We must be able to perform tasks such as the following to clear this section (a sample spark-submit command follows the list).

  • Supply command-line options to change your application configuration, such as increasing available memory
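
For example, memory and parallelism can be adjusted on the command line when submitting an application (the values below are arbitrary and my_app.py is a placeholder script name):

spark-submit \
  --master yarn \
  --driver-memory 1G \
  --executor-memory 2G \
  --num-executors 4 \
  my_app.py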

Those are the details regarding the certification exam. Please note that we do get a set of reference documents during the exam, which is very useful, especially when we are a little bit confused or clueless about the syntax of a command or run into some other technical difficulty. I can say this from my own experience. Please do not hesitate to use these resources for your benefit.

I am sure we will learn more things as we embark upon this beautiful journey.

I am hopeful that this will help at least one person with their certification preparation.

You can find more information about CCA175 certification at https://www.cloudera.com/about/training/certification/cca-spark.html

Please let me know if you have suggestions to improve these tutorials.

Thank you!
Cheers!

Post 8 | ML | Data Preprocessing – Part 6

Hello, everyone. Welcome to the last tutorial in the Data Preprocessing portion of the Machine Learning tutorials. In the last tutorial, we saw how to create the TRAINING and TEST data sets for model building purposes. In this tutorial, we are going to see why and how to perform the Feature Scaling.

Let us begin, then. To refresh your memory, the following infographic shows our progress so far.

Machine Learning: Data Preprocessing

This is one of the most important topics in the data preprocessing section, as it puts variables measured on very different scales onto a comparable footing.

We will begin by explaining what Feature Scaling is, why we do it, and finally how to do it in both Python and R.

  • WHAT IS FEATURE SCALING?

In simple words, Feature Scaling means converting two or more variables into the same range. It is highly recommended as one of the steps in Data Preprocessing, because it brings the concerned variables onto a comparable scale.

There are two common ways to perform Feature Scaling.

  • Standardization

Standardization rescales the input variable so that it has a mean of 0 and a standard deviation of 1. The formula for doing this is as follows.

new_var = ( old_var – mean(old_var) ) / stddev(old_var)

where,
new_var -> the new variable to be created
old_var -> the original variable to be standardized
mean(old_var) -> mean of the original variable
stddev(old_var) -> standard deviation of the original variable

  • Normalization

This is the other method of performing Feature Scaling. Instead of dealing with parameters such as the mean and the standard deviation of the variable, we use its minimum and maximum values. This squeezes every value into the range from 0 to 1, so variables measured on very different scales end up on the same footing.

The formula for Normalization is as follows.

new_var = ( old_var – min(old_var) ) / (max(old_var) – min(old_var))

where,
new_var -> the new variable to be created
old_var -> the original variable to be standardized
min(old_var) -> the minimum value of the original variable
max(old_var) -> the maximum value of the original variable
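
To make both formulas concrete, here is a small NumPy sketch that applies them to a made-up list of Age values (the numbers are illustrative only, not our actual dataset):

import numpy as np

age = np.array([27.0, 30.0, 35.0, 38.0, 44.0])

standardized = (age - age.mean()) / age.std()                 # mean 0, standard deviation 1
normalized   = (age - age.min()) / (age.max() - age.min())    # squeezed into [0, 1]

print(standardized)
print(normalized)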

Now, if these two concepts are clear, let us see why we actually perform Feature Scaling.

  • WHY DO WE PERFORM FEATURE SCALING?

As we all know, Machine Learning models are entirely dependent on numerical information; the variables that are part of these models are nothing but numbers. Even if you pass variables containing string or character information, these models will internally convert those variables into a numeric format before performing their prediction task.

The problem is that if two variables are not on the same scale, the variable with the larger values will have more impact on the model than the other one, which can distort the prediction task.

For example, in the current scenario, the two variables in question are SALARY and AGE. The range and scale of these two variables differ a lot. Since SALARY has much larger values, it would have more impact on the model than the AGE variable. However, domain knowledge tells us that SALARY and AGE are equally important for model building.

The following input data snippet shows you the difference in the scale between SALARY and AGE variables.

Input CSV file

Now that we have covered the What and Why of Feature Scaling, let us answer the How part of the question.

We will perform the Feature Scaling in both Python and R.

  • FEATURE SCALING USING PYTHON

For performing Feature Scaling in Python, we use the StandardScaler class from the preprocessing module of the sklearn library. We can import it with the help of the following command.

from sklearn.preprocessing import StandardScaler

After doing the import, we have to create the object of StandardScaler and that object is then used to perform the Feature Scaling operations. We can use the following command to create this object.

sc_X = StandardScaler()

The newly created object sc_X is used for performing the Feature Scaling. We will perform this operation on X_train and X_test variables. But, before doing that, let us take a look at these variables.

Train and Test dataset

As you can see from the above screenshot, the scale of AGE and SALARY variables is significantly different. Now, let us try to scale these variables.

We use the following commands to perform this Feature Scaling.

X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

You might be wondering why the two commands in the above snippet differ. The reason is that the FIT operation has already been performed on the sc_X object using the training data, so for the test data we do not fit again; we only apply the TRANSFORM operation, which reuses the scaling parameters learned from the training set.
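
If you want to see what the scaler actually learned from the training data, the fitted object exposes it (a quick check assuming a recent scikit-learn version, continuing from the commands above):

print(sc_X.mean_)    # per-column means learned during fit on X_train
print(sc_X.scale_)   # per-column standard deviations learned during fit

# transform(X_test) reuses these training-set statistics, so no information
# from the test set leaks into the scaling.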

If the above explanation is understood, please have a look at the following screenshot which shows the values of X_train and X_test variables after performing the Feature Scaling.

Feature Scaling in Python

As can be seen from the above screenshot, the SALARY and AGE variables (columns 3 and 4) are converted to the same scale, which is good for Machine Learning models.

This completes the Feature Scaling operation using Python. Let us perform the same thing using R.

  • FEATURE SCALING USING R

Performing the Feature Scaling in R is easier as compared to Python.

Since we are going to scale only the SALARY and AGE variables, we must pass only those columns while performing the Feature Scaling. If you mistakenly pass a variable with string or character values, the scale() function returns an error saying 'x' must be numeric.

We use the following command to perform this Feature Scaling.

training_set[, c(2,3)] = scale(training_set[, c(2,3)])
test_set[, c(2,3)] = scale(test_set[, c(2,3)])

As you can see, the indices 2 and 3 are for AGE and SALARY variables, respectively.

The output of the above two commands is as follows.

Feature Scaling in R

As you can see, the variables AGE and SALARY are of the same scale now, as expected.

This completes the Feature Scaling using R.

And now, with this, I am glad to inform you that we are done with the data preprocessing part of the Machine Learning tutorials.

In the next tutorial, we are going to start off with the REGRESSION. In this portion, we are going to focus on the following things.

  1. Simple Linear Regression
  2. Multiple Linear Regression
  3. Polynomial Regression
  4. Support Vector Regression (SVR)
  5. Decision Tree Regression
  6. Random Forest Regression
  7. Evaluating Regression Models Performance

Hope you guys like the contents. Please stay tuned and follow my blog.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

Post 7 | ML | Data Preprocessing – Part 5

Hello, everyone. Thanks for joining me in this 5th tutorial of the Data Preprocessing part of the Machine Learning tutorials.

In the last tutorial, we saw how to convert the CATEGORICAL VARIABLES from the STRING format to an INTEGER format. In this tutorial, we are going a step ahead and are going to split the original data set into two data sets. These data sets are called the TRAINING DATASET and the TEST DATASET.

The following infographic shows the progress we have made so far.

Machine Learning: Data Preprocessing – Part 5

As you can see, we are left with the last two tutorials in the Data Preprocessing for Machine Learning after which the fun stuff is going to start.

Before starting to create the TEST and TRAIN data sets from the input data, we should know why we are creating these datasets.

As this blog is about Machine Learning, our final aim is to create a machine that is going to learn. In this process, we are going to build models based on Machine Learning algorithms. These models are built on data sets called Training data sets. Once built, the models are tested against a different dataset, which is called the Test dataset.

Model building is an iterative process, and it takes some time to attain a certain accuracy. Until a certain accuracy threshold is reached on the test data set, the model-building process continues iteratively by tweaking one or more variables.

This is why splitting the data deserves its own step in the Data Preprocessing part of the Machine Learning tutorials.

Let us start with the creation of Train and Test datasets using both Python and R.

  • CREATING TRAIN AND TEST DATASETS IN PYTHON

As seen in the last tutorial, the variables X and y contain the independent and response variables, respectively. To refresh your memory, the following shows the data contained in the X and y variables.

X and Y variables in the dataset

Now that we are clear with X and y variables, let us see how to divide these records into TRAIN and TEST data sets.

For doing this, we can use the train_test_split function from the sklearn.cross_validation module (in newer versions of scikit-learn it lives in sklearn.model_selection). This import is done with the help of the following command.

from sklearn.cross_validation import train_test_split

Once the import operation is done, creating TRAIN and TEST data sets is very easy in Python. It can be done with the following command.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

As can be seen from the above command, we need to pass the four variable names, i.e. X_train, X_test, y_train, and y_test, on the left-hand side, because the train_test_split() function returns four objects in exactly this order: X_TRAIN, X_TEST, y_TRAIN, and y_TEST.

We need to pass the X and y variables along with the proportion of the TRAIN and TEST data sets. Here, we are passing test_size=0.2, which indicates that out of the total number of records, 80% should go to the TRAINING dataset whereas the remaining 20% should go to the TEST dataset. This proportion is up to you, but business context should also be considered while choosing it. If this value is not passed, the default value of 0.25 is used for this parameter.
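
One optional but useful addition (an assumption on my part, not something the screenshots below use): passing the random_state parameter makes the split reproducible, so you get the same TRAIN/TEST records every time you run the script.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)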

If the explanation is clear, let us see what the variables X_train, X_test, y_train, and y_test look like.

TRAIN and TEST dataset

As can be seen from the above screenshot, a total of 8 records (80%) are part of the TRAIN dataset, whereas the remaining 2 (20%) are part of the TEST data set.

This completes the TRAIN and TEST dataset creation in Python. Let us look at the same process in R.

  • CREATING TRAIN AND TEST DATASETS IN R

In R, we can use the library called “caTools” for performing this operation. Before using the “caTools” library, we need to install it.

For installing the caTools library, we have to use the following command.

install.packages("caTools")

Once the installation is done, we need to load this library in the R environment. For doing this, we use the following command.

library(caTools)

Now, the above command makes sure that the library is properly loaded into the R session and we are able to perform the desired activity i.e. creating TRAIN and TEST data sets.

If you remember, we had our data set stored in the variable called “dataset”. To refresh your memory, it looked as follows.

Converting categorical to continuous variables in R

As can be seen from the above screenshot, there are three independent variables and one response variable called “Purchased” in the dataset variable.

The process of creating TRAIN and TEST datasets in R is dependent on the Response variable. Therefore, we will use the “Purchased” variable while creating these TRAIN and TEST data sets. The following command is used for creating this SPLIT.

split = sample.split(dataset$Purchased, SplitRatio = 0.8)

In the above command, the function sample.split() expects the variable on which the split operation needs to be performed and the split ratio.

There is a difference between the split ratio in Python and R. In Python, the split ratio indicates the portion for the TEST dataset whereas, in R, this split ratio indicates the percentage for the TRAINING data set. Also, in R, we need to pass only the response variable whereas, in Python, we passed both independent and response variables.

Now we have the split variable, which holds a TRUE/FALSE flag for every record indicating whether it belongs to the TRAINING set. This split variable can be used for creating the TRAINING and TEST data sets. We use the following commands to create them.

training_set = subset(dataset , split == TRUE)
test_set = subset(dataset , split == FALSE)

As can be seen from the above code snippet, the subset() function is used for creating the TRAIN and TEST data sets. We pass the data along with the split flag: records where the flag is TRUE go into the TRAINING dataset, and records where it is FALSE go into the TEST data set.

The following screenshot shows both the TRAINING and TEST data set which we created using R. We have used the same commands mentioned above for creating these data sets.

TEST and TRAIN datasets in R

And with this, we can conclude the portion to create the TRAIN and TEST data sets in both Python and R.

I hope you guys like the contents.

In the next tutorial, we are going to conclude the Data Preprocessing part of the Machine Learning tutorials with a focus on Feature Scaling.

Till then, stay tuned! And keep following my blog.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

Post 6 | ML | Data Preprocessing – Part 4

Hello, everyone. Welcome to Part 4 of the Data Preprocessing portion of the Machine Learning tutorials. In the last tutorial, we saw how to impute the Missing Data in both Python and R. In this tutorial, we are going to see how to deal with the qualitative entries in the given data.

The following infographic shows our progress in the Data Preprocessing section of Machine Learning.

Machine Learning: Data Preprocessing – Part 4

So we are almost there with the Data Preprocessing portion.

Let us start digging deep to know how to deal with Categorical Variables for building the Machine Learning Models.

There are plenty of reasons why we convert Categorical Variables into Continuous (numeric) Variables. The most important one is that it helps Machine Learning models train and perform faster. I have worked on tools such as SAS-JMP and SAS-EM; the good/bad thing about these tools is that they get the work done, but you don't get to see how they actually do it. Therefore, we should follow the traditional approach of building Machine Learning models from scratch so that we have full control over the outcome of the model.

Having said that, let us start off with Categorical Variable handling in both Python and R.

  • CATEGORICAL VARIABLES CONVERSION IN PYTHON

For doing this very important task in Python, we have two classes. The first one is LabelEncoder and the second is OneHotEncoder. Both of these classes are part of the sklearn.preprocessing module in Python.

Let us first start with the LabelEncoder class.

You can import this library by running the following command.

from sklearn.preprocessing import LabelEncoder

Once LabelEncoder is imported, you can create an object for it as shown below.

labelencoder_X = LabelEncoder()

Once the labelencoder_X  object is created, we need to perform the FIT and TRANSFORM operation on the column which contains the qualitative data. To refresh your memory, our data set looks like this.

input.csv file

As can be seen, we need to transform the COUNTRY and PURCHASED columns and convert these CATEGORICAL VARIABLES into CONTINUOUS VARIABLES. It means columns 0 and 3 (by index) need transformation. The following command is used for doing this transformation using the LabelEncoder class.

X[:, 0] = labelencoder_X.fit_transform(X[:, 0])

The problem with the above approach is that it assigns the values 0, 1, and 2 to three different countries. That is fine as an encoding, but it does not make sense for building Machine Learning models, because when a model sees values such as 0, 1, and 2, it does not understand their context and treats them as ordinary integers. As a result, the country encoded as 0 contributes nothing, and the country encoded as 2 is weighted more heavily than the one encoded as 1.

Theoretically, such an ordering might make sense for some variables, but if we look at our data, that is hardly the case. Here we are simply encoding the COUNTRY name as an integer, so every category should carry equal weight, and that is violated by assigning ordered values such as 0, 1, and 2.

This can be done in a better way, known as DUMMY VARIABLE CREATION. Instead of having only one column with three different values, we have three columns, each holding one of two values, 0 or 1. This makes it easy to see which category a particular record belongs to.
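
For instance, with the three countries in our dataset, the single COUNTRY column is replaced by three dummy columns (an illustration of the idea, not the actual program output):

Country     France  Spain  Germany
France        1       0       0
Spain         0       1       0
Germany       0       0       1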

Let us see how to perform the DUMMY VARIABLE CREATION in Python. We use the OneHotEncoder class from the sklearn.preprocessing module for doing this.

You can import this package in Python using the following command.

from sklearn.preprocessing import OneHotEncoder

Once the package is imported, we need to create the object of OneHotEncoder class. You can use the following command to create this object.

onehotencoder = OneHotEncoder(categorical_features = [0])

Here, as you can see, we are passing the categorical feature variable present at index 0 i.e. the first (COUNTRY) column.

Once the object is created, we can perform the FIT and TRANSFORM operation to convert the CATEGORICAL variable into the CONTINUOUS variable. The following command is used for performing this type conversion.

X = onehotencoder.fit_transform(X).toarray()

As you can see from the above command, the variable X, our matrix of features defined earlier in this tutorial, is passed to the FIT and TRANSFORM operations, followed by a conversion to an array. This converts the COUNTRY column from a CATEGORICAL variable into CONTINUOUS (dummy) variables.

Now, let us convert the response variable y in a similar fashion.
As you can see from the data snippet below, the response variable Purchased contains only two types of values: No and Yes.

input.csv file

When a column contains only two distinct values, it is enough to use LabelEncoder rather than OneHotEncoder. Hence, for converting the response variable Purchased, we are going to use the following commands.

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

In the above commands, y is the RESPONSE variable, which we have already defined in the initial section of this tutorial. The fit_transform() method is responsible for converting the response variable y from the CATEGORICAL type to the CONTINUOUS type.

The following screenshot gives you an idea about the above code execution results.

Categorical Variables

As you can see from the above screenshot, both the COUNTRY and the PURCHASED variables were converted to CONTINUOUS variables with values of either 0 or 1.

This completes the categorical variable conversion in Python.

Now, let us see how the same process can be done using the R programming language.

  • CATEGORICAL VARIABLES CONVERSION IN R

In R, surprisingly, it is very easy to perform this variable type conversion. Both predictor and response variables are converted using the factor() function. The difference between Python and R is that in R, you specify explicitly which STRING value should map to which INTEGER value; for example, that the country "France" should be converted to 1, "Spain" to 2, and so on.

If you remember, our input data (before the missing data imputation) looks like this in RStudio.

Importing the dataset

Here, the variables Country and Purchased are converted from the CATEGORICAL STRING format to the INTEGER format with the help of the following two lines.

dataset$Country = factor( x = dataset$Country, levels = c('France', 'Spain', 'Germany'), labels = c(1, 2, 3) )

dataset$Purchased = factor( x = dataset$Purchased, levels = c('No', 'Yes'), labels = c(0, 1) )

Running the above two commands gives us the following output.

Converting Categorical Variables

This completes the Categorical variable conversion using R.

Hope the contents and screenshots help you to understand these Machine Learning Data Preprocessing concepts.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

Post 5 | ML | Data Preprocessing – Part 3

Hello, everyone. Thanks for coming back for the third part of the Data Preprocessing section of the Machine Learning tutorial series.

In the last tutorial, i.e. Part 2, we saw how to import the downloaded dataset. In this tutorial, we are going to see how to impute the missing data in the input data.

The following infographic might come in handy to refresh your memory.

Machine Learning: Data Preprocessing: Part 3

As you can see, after this tutorial, we are left with the Categorical variable conversion, the creation of train and test dataset, and feature scaling. Let us start with the missing data imputation, then.

It is time for some theory, and then we will see how to implement it in both Python and R.

  • BEST PRACTICES: MISSING DATA IMPUTATION

In simple words, imputing missing data means replacing it with some other value. Generally, the following are the best practices when it comes to missing data imputation.

  • Mean Substitution: replacing missing data with the mean of the column
  • Median Substitution: replacing missing data with the median of the column
  • Regression Substitution
  • Multiple Imputation Methods

Apart from the above four, there are a number of other methods that can be used for imputing missing data.

For this tutorial, we are going to use the first method of Missing Data Imputation listed above. We will replace the missing data with the mean/average of the column, which is simple and works well for this dataset. Obviously, this method is only for NUMERIC data and not for columns containing text.

Let us see how to perform it in both Python and R.

  • MISSING DATA IMPUTATION IN PYTHON

The process of Missing Data Imputation in Python is quite involved and we need to go through a lot of steps to achieve it. So, let’s jump into it then.

LIBRARY

We are going to use the Imputer class present in the preprocessing module, which is a part of the sklearn library in Python (in newer versions of scikit-learn, Imputer has been replaced by SimpleImputer in sklearn.impute). Therefore, the following command is used for importing the Imputer class from the sklearn library.

from sklearn.preprocessing import Imputer

The above command makes sure that the Imputer class is imported and from this command onward, you can create the object of the Imputer class.

The following is the general syntax for creating an object in most Object-Oriented Programming languages.

object_name = Class_name()

In our case, we are going to create the object of the Imputer class by using the following command.

imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis=0)

Now, it seems like a lot of information to process. Therefore, let me explain the above command.
We are passing the following three options while creating an object of the Imputer class.

  • missing_values: This option is used to pass which value should be treated as the missing value. In our case, a string NaN is a missing value, which will be visible from the screenshot to come.
  • strategy: We are passing ‘mean’ value for this option. It takes three values in Python which are as follows.
    • mean: This is the default value. The mean of the column will replace the missing value (NaN in our case).
    • median: The median of the column will replace the missing value.
    • most_frequent: The most frequent value will replace the missing value for the respective column.
  • axis: We are passing 0 for this option. It takes two values, either 0 or 1, where 0 indicates that we want to impute the missing data in columns, whereas 1 indicates the missing data is imputed in rows.

Now, once we create an object of the Imputer class, we need to fit this imputer object to the feature matrix variable X, which we created in the last tutorial. We can use the following command to fit the imputer object to X.

imputer = imputer.fit(X[:, 1:3])

As you can see from the above command, the fit() method is used for performing this fitting operation. We are passing X[:, 1:3] as the only argument to the fit() method. As already explained, Python uses zero-based indexing, and the upper bound of a slice is exclusive, so 1:3 selects columns 1 and 2 for the missing data imputation.

Once the FIT operation is performed, the last activity we have to perform is the TRANSFORMATION and we use the transform() method for doing that.

Please use the following command to perform the TRANSFORMATION.

X[:, 1:3] = imputer.transform(X[:, 1:3])

The input to the transform() method is the same as that of the fit() method. The transformed output is stored back into the X variable.
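
As a side note, the two steps can also be combined into a single call, which is equivalent to fitting and then transforming (a small variation, not what the screenshot below shows):

X[:, 1:3] = imputer.fit_transform(X[:, 1:3])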

The following screenshot shows this entire process of Missing Data Imputation.

Missing Data Imputation in Python

From the above screenshot, you can see that the Missing Data for the Salary column – “NaN” was replaced by the Salary mean 63777.77

Finally, this completes the MISSING DATA IMPUTATION in PYTHON.

After Python, it is time to perform the same process in R.

  • MISSING DATA IMPUTATION IN R

Trust me, I have been saying this for some time now: so far, performing Machine Learning operations has been easier in R than in Python. The same goes for this tutorial as well.

A simple ifelse() call and a function called is.na() are used for performing this imputation.

The function is.na() returns TRUE for each entry that is missing (NA) and FALSE for each entry that is not.

Therefore, in the ifelse() statement, if the condition returns TRUE, we impute the mean of the column; if it returns FALSE, the original value is kept.

We are going to use the following R commands to perform the missing data imputation for Age and Salary columns in the input dataset variable.

dataset$Age = ifelse(is.na(dataset$Age) , mean(dataset$Age, na.rm=TRUE), dataset$Age)

dataset$Salary = ifelse(is.na(dataset$Salary) , mean(dataset$Salary, na.rm=TRUE), dataset$Salary)

As you can see from the above code snippet, we can make the following comments about the code.

  • The Age and Salary column in the dataset variable can be accessed as dataset$Age and dataset$Salary.
  • The condition is.na(dataset$Age) returns TRUE if the Age is missing (NA) and FALSE if it isn't.
  • The condition is.na(dataset$Salary) returns TRUE if the Salary is missing (NA) and FALSE if it isn't.
  • While calculating the mean of the Age and Salary columns, the na.rm=TRUE option tells R to ignore the missing (NA) entries; if they are not removed, the mean of those columns would itself be NA, which is not what we want.

Once we run the above commands, you can use the View() command to view the dataset variable. The command looks as follows.

View(dataset)

The execution of the above commands looks like the following.

Imputing the missing values using R

As you can see from the above screenshot, the missing value in the Age column was replaced by 38.77 (for Spain) and for the Salary column by 63777.78 (for Germany).

This concludes the Missing Data Imputation in R.

Finally, we come to an end of this tutorial. I hope the code snippet, explanation, and the screenshots help you to understand the concepts related to this topic.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

Post 4 | ML | Data Preprocessing – Part 2

Hello everyone, thanks for coming back to the next tutorial in the Data Preprocessing step of the Machine Learning tutorials.

Just to refresh your memory, in the last tutorial i.e. Part 1 of Data Preprocessing, we saw how to download the dataset and import the required libraries for performing required operations. In this tutorial, we are going to see how to import this downloaded data in both Python and R.

Machine Learning: Data Preprocessing – Part 2

As you can see from the above infographics, we are looking at the Data Import section of the Data Preprocessing in Machine Learning.

Let us start then.

  • IMPORTING DATASET IN PYTHON

Before importing the dataset we downloaded in the last tutorial into the Spyder IDE, we need to make sure we set the working directory for the Spyder IDE session.

You can do this by clicking on the File Explorer option, available in the top right window of the Spyder IDE. You can save the currently used Python file in the folder where you saved your Data.csv file, and once you do that, the corresponding folder will be set as the working directory for the Spyder IDE session.

The following screenshot might be able to help you out with doing this.

Step 2: File Explorer

As you can see from the above screenshot, “C:\Users\User\Desktop\blog\3 ML Data Preprocessing” is the working directory for this tutorial. This directory contains our Data.csv input file along with the Python file(s).

Once you do this, you can import the downloaded Data.csv file in Python with the help of the pandas library. We already imported the pandas library in the last tutorial, so we will make use of it to load the file into the Spyder IDE session.

You can use the following command to import this dataset into Python.

dataset = pd.read_csv('Data.csv')

From the above command, we can say that the data stored in Data.csv will be imported into the dataset variable. The variable pd is an alias for the pandas library. The following screenshot shows the execution of the above command.

Step 3: Loading the data

Now, once you do that, you will be able to see the dataset variable in the Variable Explorer window on the right side of the Spyder IDE, as shown in the above screenshot.

If you double-click on the dataset variable, a new pop-up window will appear and show you the data stored in it.

For this example, the Salary data will be shown in float (decimal) format. You can click on the Format button and change the format; you will be able to see the difference. I have changed the format from float to integer.

Now, the next step is to create the Matrix of Features and the Dependent variable.
You must know that Python uses zero-based indexing, i.e. indexing starts at 0, therefore we will take the columns from index 0 up to (but not including) the last one for the feature matrix.

The following command is used for performing this operation.

X = dataset.iloc[:, :-1].values

To give an idea about the above command, the first colon (:) indicates that all the rows of the data should be imported and the second colon (:) followed by -1 indicates that all the columns except the last one should be imported.
The .values option tells Python to import the values stored in those rows and columns and finally, the output should be stored in X variable.

This imports the Feature vector into X. Now is the time to import the dependent variable i.e. Output Variable in Y.

For doing this, we use the following command.

Y = dataset.iloc[:, 3].values

The explanation of the above command is similar to the last one: all the rows of the last column, i.e. Purchased, are included in the Y variable.

The following screenshot shows the execution of the above commands.

Step 4: Extracting Dependent and Independent Variables

As you can see, both X and Y variables were created successfully from the dataset variable. X has 10 rows and 3 columns whereas Y has 10 rows with 1 column.
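
If you prefer to double-check this without the Variable Explorer, a quick sketch of printing the shapes works just as well (assuming the commands above have been run in the same session):

print(X.shape)   # (10, 3) - ten rows, three feature columns
print(Y.shape)   # (10,)   - ten values of the Purchased column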

This completes the Data Import process in Python.

Now, let us look at the same in R.

  • IMPORTING DATASET IN R

Believe me, doing the same Data Import process in R is way easier than in Python.

The first thing to do is to set the working directory. For that, we use the setwd() function.

Please use the following command to set the working directory.

setwd("C:\\Users\\User\\Desktop\\blog\\3 ML Data Preprocessing")

The execution of the above command looks as follows.

Step 1: Setting Working Directory

You should change the path, because it will be different on your system. Once the above command is executed, you can run the following command to import the Data.csv file into a data frame called dataset.

dataset = read.csv("Data.csv")

Please notice that we are using the read.csv() function to import the CSV file into the dataset variable. To confirm, you can view the imported dataset variable. For doing this, we use the View() function.

You can use the following command to view the dataset variable.

View(dataset)

The output of the above three commands looks as follows.

Step 2: Importing the dataset

This completes the Data Import process in R.

We can conclude this tutorial here. I hope this helps to understand the basic concepts when it comes to Machine Learning.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

Post 3 | ML | Data Preprocessing – Part 1

In the last tutorial, we saw the installation steps for both R and Python along with their respective IDEs. In this tutorial, we are going to start our actual journey of Machine Learning. We are going to start off with the Data Preprocessing part, which is one of the most important aspects of the Machine Learning.

We should not overlook this as it is very important and therefore I am going to dedicate a certain amount of time to explain the concepts of Data Preprocessing.

The following infographics show the different portions we are going to touch while doing the Data Preprocessing.

Machine Learning: Data Preprocessing

As you can see, it is a 7 step process and based on the efforts required, I have divided these seven steps into 6 posts so that each concept is explained clearly with specific attention to details.

In this post, we are going to cover the first two sections of the Data Preprocessing in Machine Learning. These two sections are as follows.

  • Get the Dataset
  • Import the libraries

Let us start with each of these steps then.

  • GET THE DATASET

You can get the dataset for this tutorial from my GitHub profile, under the Machine Learning repository's data-preprocessing branch. The name of this file is Data.csv and you can download this file by clicking here. This Data.csv file looks something like this.

As you can see from the above Data.csv snippet, this file contains a total of 4 columns and 10 records. The columns are Country, Age, Salary, and Purchased. The last column Purchased is a type of flag with values Yes and No.

This Data.csv file is going to be very useful while explaining the concepts like Missing Data, Categorical variables, and feature scaling.

Now, in order to create this dataset in your system (which should be Windows, since I am using Windows for executing all the commands in this tutorial series), you can download this file from the link given above and save it in some location, which will be easily accessible for you.

Now, once you get the dataset and save it in your system and open it using the MS Excel, it looks as follows.

Input Data CSV File

This confirms that the data was successfully downloaded and loaded into your Windows system.

Now, it is time to import the required libraries in both Python and R.

  • IMPORT REQUIRED LIBRARIES

When it comes to importing the required libraries, we need to perform this task both in Python and R. So, we will do it first in Python and then in R.

IMPORTING REQUIRED LIBRARIES IN PYTHON

We are going to import the following three libraries in Python using Spyder IDE.

numpy

matplotlib.pyplot

pandas

Let me explain the usage of each of the above libraries.

The numpy library in Python provides the ability to perform mathematical operations efficiently. Since we are dealing with Machine Learning, we will need to perform mathematical operations from time to time, so having numpy handy will save a lot of time and smooth the process.

The matplotlib.pyplot library is used to plot charts and graphs in Python; the name itself indicates its usage. A visual representation of the data often answers questions that the raw numbers do not, so it is highly recommended to have these charts and graphs at your disposal.

The pandas library in Python is used for dealing with datasets; operations like importing and managing data can be done with the help of this library.

To import these libraries in Python, we use the following commands on the Spyder IDE.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

where np, plt, and pd are ALIASES for the numpy, matplotlib.pyplot, and pandas libraries, respectively.

The screenshot depicting the output of the above commands is shown as follows.

Step 1: Importing Libraries

The above screenshot shows that all three libraries were successfully imported into the Spyder IDE for further processing.
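
If you want to convince yourself that the aliases work before moving on, a tiny sanity check like the following will do (the values are arbitrary):

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

print(np.arange(5))                     # numpy: numerical arrays and math
print(pd.DataFrame({"a": [1, 2, 3]}))   # pandas: tabular data handling
plt.plot([1, 2, 3], [1, 4, 9])          # matplotlib.pyplot: quick plots
plt.show()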

This completes the Library import operation for Python.

Now, let us see how to perform the same in R.

IMPORTING REQUIRED LIBRARIES IN R

Interestingly, we do not need to import any libraries in R as of now; we will do so whenever required, which is easier than in Python.

So, we can safely say that the library import operation is performed successfully for both Python and R.

This completes the tutorial here. In the next tutorial, we are going to see how to import the recently downloaded dataset Data.csv into both Python and R.

I hope you guys are liking the content. Please share it across your network so that it reaches a wider audience.

Please stay tuned for the further updates.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

Cheers!

 

Post 2 | Installations – R and Python

Hello, everyone, we are going to start off learning the concepts of Machine Learning. If you are following my blog posts on Hadoop and Big Data Analytics, then you will know that I give a lot of importance to hands-on exercises. The same is going to be the case for these tutorials.

Here, we are going to use both R and Python to demonstrate the concepts of Machine Learning. To use these programming languages, we first need to install those along with their recommended IDEs (Integrated Development Environment).

Just to be clear, I am going to use the WINDOWS 10 OPERATING SYSTEM throughout this tutorial series. So please try to do the same, as it will save a lot of time on troubleshooting.

The following infographic shows this installation process.

The Big Picture: Installations – R and Python

We will first do R installation and then will go on to install Python with Spyder IDE.

Let us start with the installation of R.

  • DOWNLOAD and INSTALL R

The first thing to do is to download the executable file for R.

You can download this file by clicking here. This will download R for you. The downloaded file name is going to have a pattern like R-<VERSION>-win.exe and for me, the downloaded file name is R-3.4.0-win.exe since the version is 3.4.0.

You can open this file and start the installation process for R. It is a simple Windows installation, so I think most of us will be able to do it without much help.

  • DOWNLOAD and INSTALL RSTUDIO

Once R is installed successfully, it is time to download the RStudio. RStudio is the IDE which is used by almost everyone working on R. You can download the latest version of RStudio by clicking here. For me, version 1.0.143 got downloaded as it was the latest version of RStudio on May 22nd.

Once the file is downloaded successfully, you can install it in the same traditional way as you installed R and other software on Windows.

Once RStudio is installed correctly, you can open it by clicking on its icon from the Start Menu or Desktop, and the screen looks like this.

RStudio Application Window

This confirms that both R and RStudio are installed successfully.

Now, let us start with Python Installation.

  • DOWNLOAD and INSTALL ANACONDA

We do not need to download Python separately as we did with R. Instead, we download Anaconda, which is a free, open-source package manager and Python distribution. Python comes bundled with Anaconda.

You can download the Anaconda application by clicking here. The latest version as of May 22nd is 4.3.1, therefore the downloaded file name is Anaconda3-4.3.1-Windows-x86_64.exe. Once the file is downloaded successfully, you can open it and install it like any other Windows software.

Once installed successfully, you can open a program called Anaconda Navigator from the Start Menu. This is the gateway to opening the Python IDE called Spyder. The application window for Anaconda Navigator looks as follows.

Anaconda Navigator Application Window

As you can see in the above picture, an application called Spyder is visible on the first line in the third column. Spyder is used as an IDE for application/code development using Python as the programming language.

You can click on the launch button, as shown in the above screenshot, to launch Spyder IDE. Once you click on launch, a new window will pop up asking to grant permission for this application to load. This window looks as follows.

Access Permission for Spyder IDE

You can click on Allow Access to grant access and load the Spyder IDE.

Once Spyder IDE is up and running, the application window looks like this.

Spyder Console Application Window

If you are able to view the above application window, it means both Python and Spyder were installed successfully and now we are ready for writing some code.

If you have reached here without any problem, congratulations, you are ready to learn further about the concepts of Machine Learning, and eventually Artificial Intelligence.

You can reach out to me if you are facing any issues while doing the installations. I will be more than happy to help anyone out regarding this.

In the next tutorial, we are going to start off with the DATA PREPROCESSING part of Machine Learning. I will break down this broad concept of Data Preprocessing into sub-parts, which we will tackle in individual posts.

Hope you guys are liking the content. Your suggestions and feedback are most welcome. Stay tuned for the further updates.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on twitter here and subscribe to my YouTube channel here for the video tutorials.

Cheers!

Post 1 | ML | Introduction

Hello, people. In this new tutorial series, we are going to talk about the different aspects of the Machine Learning.

As an aspiring Data Scientist, I always wanted to get my hands dirty with the concepts of Machine Learning, and the Summer Break gave me exactly what I wanted – "TIME TO LEARN MACHINE LEARNING CONCEPTS".

Through these tutorials, we are going to strengthen our Machine Learning concepts and their applications in various areas.

I am learning these concepts from a paid and verified course on the Udemy website, and you can view the course contents by clicking here.

The following infographics show the timeline of this Machine Learning tutorial series.

The Timeline: Machine Learning

This tutorial is the Introduction part of the series.

The infographics show the Big Picture of the categories that we are going to see.

So let me give you the detailed list of the tutorials we are going to cover in this series.

  1. Installation – R and Python
  2. Data Preprocessing
  3. Regression
    • Simple Linear Regression
    • Multiple Linear Regression
    • Polynomial Regression
    • Support Vector Regression
    • Decision Tree Regression
    • Random Forest Regression
    • Evaluating Regression Models Performance
  4. Classification
    • Logistic Regression
    • K-Nearest Neighbors (k-NN)
    • Support Vector Machine (SVM)
    • Kernel SVM
    • Naive Bayes
    • Decision Tree Classification
    • Random Forest Classification
    • Evaluating Classification Models Performance
  5. Clustering
    • K-Means Clustering
    • Hierarchical Clustering
  6. Association Rule Learning
    • Apriori
    • Eclat
  7. Reinforcement Learning
    • Upper Confidence Bound (UCB)
    • Thompson Sampling
  8. Natural Language Processing
  9. Deep Learning
    • Artificial Neural Networks
    • Convolutional Neural Networks
  10. Dimensionality Reduction
    • Principal Component Analysis (PCA)
    • Linear Discriminant Analysis (LDA)
    • Kernel PCA
  11. Model Selection and Boosting
    • Model Selection
    • XGBoost

Hope you are not overwhelmed by what is going to come in the future. In my opinion, these are all important things a Data Scientist should know, and we are going to work in that direction till we finish all of the stuff mentioned above.

So, get ready and buckle up on this beautiful journey of Machine Learning.

Please follow my blog for further updates. You can check out my LinkedIn profile here. Like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

In the next tutorial, we are going to do the installations of R and Python along with their respective IDEs.

Stay tuned. Cheers!

Post 52 | HDPCD | The conclusion

Hi everyone. Finally, we have reached the end of this tutorial series. It's been so long. We started this journey together on January 15th, 2017, and, 276 days later, this beautiful journey is coming to an end. But we do not need to worry, because I am working on something new and would love to share it with you soon. Till then, let us discuss how to register for the exam and what things we should keep in mind while taking it.

Let us begin then.

  • REGISTER FOR THE HDPCD CERTIFICATION

It is an 11-step process. Let us go through it step by step.

  • VISIT THE OFFICIAL WEBSITE

Go to https://www.examslocal.com. Once you land on this website's home page, you will see a screen something like the one shown in the below screenshot.

Step 1: Go to https://www.examslocal.com

  • CREATE AN ACCOUNT

The above screenshot shows the CREATE AN ACCOUNT button on the Right Side. Click on that button to register on their website. You will see the following screen.

Step 2-a: Register on the website

Fill in all the information in the form shown in the above screenshot and then click on the Register button as shown in the below screenshot.

Step 2-b: Click on the Register button

Once you click on the Register button, you will be redirected to a window saying “an email confirmation has been sent to your email address”.

This screen looks as follows.

Step 2-c: Registration confirmation

This confirms that an email was sent to your email address.

Please log in to your email account. I logged in to my Gmail and got the following email from the website.

Step 2-d: Receiving verification email

Once you receive an email like above, you must click on the Verify button as shown in the above screenshot. This link is valid for 21 days.

Once you click on the Verify link, you will be taken to the new screen as shown below.

Step 2-e: Receiving Account verification

Now that your account is verified, it is time to sign in and schedule your examination.

  • SIGN IN TO YOUR ACCOUNT

The above screenshot shows the SIGN IN button on the top right corner. Please click on it. It will take you to the login page. It looks like this.

Step 3: Sign into your account

Once you click on the Sign In button shown in GREEN, you will be redirected to your landing page. This landing page looks as follows.

Step 4: Your landing page

Now, the next thing to do is to search for the certification and then enroll in it.

  • SEARCH FOR THE CERTIFICATION EXAM

The above screenshot shows the SEARCH BOX. Please type "Hortonworks" in this search box and it will suggest the certifications offered by Hortonworks.

Please have a look at the below screenshot.

Step 5: Choosing the correct certification name

The above screenshot shows all the suggestions given by the website. Please select "Hortonworks : HDP Certified Developer (HDPCD) – English" out of the given choices and click on the Next button shown in GREEN.

  • SELECT DATE AND TIME FOR THE CERTIFICATION

Once you click it, you will be redirected to the below screen.

Step 6-a: Choosing the start date to see available time slots

The above screenshot shows the certification details you are about to enroll in (BOTTOM LEFT CORNER). Please confirm those details.

Then, select the From date which you think will be a good day to give the test. Don’t forget to select the Time Zone where you want to give the test.

Once both of these fields are selected, click on the Next button shown in GREEN COLOR.

I chose October 17 (tomorrow’s date at the time of writing this article) 12:00 AM and EST as my time zone. It gave me the following screen.

Step 7-a: Choosing the time slot for the certification

As you can see from the above screenshot, the bottom left corner shows the Date and Time options you have chosen, and the middle dashboard portion shows the available time slots. As soon as I clicked on the October 14 option, it gave me the available times for that date.

You get the screen shown in the following screenshot.

Step 7-b: Select the slot for the certification

As you can see in the above screenshot, it shows me 4 options, out of which I am selecting the 2nd option, 2:15 PM – 4:15 PM EST.

  • REVIEW THE CERTIFICATION DETAILS AND CONFIRM

Once you select your preferred time, you can click on the Next button in the bottom right corner. After clicking on the Next button, you are sent to the following screen.

Step 8: System compatibility matrix

The system compatibility matrix is shown, and you have to agree to these requirements. Once you make sure that the system from which you are going to take the exam meets all of them, click on the I Agree button shown in the bottom right corner.

After clicking on the I Agree button, you are sent to the following screen.

Step 9: Exam Agreement

The Exam Agreement screen pops-up and you have to click on the I Agree button after going through the agreement (which I know, nobody does :P).

Once you click on the I Agree button, you are sent to the screen shown in the following screenshot.

Step 10: Candidate Additional Details

Hortonworks expects you to give your employer name before continuing to the next step.

  • PAY THE CERTIFICATION FEE

Please click on Save and Continue and you will be taken to the following screen.

Step 11: Payment Details

Once you enter the correct payment information, click on the Review Order button. A new screen will pop up showing you the order that you are about to place.

Once you confirm the details, click on the Pay/Submit button, which will book a slot for you. If everything goes well, you will get a confirmation email stating the date and time of the HDPCD certification.

This completes the booking of your HDPCD certification exam slot.

Hope the explanation made sense and the screenshots are helpful to understand the flow of the process.

  • THE HDPCD CERTIFICATION TIPS

Let us see what are the things to keep in mind during the certification.

  1. Before attempting the actual certification, I highly recommend taking the practice test on AWS.
  2. Exam cluster is very slow, so please be patient.
  3. For Pig, you must use the Tez execution engine to save A LOT OF time (a sample command follows this list).
  4. Along with individual steps, you must test the output by running the entire file in a single attempt.
  5. Do not panic if you are stuck at some question.
  6. After reading a question, first decide whether you will be able to finish it in time; otherwise, move on to the next question to save both time and confidence. Give yourself some time to settle into exam mode.
  7. Do not stress out about anything. Proctor is going to help you a lot.
  8. Double-check your answers before moving to the next question. Do not leave anything for the end.
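
For tip 3, the command-line switch that selects the Tez engine looks like this (solution.pig is just a placeholder for your own script name):

pig -x tez -f solution.pig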

Hope these few tips come in handy and help you nail the certification.

Trust me guys, it is an easy certification, and if you go through all these tutorials, you will be able to clear it comfortably.

With this, a beautiful journey has come to an end. I am really glad and fortunate that I got to do this and help so many of you.

I am going to take a break now for the holiday season and will be back in the new year with new contents.

Till then, enjoy the holiday season. Stay safe, and please let me know if you want me to work on some specific technology; if possible, I would like to pitch in.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

See you all in 2018!!!!!!!
