Hey, everyone, it’s been so long. Been busy with the final exams and presentations for the last couple of weeks. Now that the semester is over, you can expect frequent and more detailed updates from this blog. So, let us get started then.
In the last tutorial, we saw the GROUP operation using Apache Pig. This tutorial focuses on removing records which contain NULL values. This is a common data preprocessing step in text mining.
Let us take a look at the steps we are going to follow to achieve it.
The above flowchart shows the step-by-step process for solving this objective in the HDPCD certification. We will perform the tasks in sequence as follows.
- INPUT FILE CREATION
The input file is created in the local file system with the help of the vi editor. For demonstration purposes, I have deliberately put NULL values (empty lines) in the file. I have uploaded this file to the HDPCD repository on my GitHub profile and you can download it from here.
Once you download the file, you can use the following commands to create it in your local file system.
PASTE THE CONTENTS HERE
The following screenshot might be helpful for you.
This input file looks something like this.
|Hi This is Milind.|
|Hi Big Data|
|This file contains 2 empty lines|
|and 3 non empty lines|
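If you prefer not to paste the contents into vi, the file can also be written from the shell. This is only a sketch: the position of the two empty lines is my assumption (the listing above shows only the non-empty lines), so the file you download may be laid out differently.

```shell
# Recreate the sample input locally.
# Assumption: the two empty lines sit between the text lines as shown here;
# the original file may place them elsewhere.
cat > post16.csv <<'EOF'
Hi This is Milind.

Hi Big Data

This file contains 2 empty lines
and 3 non empty lines
EOF

# Quick check before uploading: count all lines and the empty ones.
total_lines=$(grep -c '' post16.csv)    # every line matches the empty pattern
empty_lines=$(grep -c '^$' post16.csv)  # only lines with no characters
echo "total=$total_lines empty=$empty_lines"
```

With this layout the check reports 6 lines in total, 2 of them empty, which matches the numbers used later in this tutorial.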
- PUSHING FILE TO HDFS
Now, the file is in the local file system. Let us push it to HDFS.
We are going to use the following set of commands to do this.
hadoop fs -mkdir /hdpcd/input/post16
hadoop fs -put post16.csv /hdpcd/input/post16
hadoop fs -cat /hdpcd/input/post16/post16.csv
The following screenshot will give you an idea about the execution of the above commands.
The Pig script, post16.pig, looks something like this.
|-- Removing records with NULL values in pig relation|
|-- loading the data in input_data relation|
|input_data = LOAD '/hdpcd/input/post16/post16.csv' USING PigStorage() AS (line:chararray);|
|-- performing filter operation to remove records with NULL values|
|filtered_data = FILTER input_data BY line IS NOT NULL;|
|-- storing the final output in HDFS|
|STORE filtered_data INTO '/hdpcd/output/post16';|
Let us look at each command.
input_data = LOAD '/hdpcd/input/post16/post16.csv' USING PigStorage() AS (line:chararray);
The above command loads the data present in the post16.csv file into the Pig relation input_data. The schema declares a single field named line with the chararray datatype, so each line of the file becomes one record whose only field is line.
filtered_data = FILTER input_data BY line IS NOT NULL;
The above command removes the records whose line field is NULL. Since the relation has only one column, this drops exactly the empty lines, which PigStorage loads as NULL. If you remember, our input file contains a total of 6 lines, out of which 2 are empty and 4 are non-empty. Therefore, once we execute the above command, 4 records will remain and 2 will be removed.
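To preview which records should survive before running Pig, the same idea can be approximated locally. This is a rough sketch and not Pig itself: it treats an empty line as the NULL case, and the sample it builds uses an assumed position for the two empty lines.

```shell
# Local approximation of the FILTER step (plain shell, not Pig).
# Assumption: empty lines stand in for the NULL records, and their
# position in this sample is illustrative only.
sample=$(mktemp)
printf '%s\n' 'Hi This is Milind.' '' 'Hi Big Data' '' \
  'This file contains 2 empty lines' 'and 3 non empty lines' > "$sample"

kept=$(grep -c -v '^$' "$sample")   # lines that would pass "line IS NOT NULL"
removed=$(grep -c '^$' "$sample")   # lines the filter would drop
echo "kept=$kept removed=$removed"

rm -f "$sample"
```

The counts line up with the explanation above: 4 records kept and 2 removed.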
STORE filtered_data INTO '/hdpcd/output/post16';
This command stores the filtered data in the HDFS output directory /hdpcd/output/post16.
I hope the above explanation makes sense.
Once you download the post16.pig file, you can use the following commands in the vi editor to create it.
PASTE THE CONTENTS HERE
The following screenshot might come in handy.
So, we are ready to run this pig script.
- RUNNING PIG SCRIPT
Now, it is time to run the above-created pig script. We will use the following command to run this pig script.
pig -x tez post16.pig
The screenshot below shows the output of the above command.
Once the script finishes, we will get some output, which looks like this.
From the above screenshot, we can clearly see that our pig script ran successfully and a total of 4 records (4 lines) got stored in the output file under the HDFS directory /hdpcd/output/post16.
After this, let us look at the output.
- OBSERVE THE OUTPUT
We will go through the HDFS directory /hdpcd/output/post16 and print the contents of the output file to see the results.
For doing this, we will run the following commands.
hadoop fs -ls /hdpcd/output/post16
hadoop fs -cat /hdpcd/output/post16/part-v000-o000-r-00000
Let us observe the output of the above two commands.
The above screenshot confirms that the output is as per our expectations and we can conclude this tutorial here.
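If you want to check the result without reading it by eye, a small helper can count the records and flag any empty line that slipped through. The function name check_no_empty_lines is hypothetical, and this is just one way to do it.

```shell
# Hypothetical helper: reads records on stdin, prints how many it saw,
# and returns a non-zero status if any of them is an empty line.
check_no_empty_lines() {
  awk 'length($0) == 0 { empty++ } END { print NR; exit (empty > 0) }'
}

# Demo on two sample records (no Hadoop needed for this part).
records=$(printf '%s\n' 'Hi Big Data' 'and 3 non empty lines' | check_no_empty_lines)
echo "records=$records"
```

On the cluster, the output files can be piped into it, for example: hadoop fs -cat '/hdpcd/output/post16/part-*' | check_no_empty_lines. For this tutorial it should print 4 and return success.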
I hope you liked the content. In the next tutorial, we are going to see how to store the output data in HDFS, like we are doing in every post.
Please follow my blog for further updates. You can click here to subscribe to my YouTube channel. You can like my Facebook page here and follow me on Twitter here. Please check out my LinkedIn profile here.
Have fun people. Cheers!