Hi everyone, welcome to one more tutorial in this HDPCD certification series. As you might have noticed, I have changed the blog layout a little bit; I hope you like it. Kindly let me know your feedback on this in the COMMENT SECTION.
In the last tutorial, we saw how to perform the SORT OPERATION in Apache PIG. In this tutorial, we are going to remove the duplicate tuples from a Pig Relation.
Let us start with the tutorial then.
We are going to follow the below steps.
As you can see, we are following the same pattern as most of the tutorials in this series.
Let us get started with the first step then.
- CREATING INPUT CSV FILE IN LOCAL FILE SYSTEM
We are going to use the vi editor to create this input CSV file.
I have uploaded this input CSV file to my GitHub profile under the HDPCD repository with the name “24_input_for_removing_duplicates.csv” and you can download it by clicking here. In the commands below, this file is saved locally as post20.csv. This file looks as follows.
Please follow the below commands to create this input CSV file in the local file system.
PASTE THE ABOVE CONTENTS HERE
The following screenshot shows the execution of the above commands.
The above screenshot indicates that the input CSV file was created in the local file system successfully.
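If you prefer not to type the file out in vi, a heredoc achieves the same thing. The rows below are only a hypothetical sample (the real contents come from the CSV file on my GitHub profile); they simply mirror the tutorial's counts of 7 input records, of which 2 are duplicates.

```shell
# Create post20.csv with a hypothetical 7-row sample; two rows are
# exact duplicates, so only 5 unique rows should survive de-duplication.
cat > post20.csv <<'EOF'
101,milind,pune
102,arjun,mumbai
103,kiran,nashik
101,milind,pune
104,rahul,delhi
102,arjun,mumbai
105,sneha,nagpur
EOF

# Sanity check: 7 total rows, 5 unique rows.
wc -l < post20.csv
sort -u post20.csv | wc -l
```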
Now it is time to push this input file to HDFS.
- PUSHING INPUT CSV FILE FROM LOCAL FILE SYSTEM TO HDFS
We are going to use the following commands to push post20.csv from the local file system to HDFS.
hadoop fs -mkdir /hdpcd/input/post20
hadoop fs -put post20.csv /hdpcd/input/post20
hadoop fs -cat /hdpcd/input/post20/post20.csv
The below screenshot shows the output of the above commands.
The above screenshot indicates that the input CSV file was successfully pushed to HDFS.
The next thing to do is to create the pig script.
- CREATE PIG SCRIPT TO REMOVE DUPLICATE TUPLES FROM PIG RELATION
Once the input CSV file is ready in HDFS, it is time to create the pig script responsible for removing the duplicate tuples from the pig relation.
-- this file is used for removing the duplicate tuples from a pig relation

-- LOAD command is used for loading the data in the input file into the input_data pig relation
-- we are not passing any custom schema in this case
input_data = LOAD '/hdpcd/input/post20/post20.csv' USING PigStorage(',');

-- DISTINCT command is used for removing the duplicate tuples from the pig relation
-- output is stored in the unique_data pig relation
unique_data = DISTINCT input_data;

-- final output is stored in the /hdpcd/output/post20 HDFS directory
STORE unique_data INTO '/hdpcd/output/post20';
Let me explain this script briefly.
input_data = LOAD '/hdpcd/input/post20/post20.csv' USING PigStorage(',');
The LOAD command loads the data stored in the post20.csv file into the input_data pig relation. We are not passing any custom schema while creating this pig relation, so Pig treats every field as the default bytearray type.
unique_data = DISTINCT input_data;
The DISTINCT command removes the duplicate tuples from a pig relation; duplicates are identified by comparing entire tuples, not individual fields. This filtered data is then loaded into a new pig relation with the name unique_data.
STORE unique_data INTO '/hdpcd/output/post20';
Finally, the output data in the unique_data relation is stored in the /hdpcd/output/post20 HDFS directory with the help of the STORE command. Note that this output directory must not already exist, otherwise the job will fail.
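To build some intuition for what DISTINCT does, note that it compares whole tuples and is implemented with a sort-and-shuffle under the hood, so the de-duplicated output typically comes out sorted. Outside Pig, the same effect on a plain CSV can be approximated with sort -u. This is only a rough analogy on a made-up sample, not the actual Pig execution path.

```shell
# Hypothetical input mirroring the tutorial: 7 rows, 2 of them duplicates.
printf '%s\n' \
  '101,milind,pune' \
  '102,arjun,mumbai' \
  '101,milind,pune' \
  '103,kiran,nashik' \
  '102,arjun,mumbai' \
  '104,rahul,delhi' \
  '105,sneha,nagpur' > input.csv

# Like DISTINCT, sort -u compares entire lines (tuples) and keeps one
# copy of each, emitting the survivors in sorted order.
sort -u input.csv > unique.csv

wc -l < unique.csv
```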
Hope this explanation helps.
Please follow the below commands to create this pig script.
PASTE THE COPIED CONTENTS HERE
The below screenshot comes in handy here.
The above screenshot shows that the pig script was created successfully.
It is time now to run this pig script.
- RUN PIG SCRIPT
Please use the below command to run this pig script. The -x tez flag runs the script in the Tez execution mode; omitting it runs the script on MapReduce instead.
pig -x tez post20.pig
The following screenshot shows the process of running this pig script.
And the output of this script execution looks like this.
As you can see from the above screenshot, the pig script executed successfully.
A total of 7 records were sent as input to the script, and the output contains only 5 records, as expected, since the duplicate records in the input CSV file were removed.
This confirms that we were able to perform this objective successfully.
Let us take a look at the output HDFS directory.
- OBSERVE THE OUTPUT FOR REMOVAL OF DUPLICATE TUPLES
The following commands are used to check the output HDFS directory.
hadoop fs -ls /hdpcd/output/post20
hadoop fs -cat /hdpcd/output/post20/part-v001-o000-r-00000
The below screenshot indicates the execution of the above commands.
As you can see from the above screenshot, the duplicate tuples were removed from the input file, and the output HDFS file contains only the unique records.
We can conclude this tutorial right here. I hope you liked the content and the explanation. In the next tutorial, we are going to see how to specify the number of reduce tasks for a pig MapReduce job.
Stay tuned for the updates.
Please follow my blog for receiving regular updates. You can subscribe to my YouTube channel for the video tutorials by clicking here. You can like my Facebook page here. You can check out my LinkedIn profile here and follow me on twitter here.