Post 23 | HDPCD | Perform a REPLICATED JOIN using Apache Pig

Hey everyone, thank you once again for coming back to follow these tutorials.

In the last tutorial, we saw how to perform the simple JOIN operation; in this tutorial, we are going to perform the REPLICATED JOIN operation. The process is similar and differs in only one place, so there is not much new to worry about.

In the certification exam, they will specifically mention when you have to perform the Replicated JOIN. If they do not mention a join type, please use the previous tutorial to perform the simple JOIN operation.

The following infographic shows the process of performing the Replicated JOIN operation.

The Big Picture: REPLICATED JOIN in Apache PIG

The above picture clearly shows the process of performing the Replicated JOIN in Apache Pig. So, let us get started, one step at a time.

  • CREATING INPUT CSV FILES IN LOCAL FILE SYSTEM

We are going to use the traditional vi editor to create the input CSV files in the local file system.

The first file contains the customers’ data and is named post23_customers.csv. I have uploaded this input file to my GitHub profile under the HDPCD repository with the name 32_customers_input.csv. This input CSV file can be downloaded by clicking here and it looks something like this.


3098 Mary Smith XXXXXXXXX XXXXXXXXX 8217 Fallen Panda Walk Newburgh NY 12550
3099 Brittany Copeland XXXXXXXXX XXXXXXXXX 5735 Round Beacon Terrace Caguas PR 00725
3100 Mary Smith XXXXXXXXX XXXXXXXXX 5436 Grand Hickory Farm Huntington Park CA 90255
3101 George Reyes XXXXXXXXX XXXXXXXXX 8702 Silver Apple Square Mission Viejo CA 92692
3102 Ralph Dixon XXXXXXXXX XXXXXXXXX 5633 Harvest Turnabout Caguas PR 00725
3103 Mary Wilkins XXXXXXXXX XXXXXXXXX 1213 Cotton Pike Spring Valley NY 10977
3104 Megan Smith XXXXXXXXX XXXXXXXXX 5292 Shady Pony Cape Caguas PR 00725
3105 Mary Stone XXXXXXXXX XXXXXXXXX 8510 Green River Acres Toa Baja PR 00949
3106 Samantha Smith XXXXXXXXX XXXXXXXXX 355 Cozy Square Las Cruces NM 88005
3107 Tiffany Estes XXXXXXXXX XXXXXXXXX 5182 Cotton Heath Caguas PR 00725
3108 Mary Smith XXXXXXXXX XXXXXXXXX 577 Rustic Nectar Row Houston TX 77083
3109 Jack James XXXXXXXXX XXXXXXXXX 5876 Burning Mall Fort Worth TX 76133


As you can see from the above snippet, this file is exactly the same as the one we saw in the previous tutorial.

Please use the following commands to create this input CSV file.

vi post23_customers.csv

#####
PASTE THE COPIED CONTENTS HERE
#####

cat post23_customers.csv

The following screenshot shows the output of the above commands.

Step 1: Creating customers input file in local file system

The above screenshot shows that the customers’ input CSV file was created successfully in the local file system.

It is time to create the data file containing the orders information.

I have uploaded this input CSV file to my GitHub profile under the HDPCD repository with the name 33_orders_input.csv. It looks as follows and you can download this file by clicking here.


49354 2014-05-28 00:00:00.0 3091 PENDING
52516 2014-06-19 00:00:00.0 3098 CLOSED
52736 2014-06-20 00:00:00.0 3094 COMPLETE
53505 2014-06-27 00:00:00.0 3099 PROCESSING
55795 2014-07-13 00:00:00.0 3099 COMPLETE
55938 2014-07-14 00:00:00.0 3095 ON_HOLD
56678 2014-07-18 00:00:00.0 3091 PENDING_PAYMENT
57402 2014-07-22 00:00:00.0 3095 PENDING
57513 2014-07-23 00:00:00.0 3097 COMPLETE
57942 2013-07-31 00:00:00.0 3095 PENDING_PAYMENT
58639 2013-08-27 00:00:00.0 3090 COMPLETE
60178 2013-10-28 00:00:00.0 3090 COMPLETE
60594 2013-11-14 00:00:00.0 3099 PENDING_PAYMENT


You can use the following commands to create this input CSV file in the local file system.

vi post23_orders.csv

#####
PASTE THE COPIED CONTENTS HERE
#####

cat post23_orders.csv

And the output of these commands looks as follows.

Step 2: Creating orders input file in local file system

The above screenshot clearly shows that this input CSV file was created successfully in the local file system.

The next logical step is to push these two files to HDFS.

  • PUSHING CUSTOMERS AND ORDERS DATA TO HDFS

Please use the following commands to load these two input CSV files to HDFS.

hadoop fs -mkdir /hdpcd/input/post23
hadoop fs -put post23_customers.csv /hdpcd/input/post23
hadoop fs -put post23_orders.csv /hdpcd/input/post23
hadoop fs -cat /hdpcd/input/post23/post23_customers.csv
hadoop fs -cat /hdpcd/input/post23/post23_orders.csv

The following screenshot shows the output of the above commands.

Step 3: Pushing input CSV files to HDFS

The above screenshot shows that these two files were pushed to HDFS successfully.

The next step is to create the pig script to perform the REPLICATED JOIN between these two input CSV files.

  • CREATING PIG SCRIPT TO PERFORM THE REPLICATED JOIN

The pig script for this tutorial is almost identical to the previous one, with TWO ADDITIONAL KEYWORDS in the JOIN command.

This pig script is uploaded to my GitHub profile under the HDPCD repository with the name 34_replicated_join.pig. You can download this pig script by clicking here and it looks as follows.

--REPLICATED JOIN OPERATION IN APACHE PIG
--loading customers' data in customers relation
customers = LOAD '/hdpcd/input/post23/post23_customers.csv' USING PigStorage(',');
--loading orders' data in orders relation
orders = LOAD '/hdpcd/input/post23/post23_orders.csv' USING PigStorage(',');
--performing replicated join operation based on customer ID
--customer ID is the first column in customers relation, therefore $0
--customer ID is the third column in orders relation, therefore $2
joined_data = JOIN customers BY $0, orders BY $2 USING 'replicated';
--generating output data with FOREACH...GENERATE command
--output contains customers' first name, last name, order ID, and payment status of the order
output_data = FOREACH joined_data GENERATE $1 AS fname, $2 AS lname, $8 AS orderid, $12 AS payment_status;
--storing the final output in HDFS
STORE output_data INTO '/hdpcd/output/post23/';


The explanation and functionality of all the commands are the same as in the last tutorial. You can refer to that tutorial for a walkthrough of this pig script.

The only command that differs is the following.

joined_data = JOIN customers BY $0, orders BY $2 USING 'replicated';

In the above command, the keywords USING 'replicated' tell Pig to perform the REPLICATED JOIN (also called a fragment-replicate join) instead of the default reduce-side JOIN. In a replicated join, every relation after the first one (here, orders) is loaded into memory on each map task, so the relation(s) listed last must be small enough to fit in memory.
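Conceptually, a replicated join is just a map-side hash join: the small relation is loaded into an in-memory hash table and the large relation is streamed past it, with no shuffle or reduce phase. The following Python sketch illustrates the idea only (it is not how Pig implements it internally); the toy customer and order tuples are made up for illustration.

```python
# Sketch of the idea behind a replicated (fragment-replicate) join:
# the small relation is loaded into memory once, the large relation
# is streamed and probed against it -- no shuffle/reduce phase needed.

def replicated_join(large_rows, small_rows, large_key, small_key):
    # Build an in-memory hash table from the SMALL relation.
    table = {}
    for row in small_rows:
        table.setdefault(row[small_key], []).append(row)
    # Stream the large relation and emit a joined row per key match.
    joined = []
    for row in large_rows:
        for match in table.get(row[large_key], []):
            joined.append(row + match)
    return joined

# Toy data mirroring the tutorial's layout: customers keyed on
# column 0, orders keyed on column 2 (the customer ID).
customers = [(3098, "Mary", "Smith"), (3099, "Brittany", "Copeland")]
orders = [(52516, "2014-06-19", 3098, "CLOSED"),
          (53505, "2014-06-27", 3099, "PROCESSING"),
          (49354, "2014-05-28", 3091, "PENDING")]

result = replicated_join(customers, orders, 0, 2)
# Order 49354 has no matching customer, so only two joined rows come out.
```

Because the whole small relation sits in memory, there is no sorting or shuffling of the large relation, which is what makes the replicated join faster than the default join when one side is small.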

I hope this explanation is enough to go ahead and create the pig script.

We can use the vi editor to create this pig script by using the following commands.

vi post23.pig

#####
PASTE THE COPIED CONTENTS HERE
#####

cat post23.pig

The following screenshot shows the execution of these commands.

Step 4: Creating pig script to perform replicated join

Now that the pig script is created, it is time to run this pig script.

  • RUNNING PIG SCRIPT TO PERFORM THE REPLICATED JOIN

Please use the following command to run this pig script.

pig -x tez post23.pig

The following screenshot shows the execution process of this pig script.

Step 5: Running pig script to perform replicated join

And the output window of the pig script execution looks as follows.

Step 5: Pig script execution output

From the above screenshot, you can see that the operation was successful. A total of 25 records from the post23_customers.csv file and 13 records from the post23_orders.csv file were read. Finally, the output contains 4 records, as expected.
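You can sanity-check that count against the sample rows shown in the two snippets above: of the customer IDs visible in the customers snippet (3098-3109), only 3098 and 3099 appear in the orders snippet, with one and three orders respectively, giving 1 + 3 = 4 joined rows. A quick Python check (using only the sample rows shown above, and assuming the rows not shown add no further matches):

```python
# Sanity-check of the joined-row count using the sample rows shown
# above (assuming the rows not shown contribute no further matches).
customer_ids = set(range(3098, 3110))  # IDs visible in the customers snippet

# (order_id, customer_id) pairs from the orders snippet
orders = [(49354, 3091), (52516, 3098), (52736, 3094), (53505, 3099),
          (55795, 3099), (55938, 3095), (56678, 3091), (57402, 3095),
          (57513, 3097), (57942, 3095), (58639, 3090), (60178, 3090),
          (60594, 3099)]

matches = [o for o in orders if o[1] in customer_ids]
print(len(matches))  # -> 4 (order 52516 for 3098, plus three orders for 3099)
```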

Now, let us go to HDFS and view the contents of the output HDFS directory.

  • HDFS OUTPUT DIRECTORY CONTENTS

Please use the following commands to check the output contents stored in the HDFS directory /hdpcd/output/post23.

hadoop fs -ls /hdpcd/output/post23
hadoop fs -cat /hdpcd/output/post23/part-v001-o000-r-00000

And the output of these two commands looks as follows.

Step 6: Output HDFS directory contents

From the above screenshot, you can see that a total of 4 records were successfully created in the output HDFS directory.

This concludes the tutorial. I hope you were able to follow all the steps and achieve the objective of the tutorial.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

Stay tuned. Cheers!

Published by milindjagre

I founded my blog www.milindjagre.co four years ago and am currently working as a Data Scientist Analyst at the Ford Motor Company. I graduated from the University of Connecticut with a Master of Science in Business Analytics and Project Management. I am working hard and learning a lot of new things in the field of Data Science. I am a strong believer in constant, focused effort, with teamwork as the highest priority. Please reach out to me at milindjagre@gmail.com for further information. Cheers!
