Post 22 | HDPCD | Join two datasets using Apache Pig

Hey, everyone. Thanks for the overwhelming response the blog posts have been receiving over the last week. I really appreciate it, and I will keep posting interesting and innovative content for you.

In the last tutorial, we saw two ways to use the parallel features of Apache Pig. In this tutorial, we are going to focus on the JOIN operation and walk through the process of joining two datasets to produce the required output.

Let us start then.

Big Picture: join-two-datasets-using-apache-pig

As you can see from the above picture, the process we are following here is quite similar to the one we have used in almost all the tutorials so far.

We will start off by creating the input CSV files first.

  • CREATING INPUT CSV FILES IN LOCAL FILE SYSTEM

As we have done in the past, the input CSV files are created with the help of the vi editor in the local file system.

Since two data sets are involved in this tutorial, we are going to create two input CSV files in the local file system.

I have uploaded both of these files to my GitHub profile under the HDPCD repository. The names of these two files are 29_customers_input.csv and 30_orders_input.csv.

The file post22_customers.csv looks as follows, and you can download it by clicking here.


3098,Mary,Smith,XXXXXXXXX,XXXXXXXXX,8217 Fallen Panda Walk,Newburgh,NY,12550
3099,Brittany,Copeland,XXXXXXXXX,XXXXXXXXX,5735 Round Beacon Terrace,Caguas,PR,00725
3100,Mary,Smith,XXXXXXXXX,XXXXXXXXX,5436 Grand Hickory Farm,Huntington Park,CA,90255
3101,George,Reyes,XXXXXXXXX,XXXXXXXXX,8702 Silver Apple Square,Mission Viejo,CA,92692
3102,Ralph,Dixon,XXXXXXXXX,XXXXXXXXX,5633 Harvest Turnabout,Caguas,PR,00725
3103,Mary,Wilkins,XXXXXXXXX,XXXXXXXXX,1213 Cotton Pike,Spring Valley,NY,10977
3104,Megan,Smith,XXXXXXXXX,XXXXXXXXX,5292 Shady Pony Cape,Caguas,PR,00725
3105,Mary,Stone,XXXXXXXXX,XXXXXXXXX,8510 Green River Acres,Toa Baja,PR,00949
3106,Samantha,Smith,XXXXXXXXX,XXXXXXXXX,355 Cozy Square,Las Cruces,NM,88005
3107,Tiffany,Estes,XXXXXXXXX,XXXXXXXXX,5182 Cotton Heath,Caguas,PR,00725
3108,Mary,Smith,XXXXXXXXX,XXXXXXXXX,577 Rustic Nectar Row,Houston,TX,77083
3109,Jack,James,XXXXXXXXX,XXXXXXXXX,5876 Burning Mall,Fort Worth,TX,76133


We can use the traditional vi editor for creating this file. Please use the following commands for doing this.

vi post22_customers.csv

#####
PASTE THE COPIED CONTENTS HERE
#####

cat post22_customers.csv

The following screenshot might come in handy for running these commands.

Step 1: creating customers input file in local file system

Now, it is time for the second input file.

The file post22_orders.csv looks as follows, and you can download it by clicking here.


49354,2014-05-28 00:00:00.0,3091,PENDING
52516,2014-06-19 00:00:00.0,3098,CLOSED
52736,2014-06-20 00:00:00.0,3094,COMPLETE
53505,2014-06-27 00:00:00.0,3099,PROCESSING
55795,2014-07-13 00:00:00.0,3099,COMPLETE
55938,2014-07-14 00:00:00.0,3095,ON_HOLD
56678,2014-07-18 00:00:00.0,3091,PENDING_PAYMENT
57402,2014-07-22 00:00:00.0,3095,PENDING
57513,2014-07-23 00:00:00.0,3097,COMPLETE
57942,2013-07-31 00:00:00.0,3095,PENDING_PAYMENT
58639,2013-08-27 00:00:00.0,3090,COMPLETE
60178,2013-10-28 00:00:00.0,3090,COMPLETE
60594,2013-11-14 00:00:00.0,3099,PENDING_PAYMENT


The same vi editor is used for creating this input file.

vi post22_orders.csv

#####
PASTE THE COPIED CONTENTS HERE
#####

cat post22_orders.csv

Please have a look at the screenshot below for the execution of the above commands.

Step 2: creating orders input file in local file system
  • PUSHING INPUT CSV FILES TO HDFS

Now is the time to push these input CSV files to HDFS.

We are going to use the following set of commands for doing this.

hadoop fs -mkdir /hdpcd/input/post22
hadoop fs -put post22_customers.csv /hdpcd/input/post22
hadoop fs -put post22_orders.csv /hdpcd/input/post22
hadoop fs -cat /hdpcd/input/post22/post22_customers.csv
hadoop fs -cat /hdpcd/input/post22/post22_orders.csv

The following screenshot might come in handy for keeping track of the execution of the above commands.

Step 3: pushing input CSV files from local file system to HDFS

Now that both the input CSV files are pushed to HDFS, it is time to create the pig script responsible for performing the JOIN operation between customers and orders data.

  • CREATING PIG SCRIPT FOR JOIN OPERATION

I have uploaded this file to my GitHub profile under the HDPCD repository. The name of this file is 31_join_operation.pig.

This pig script looks as follows and you can download this file by clicking here.

--JOIN OPERATION IN APACHE PIG
--loading customers' data in customers relation
customers = LOAD '/hdpcd/input/post22/post22_customers.csv' USING PigStorage(',');
--loading orders' data in orders relation
orders = LOAD '/hdpcd/input/post22/post22_orders.csv' USING PigStorage(',');
--performing join operation based on customer ID
--customer ID is the first column in customers relation, therefore $0
--customer ID is the third column in orders relation, therefore $2
joined_data = JOIN customers BY $0, orders BY $2;
--generating output data with FOREACH...GENERATE command
--output contains customers' first name, last name, order ID, and payment status of the order
output_data = FOREACH joined_data GENERATE $1 AS fname, $2 AS lname, $8 AS orderid, $12 AS payment_status;
--storing the final output in HDFS
STORE output_data INTO '/hdpcd/output/post22/';


Let us go through each line in this pig script.

customers = LOAD '/hdpcd/input/post22/post22_customers.csv' USING PigStorage(',');

The above LOAD command loads the data stored in the post22_customers.csv file into a pig relation called customers.

orders = LOAD '/hdpcd/input/post22/post22_orders.csv' USING PigStorage(',');

The above LOAD command loads the data stored in the post22_orders.csv file into a pig relation called orders.

joined_data = JOIN customers BY $0, orders BY $2;

The above JOIN command contains the gist of this tutorial. The JOIN command performs the JOIN operation in Apache Pig, and the syntax is quite simple: you list all the relations you want to be part of the JOIN operation along with the columns on which to join, referenced by their positional indices in the customers and orders relations. The result is stored in the pig relation called joined_data.
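To make the mechanics concrete, here is a minimal Python sketch of the same inner-join semantics. The tuples below are made-up miniature rows (with fewer fields than the real data) standing in for the two Pig relations:

```python
# Hypothetical miniature relations standing in for the real data.
customers = [
    (3098, "Mary", "Smith"),
    (3099, "Brittany", "Copeland"),
]
orders = [
    (52516, "2014-06-19 00:00:00.0", 3098, "CLOSED"),
    (53505, "2014-06-27 00:00:00.0", 3099, "PROCESSING"),
    (49354, "2014-05-28 00:00:00.0", 3091, "PENDING"),  # no matching customer
]

# JOIN customers BY $0, orders BY $2: keep only the pairs whose keys match
# and concatenate the matching tuples, as Pig's inner join does.
joined_data = [c + o for c in customers for o in orders if c[0] == o[2]]
```

Note that the unmatched order (customer 3091) is dropped, which is why an inner join can produce fewer records than either input.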

output_data = FOREACH joined_data GENERATE $1 AS fname, $2 AS lname, $8 AS orderid, $12 AS payment_status;

The above FOREACH statement iterates over the joined_data pig relation. Since the final output should contain only the first name, last name, order ID, and payment status, only those columns are extracted from joined_data. The result is stored in the output_data pig relation.
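The positional projection can again be sketched in Python. The row below is a toy joined row with fewer columns than the real data, so the indices here are illustrative only and do not map one-to-one onto the $8/$12 used in the script:

```python
# Toy joined row: three customer fields followed by four order fields.
# The real joined rows carry more columns, so the script's $8/$12 refer
# to different positions than the indices used in this sketch.
joined_row = ("3098", "Mary", "Smith",
              "52516", "2014-06-19 00:00:00.0", "3098", "CLOSED")

# FOREACH joined_data GENERATE $1, $2, ... is positional projection:
output_row = (joined_row[1], joined_row[2], joined_row[3], joined_row[6])
```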

STORE output_data INTO '/hdpcd/output/post22/';

The output_data pig relation is then stored in the HDFS directory /hdpcd/output/post22 with the help of the STORE command.

I believe the above explanation is sufficient to move forward with the creation and execution of the pig script.

You can use the following commands to create this pig script.

vi post22.pig

#####
PASTE THE COPIED CONTENTS HERE
#####

cat post22.pig

The screenshot below might come in handy for doing this.

Step 4: creating pig script for join operation

It is time to run this pig script.

  • RUNNING THE PIG SCRIPT

You can use the following command to run this pig script.

pig -x tez post22.pig

The pig script starts the execution in the following way.

Step 5: pig script execution command

And the output of this pig script looks as follows.

Step 5: pig script execution output

You can see from the above screenshot that the operation was successful. A total of 12 records were read from the customers file and 13 records were read from the orders file. The JOIN operation produced only 4 records in the output HDFS directory because there were only 4 matching customer IDs across the two files.
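You can sanity-check that count in Python using the customer IDs visible in the two listings above:

```python
# Customer IDs from the customers listing (3098 through 3109).
customer_ids = set(range(3098, 3110))

# Customer IDs (third field) from the orders listing.
order_customer_ids = [3091, 3098, 3094, 3099, 3099, 3095, 3091,
                      3095, 3097, 3095, 3090, 3090, 3099]

# Only orders whose customer ID appears in the customers file survive the join.
matches = [cid for cid in order_customer_ids if cid in customer_ids]
# 3098 matches once and 3099 matches three times, giving 4 joined records.
```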

Now, the last thing remaining is to view the data stored in HDFS output directory.

  • OUTPUT HDFS DIRECTORY

For viewing the data stored in the output HDFS directory, we are going to use the following two commands.

hadoop fs -ls /hdpcd/output/post22
hadoop fs -cat /hdpcd/output/post22/part-v002-o000-r-00000

The following screenshot shows the output of these two commands.

Step 6: output HDFS directory contents

As you can see from the above screenshot, the output file contains a total of 4 records with the first name, last name, order ID, and payment status, exactly as expected.

With this, we can say that the objective of this tutorial is met and we can conclude this section here.

I hope the explanations and screenshots are helping you understand the concepts in each tutorial. Stay tuned for further updates.

In the next tutorial, we are going to perform a replicated join using Apache Pig.

Please visit my LinkedIn profile here. Like my Facebook page here and follow me on Twitter here. You can subscribe to my YouTube channel here.

See you soon.

Cheers!

 

Published by milindjagre

I founded my blog www.milindjagre.co four years ago and am currently working as a Data Scientist Analyst at the Ford Motor Company. I graduated from the University of Connecticut with a Master of Science in Business Analytics and Project Management. I am working hard and learning a lot of new things in the field of Data Science. I am a strong believer in constant, directional effort, with teamwork as the highest priority. Please reach out to me at milindjagre@gmail.com for further information. Cheers!
