Hey, everyone. Thanks for the overwhelming response the blog posts have been receiving over the last week. I really appreciate it, and I will keep posting interesting and innovative content for you.
In the last tutorial, we saw two ways to use the parallel features of Apache Pig. In this tutorial, we are going to focus on the JOIN operation and walk through the process of performing a JOIN between two datasets to produce the required output.
Let us start then.

As you can see from the above picture, the process we are following here is quite similar to the one we have used in almost all the tutorials so far.
We will start off by creating the input CSV files first.
- CREATING INPUT CSV FILES IN LOCAL FILE SYSTEM
As in the past tutorials, the input CSV files are created with the help of the vi editor in the local file system.
Since two datasets are involved in this tutorial, we are going to create two input CSV files.
I have uploaded both of these files to my GitHub profile under the HDPCD repository, with the names “29_customers_input.csv“ and “30_orders_input.csv“.
The file post22_customers.csv looks as follows, and you can download it by clicking here.
3098,Mary,Smith,XXXXXXXXX,XXXXXXXXX,8217 Fallen Panda Walk,Newburgh,NY,12550
3099,Brittany,Copeland,XXXXXXXXX,XXXXXXXXX,5735 Round Beacon Terrace,Caguas,PR,00725
3100,Mary,Smith,XXXXXXXXX,XXXXXXXXX,5436 Grand Hickory Farm,Huntington Park,CA,90255
3101,George,Reyes,XXXXXXXXX,XXXXXXXXX,8702 Silver Apple Square,Mission Viejo,CA,92692
3102,Ralph,Dixon,XXXXXXXXX,XXXXXXXXX,5633 Harvest Turnabout,Caguas,PR,00725
3103,Mary,Wilkins,XXXXXXXXX,XXXXXXXXX,1213 Cotton Pike,Spring Valley,NY,10977
3104,Megan,Smith,XXXXXXXXX,XXXXXXXXX,5292 Shady Pony Cape,Caguas,PR,00725
3105,Mary,Stone,XXXXXXXXX,XXXXXXXXX,8510 Green River Acres,Toa Baja,PR,00949
3106,Samantha,Smith,XXXXXXXXX,XXXXXXXXX,355 Cozy Square,Las Cruces,NM,88005
3107,Tiffany,Estes,XXXXXXXXX,XXXXXXXXX,5182 Cotton Heath,Caguas,PR,00725
3108,Mary,Smith,XXXXXXXXX,XXXXXXXXX,577 Rustic Nectar Row,Houston,TX,77083
3109,Jack,James,XXXXXXXXX,XXXXXXXXX,5876 Burning Mall,Fort Worth,TX,76133
We can use the traditional vi editor to create this file. Please use the following commands to do this.
vi post22_customers.csv
#####
PASTE THE COPIED CONTENTS HERE
#####
cat post22_customers.csv
The following screenshot might come in handy for running these commands.

Now, it is time for the second input file.
The file post22_orders.csv looks as follows, and you can download it by clicking here.
49354,2014-05-28 00:00:00.0,3091,PENDING
52516,2014-06-19 00:00:00.0,3098,CLOSED
52736,2014-06-20 00:00:00.0,3094,COMPLETE
53505,2014-06-27 00:00:00.0,3099,PROCESSING
55795,2014-07-13 00:00:00.0,3099,COMPLETE
55938,2014-07-14 00:00:00.0,3095,ON_HOLD
56678,2014-07-18 00:00:00.0,3091,PENDING_PAYMENT
57402,2014-07-22 00:00:00.0,3095,PENDING
57513,2014-07-23 00:00:00.0,3097,COMPLETE
57942,2013-07-31 00:00:00.0,3095,PENDING_PAYMENT
58639,2013-08-27 00:00:00.0,3090,COMPLETE
60178,2013-10-28 00:00:00.0,3090,COMPLETE
60594,2013-11-14 00:00:00.0,3099,PENDING_PAYMENT
The same vi editor is used for creating this input file.
vi post22_orders.csv
#####
PASTE THE COPIED CONTENTS HERE
#####
cat post22_orders.csv
Please have a look at the below screenshot for the execution of the above commands.

- PUSHING INPUT CSV FILES TO HDFS
Now is the time to push these input CSV files to HDFS.
We are going to use the following set of commands for doing this.
hadoop fs -mkdir /hdpcd/input/post22
hadoop fs -put post22_customers.csv /hdpcd/input/post22
hadoop fs -put post22_orders.csv /hdpcd/input/post22
hadoop fs -cat /hdpcd/input/post22/post22_customers.csv
hadoop fs -cat /hdpcd/input/post22/post22_orders.csv
The following screenshot might come in handy to keep track of the execution of the above commands.

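As a side note, the same checks can also be run without leaving Pig: the Grunt shell accepts HDFS fs commands directly. This is a minimal sketch assuming the paths we just created above.
pig -x tez
grunt> fs -ls /hdpcd/input/post22
grunt> fs -cat /hdpcd/input/post22/post22_customers.csv
grunt> quit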
Now that both input CSV files are in HDFS, it is time to create the Pig script responsible for performing the JOIN operation between the customers and orders data.
- CREATING PIG SCRIPT FOR JOIN OPERATION
I have uploaded this file to my GitHub profile under the HDPCD repository, with the name “31_join_operation.pig“.
This Pig script looks as follows, and you can download it by clicking here.
--JOIN OPERATION IN APACHE PIG

--loading customers' data into the customers relation
customers = LOAD '/hdpcd/input/post22/post22_customers.csv' USING PigStorage(',');

--loading orders' data into the orders relation
orders = LOAD '/hdpcd/input/post22/post22_orders.csv' USING PigStorage(',');

--performing the JOIN operation based on customer ID
--customer ID is the first column in the customers relation, therefore $0
--customer ID is the third column in the orders relation, therefore $2
joined_data = JOIN customers BY $0, orders BY $2;

--generating the output data with the FOREACH...GENERATE command
--the output contains the customer's first name, last name, order ID, and the payment status of the order
--after the join, the 9 customer columns occupy $0 to $8, so the order ID is $9 and the order status is $12
output_data = FOREACH joined_data GENERATE $1 AS fname, $2 AS lname, $9 AS orderid, $12 AS payment_status;

--storing the final output in HDFS
STORE output_data INTO '/hdpcd/output/post22/';
Let us go through this Pig script line by line.
customers = LOAD '/hdpcd/input/post22/post22_customers.csv' USING PigStorage(',');
The above LOAD command loads the data stored in the post22_customers.csv file into a Pig relation called customers. PigStorage(',') tells Pig that the fields are comma-delimited.
orders = LOAD '/hdpcd/input/post22/post22_orders.csv' USING PigStorage(',');
The above LOAD command loads the data stored in the post22_orders.csv file into a Pig relation called orders.
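As a side note, both LOAD statements leave the schema undeclared, which is why the rest of the script refers to columns positionally ($0, $1, and so on). A schema can optionally be declared in the LOAD statement itself so that the columns get readable names. The column names and types below are my own assumption based on the sample data, not something defined by the input files:
customers = LOAD '/hdpcd/input/post22/post22_customers.csv' USING PigStorage(',')
    AS (customer_id:int, fname:chararray, lname:chararray, email:chararray,
        password:chararray, street:chararray, city:chararray,
        state:chararray, zipcode:chararray);
orders = LOAD '/hdpcd/input/post22/post22_orders.csv' USING PigStorage(',')
    AS (order_id:int, order_date:chararray, order_customer_id:int,
        order_status:chararray);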
joined_data = JOIN customers BY $0, orders BY $2;
The above JOIN command contains the gist of this tutorial. The JOIN operator performs an inner join in Apache Pig, and the syntax is quite simple: you list every relation that should take part in the join, along with the column on which each relation is joined. Here the columns are referred to by their positional indices, $0 in the customers relation and $2 in the orders relation, both of which hold the customer ID. The result is stored in the Pig relation called joined_data.
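If the relations were loaded with the assumed schema shown earlier, the same join could be written with column names instead of positional indices, which reads more naturally:
joined_data = JOIN customers BY customer_id, orders BY order_customer_id;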
output_data = FOREACH joined_data GENERATE $1 AS fname, $2 AS lname, $9 AS orderid, $12 AS payment_status;
The above FOREACH statement iterates over the joined_data relation. Since the final output should contain only the first name, last name, order ID, and payment status, only those four columns are projected out of joined_data. The result is stored in the output_data relation.
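Under the same assumed schema, the projection can also be written with names. Note that after a JOIN, a field name may exist in both source relations, so Pig's disambiguation operator :: is used to qualify each field with the relation it came from; this is a sketch, not the statement we actually run here:
output_data = FOREACH joined_data GENERATE customers::fname AS fname,
    customers::lname AS lname, orders::order_id AS orderid,
    orders::order_status AS payment_status;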
STORE output_data INTO '/hdpcd/output/post22/';
The output_data relation is then written to the HDFS directory /hdpcd/output/post22 with the help of the STORE command.
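One thing worth knowing: when no storage function is given, STORE uses PigStorage with its default tab delimiter, so the output files will be tab-separated. If you want the output back in comma-separated form, you can pass the delimiter explicitly, for example:
STORE output_data INTO '/hdpcd/output/post22/' USING PigStorage(',');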
I believe the above explanation is sufficient to move forward with the creation and execution of the Pig script.
You can use the following commands to create this Pig script.
vi post22.pig
#####
PASTE THE COPIED CONTENTS HERE
#####
cat post22.pig
The below screenshot might come in handy for doing this.

It is time to run this Pig script.
- RUNNING THE PIG SCRIPT
You can use the following command to run this Pig script.
pig -x tez post22.pig
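The -x tez flag selects the Tez execution engine. The same script should also run on the classic MapReduce engine, and there is a local mode for quick experiments; note that local mode resolves paths against the local file system, so the HDFS paths in this script would need local equivalents.
pig -x mapreduce post22.pig
pig -x local post22.pig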
With the Tez engine, the script starts executing as follows.

And the output of this Pig script looks as follows.

You can see from the above screenshot that the operation was successful. A total of 12 records were read from the customers file and 13 records from the orders file. The JOIN operation produced only 4 records in the output HDFS directory, because only 4 order records have a matching customer ID in both files: customer 3098 placed one of the listed orders and customer 3099 placed three.
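If you want to confirm that record count from inside Pig rather than from the job summary, a GROUP ... ALL followed by COUNT is a common trick. This is a hedged sketch that could be appended to the script before the STORE statement:
--group all output tuples into a single bag, then count them
grouped = GROUP output_data ALL;
record_count = FOREACH grouped GENERATE COUNT(output_data);
DUMP record_count;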
Now, the last thing remaining is to view the data stored in the output HDFS directory.
- OUTPUT HDFS DIRECTORY
To view the data stored in the output HDFS directory, we are going to use the following two commands.
hadoop fs -ls /hdpcd/output/post22
hadoop fs -cat /hdpcd/output/post22/part-v002-o000-r-00000
The following screenshot shows the output of these two commands.

As you can see from the above screenshot, the output file contains a total of 4 records, and each record holds the first name, last name, order ID, and payment status, exactly as expected.
With this, we can say that the objective of this tutorial is met, and we can conclude this section here.
I hope the explanations and screenshots are helping you understand the concepts in each tutorial. Stay tuned for further updates.
In the next tutorial, we are going to perform the replicated join using Apache Pig.
Please visit my LinkedIn profile here, like my Facebook page here, and follow me on Twitter here. You can subscribe to my YouTube channel here.
See you soon.
Cheers!