Hey everyone, thank you once again for coming back to follow these tutorials.
In the last tutorial, we saw how to perform a simple JOIN operation, and in this tutorial, we are going to perform the REPLICATED JOIN operation. The process is almost the same, with a difference in only one place, so there is not much new to worry about.
In the certification, they will specifically mention when you have to perform the Replicated JOIN. If they do not mention any type of join, then please use the previous tutorial to perform the simple JOIN operation.
The following infographic shows the process of performing the Replicated JOIN operation.

The above picture clearly shows the process of performing this Replicated JOIN in Apache Pig. So, let us get started with the steps, one at a time.
- CREATING INPUT CSV FILES IN LOCAL FILE SYSTEM
We are going to use the traditional vi editor to create the input CSV files in the local file system.
The first file contains the customers’ data and is named post23_customers.csv. I have uploaded this input file to my GitHub profile under the HDPCD repository with the name “32_customers_input.csv“. You can download this input CSV file by clicking here; it looks something like this.
As you can see from the above snippet, this file is identical to the one we saw in the previous tutorial.
Please use the following commands to create this input CSV file.
vi post23_customers.csv
#####
PASTE THE COPIED CONTENTS HERE
#####
cat post23_customers.csv
The following screenshot shows the output of the above commands.

The above screenshot shows that the customers’ input CSV file was created successfully in the local file system.
It is time to create the data file containing the orders information.
I have uploaded this input CSV file to my GitHub profile under the HDPCD repository with the name “33_orders_input.csv“. It looks as follows, and you can download it by clicking here.
You can use the following commands to create this input CSV file in the local file system.
vi post23_orders.csv
#####
PASTE THE COPIED CONTENTS HERE
#####
cat post23_orders.csv
The output of these commands looks as follows.

The above screenshot clearly shows that this input CSV file was created successfully in the local file system.
The next logical step is to push these two files to HDFS.
- PUSHING CUSTOMERS AND ORDERS DATA TO HDFS
Please use the following commands to load these two input CSV files to HDFS.
hadoop fs -mkdir /hdpcd/input/post23
hadoop fs -put post23_customers.csv /hdpcd/input/post23
hadoop fs -put post23_orders.csv /hdpcd/input/post23
hadoop fs -cat /hdpcd/input/post23/post23_customers.csv
hadoop fs -cat /hdpcd/input/post23/post23_orders.csv
The following screenshot shows the output of the above commands.

The above screenshot shows that these two files were pushed to HDFS successfully.
The next step is to create the pig script to perform the REPLICATED JOIN between these two input CSV files.
- CREATING PIG SCRIPT TO PERFORM THE REPLICATED JOIN
The pig script for this tutorial is almost identical to the previous one, with TWO ADDITIONAL KEYWORDS in the JOIN command.
This pig script is uploaded to my GitHub profile under HDPCD repository with name “34_replicated_join.pig“. You can download this pig script by clicking here and it looks as follows.
The explanation and functionality of all the commands are the same as in the last tutorial, so you can refer to that tutorial for a walkthrough of this pig script.
The only command that differs is the following.
joined_data = JOIN customers BY $0, orders BY $2 USING 'replicated';
In the above command, the keywords USING 'replicated' indicate that this is a REPLICATED JOIN operation. Pig therefore performs a fragment-replicate (map-side) join instead of the normal reduce-side join: every relation after the first one in the JOIN statement (here, orders) is loaded into memory on each map task, which means those relations must be small enough to fit in memory.
I hope this explanation is enough to go ahead and create the pig script.
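For reference, the overall script can be sketched as follows. This is a hypothetical reconstruction based on the file paths and join keys used in this walkthrough; the actual “34_replicated_join.pig“ on GitHub may differ in its details (for example, it may declare column schemas in the LOAD statements).

```pig
-- Hedged sketch of the replicated join script; paths and keys follow this tutorial.
customers = LOAD '/hdpcd/input/post23/post23_customers.csv' USING PigStorage(',');
orders = LOAD '/hdpcd/input/post23/post23_orders.csv' USING PigStorage(',');

-- 'replicated' loads orders into memory on each map task (fragment-replicate join)
joined_data = JOIN customers BY $0, orders BY $2 USING 'replicated';

STORE joined_data INTO '/hdpcd/output/post23' USING PigStorage(',');
```

Note that in a replicated join the relations listed after the first one are the ones held in memory, so the larger relation should come first in the JOIN statement.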
We can use the vi editor to create this pig script by using the following commands.
vi post23.pig
#####
PASTE THE COPIED CONTENTS HERE
#####
cat post23.pig
The following screenshot shows the execution of these commands.

Now that the pig script is created, it is time to run this pig script.
- RUNNING PIG SCRIPT TO PERFORM THE REPLICATED JOIN
Please use the following command to run this pig script.
pig -x tez post23.pig
The following screenshot shows the execution process of this pig script.

And the output window of the pig script execution looks as follows.

From the above screenshot, you can see that the operation completed successfully. A total of 25 records were read from the post23_customers.csv file and 13 records from the post23_orders.csv file. The join produced 4 output records, as expected.
Now, let us go to HDFS and view the contents of the output HDFS directory.
- HDFS OUTPUT DIRECTORY CONTENTS
Please use the following commands to check the output contents stored in the HDFS directory /hdpcd/output/post23.
hadoop fs -ls /hdpcd/output/post23
hadoop fs -cat /hdpcd/output/post23/part-v001-o000-r-00000
And the output of these two commands looks as follows.

From the above screenshot, you can see that a total of 4 records were successfully created in the output HDFS directory.
This concludes the tutorial. I hope you were able to follow all the steps and achieve the objective of the tutorial.
You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.
Stay tuned. Cheers!