Post 45 | HDPCD | Join two Hive tables

 

Hello, everyone. Welcome to one more tutorial in the HDPCD certification series.

In the last tutorial, we saw how to insert a new row into a Hive table.

In this tutorial, we are going to see how to join two Hive tables.

Let us begin, then.

Apache Hive: Performing join operation
Apache Hive: Performing join operation

The above info-graphics show the step by step process of performing the JOIN operation in Apache Hive.

Let’s start doing the JOIN, then.

  • CREATING INPUT CSV FILES IN THE LOCAL FILE SYSTEM

We use the vi editor to create input CSV files in the local file system.

Here, we are going to create two input files, since, we are joining two tables in this tutorial.

The first file, named, post45_a.csv, contains the customer information, such as, customer ID, customer name, and customer gender.

I have uploaded this input CSV file to my GitHub profile under HDPCD repository. The name of this file is “56_first_input_file_for_join.csv“. You can download/save this file by clicking here and it looks as follows.


1 tina F
2 jerry M
3 tom M
4 wonder woman F
5 scoobydoo M
6 donald duck M
7 pink panther F
8 oggy M
9 shinchan M

view raw
post45_a.csv
hosted with ❤ by GitHub

You can use the following commands for creating this file.

vi post45_a.csv

###############################
# PASTE COPIED CONTENTS HERE #
###############################

cat post45_a.csv

The output of the above commands look as follows.

Step 1: Creating first input file in the local file system
Step 1: Creating first input file in the local file system

As you can see from the above screenshot, input CSV file post45_a.csv was created successfully in the local file system.

We will follow the same process for creating the second input CSV file.

The name of this file is post45_b.csv and it contains the order information by the customers. It contains fields such as customer ID, timestamp, order amount, and the order status.

I have uploaded this input CSV file to my GitHub profile under HDPCD repository. The name of this file is “57_second_input_file_for_join.csv“. You can download/save this file by clicking here and it looks as follows.


1 2013-07-25 00:00:00.0 11599 CLOSED
2 2013-07-25 00:00:00.0 256 PENDING_PAYMENT
3 2013-07-25 00:00:00.0 12111 COMPLETE
4 2013-07-25 00:00:00.0 8827 CLOSED
5 2013-07-25 00:00:00.0 11318 COMPLETE
6 2013-07-25 00:00:00.0 7130 COMPLETE
7 2013-07-25 00:00:00.0 4530 COMPLETE
8 2013-07-25 00:00:00.0 2911 PROCESSING
9 2013-07-25 00:00:00.0 5657 PENDING_PAYMENT
10 2013-07-25 00:00:00.0 5648 PENDING_PAYMENT
11 2013-07-25 00:00:00.0 918 PAYMENT_REVIEW
12 2013-07-25 00:00:00.0 1837 CLOSED
13 2013-07-25 00:00:00.0 9149 PENDING_PAYMENT
14 2013-07-25 00:00:00.0 9842 PROCESSING
15 2013-07-25 00:00:00.0 2568 COMPLETE
16 2013-07-25 00:00:00.0 7276 PENDING_PAYMENT
17 2013-07-25 00:00:00.0 2667 COMPLETE
18 2013-07-25 00:00:00.0 1205 CLOSED
19 2013-07-25 00:00:00.0 9488 PENDING_PAYMENT
20 2013-07-25 00:00:00.0 9198 PROCESSING
21 2013-07-25 00:00:00.0 2711 PENDING
22 2013-07-25 00:00:00.0 333 COMPLETE
23 2013-07-25 00:00:00.0 4367 PENDING_PAYMENT
24 2013-07-25 00:00:00.0 11441 CLOSED
25 2013-07-25 00:00:00.0 9503 CLOSED
26 2013-07-25 00:00:00.0 7562 COMPLETE
27 2013-07-25 00:00:00.0 3241 PENDING_PAYMENT
28 2013-07-25 00:00:00.0 656 COMPLETE
29 2013-07-25 00:00:00.0 196 PROCESSING
30 2013-07-25 00:00:00.0 10039 PENDING_PAYMENT

view raw
post45_b.csv
hosted with ❤ by GitHub

You can use the following commands for creating this file.

vi post45_b.csv

###############################
# PASTE COPIED CONTENTS HERE #
###############################

cat post45_b.csv

The output of the above commands look as follows.

Step 2: Creating second input file in the local file system
Step 2: Creating second input file in the local file system

As you can see from the above screenshot, input CSV file post45_b.csv was created successfully in the local file system.

Now, we will move on to the Hive operations.

  • CHECKING PRE-EXISTENCE OF THE HIVE TABLES

We use the “show tables;” command to get the list of tables in the Hive database. This list reveals whether our tables already exist or not.

show tables;

The output of the above command is as follows.

Step 3: Checking pre-existence of the Hive table
Step 3: Checking pre-existence of the Hive table

As you can see from the above screenshot, Hive does not contain tables with names post45_a and post45_b. Therefore, we can go ahead and create those tables with same schema as the input CSV files.

  • CREATING TABLES IN ACCORDANCE WITH THE INPUT CSV FILES

We are going to create the two tables as we want to perform the JOIN operation.

The first table is post45_a, containing customer information and having three columns.

I have uploaded this SQL file to my GitHub profile under HDPCD repository. The name of this file is “58_first_hive_table_for_join.sql“. You can download/save this file by clicking here and it looks as follows.

create table post45_a (
id int,
name string,
gender string
)
row format delimited
fields terminated by ','
stored as textfile;

view raw
post45_a.sql
hosted with ❤ by GitHub

The output of the above commands look as follows.

Step 4: Creating the first Hive table with customer information
Step 4: Creating the first Hive table with customer information

As you can see from the above screenshot, the Hive table post45_a was created successfully. We will follow the same process for creating the other table post45_b.

The second table is post45_b, containing order information and having four columns.

I have uploaded this SQL file to my GitHub profile under HDPCD repository. The name of this file is “59_second_hive_table_for_join.sql“. You can download/save this file by clicking here and it looks as follows.

create table post45_b (
order_id int,
order_date string,
order_amt int,
order_status string
)
row format delimited
fields terminated by ','
stored as textfile;

view raw
post45_b.sql
hosted with ❤ by GitHub

The output of the above commands look as follows.

Step 5: Creating second table with order information
Step 5: Creating second table with order information

As you can see from the above screenshot, the Hive table post45_b was created successfully. Now, it is time to confirm that the Hive tables were created and that also with the correct schema.

  • CONFIRM TABLE EXISTENCE AND SCHEMA

We use the same “show tables;” command to check the existence of the Hive tables.

show tables;

The output of the above command is as follows.

Step 6: Confirming that the Hive tables got created successfully
Step 6: Confirming that the Hive tables got created successfully

As you can see from the above screenshot, both post45_a and post45_b tables were created successfully.

Let us check their schema as well. We use the following commands for doing this.

desc formatted post45_a;
desc formatted post45_b;
The output of the above command is as follows.
Step 7: Checking the schema of the customer information table
Step 7: Checking the schema of the customer information table

The above screenshot shows that the Hive table post45_a was created with the correct schema and it is safe to do the data load operation.

 

Step 8: Checking the schema of the order information table
Step 8: Checking the schema of the order information table

The above screenshot shows that the Hive table post45_b was created with the correct schema and it is safe to do the data load operation.

Let us load the data in these Hive tables, now.

  • LOAD INPUT CSV FILES IN HIVE TABLES

We use the “LOAD” command for loading the data into the Hive tables.

load data local inpath ‘post45_a.csv’ into table post45_a;
The output of the above command is as follows.
Step 9: Loading the data into the customer information table
Step 9: Loading the data into the customer information table

The above screenshot shows that the data was loaded successfully into the Hive table post45_a.

load data local inpath ‘post45_b.csv’ into table post45_b;
The output of the above command is as follows.
Step 10: Loading the data into the order information table
Step 10: Loading the data into the order information table

The above screenshot shows that the data was loaded successfully into the Hive table post45_b.

Let us check records stored in these Hive tables.

  • CONFIRM DATA GOT LOADED SUCCESSFULLY

We use the “SELECT” command to check the records in the Hive tables.

select * from post45_a;
The output of the above command is as follows.
Step 11: Checking the data got loaded successfully into the customer information table
Step 11: Checking the data got loaded successfully into the customer information table

The above screenshot indicates that the data was loaded successfully into the Hive table post45_a and there was no data loss.

And for the second table,

select * from post45_b;
The output of the above command is
Step 12: Checking the data got loaded successfully into the order information table
Step 12: Checking the data got loaded successfully into the order information table

The above screenshot indicates that the data was loaded successfully into the Hive table post45_b and there was no data loss.

This brings us to the last step in this tutorial and that is to perform the JOIN operation.

  • PERFORM THE JOIN OPERATION

We use the “JOIN” keyword for performing the JOIN operation. Based on the join type, you can use INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, and FULL OUTER JOIN keywords. For more information, please click here.

In this case, we are performing the JOIN operation based on the CUSTOMER ID which is the first column in case of both the tables.

Our objective is to extract order details of only those customers whose data is present in the customer (post45_a) table. Therefore, we should get order details of customers with Customer ID from 1 to 9 (both inclusive).

The command to do this is as follows.

select a.name,a.gender,b.order_date,b.order_status from post45_a a join post45_b b on (a.id = b.order_id);
The output of the above command is as follows.
Step 13: Performing the JOIN operation between Hive tables - customer and order
Step 13: Performing the JOIN operation between Hive tables – customer and order

As you can see, the above “JOIN” query triggers a MapReduce/TeZ job.

We get the result of this query after this query executes successfully. The output is shown in the above screenshot.

The customers with customer ID from 1 to 9 is shown in the screenshot, as expected, along with their order details.

This enables us to safely say that the objective of this tutorial is met and we can conclude this tutorial here.

In the next tutorial, we are going to see how to run a Hive query using TEZ as the execution engine.

If you liked the content, share it with your friends. Please hit the Like button.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

 

 

Published by milindjagre

I founded my blog www.milindjagre.co four years ago and am currently working as a Data Scientist Analyst at the Ford Motor Company. I graduated from the University of Connecticut pursuing Master of Science in Business Analytics and Project Management. I am working hard and learning a lot of new things in the field of Data Science. I am a strong believer of constant and directional efforts keeping the teamwork at the highest priority. Please reach out to me at milindjagre@gmail.com for further information. Cheers!

One thought on “Post 45 | HDPCD | Join two Hive tables

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: