Hello, everyone. Welcome to one more tutorial in the HDPCD certification series.
In the last tutorial, we saw how to enable vectorization in Hive.
In this tutorial, we are going to see how to run a subquery within a Hive query.
Let us begin, then.

As you can see from the above screenshot, achieving this objective involves the following set of steps.
We will cover each of these steps as follows.
- CREATING INPUT CSV FILE IN THE LOCAL FILE SYSTEM
As already seen in a lot of my tutorials, we are going to use the VI EDITOR to create the input CSV file in the local file system.
I have uploaded this input CSV file to my GitHub profile under the HDPCD repository. The file is named “60_input_file_for_subquery.csv” and you can save/download it by clicking here. It looks as follows.
1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
5,2013-07-25 00:00:00.0,11318,COMPLETE
6,2013-07-25 00:00:00.0,7130,COMPLETE
7,2013-07-25 00:00:00.0,4530,COMPLETE
8,2013-07-25 00:00:00.0,2911,PROCESSING
9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT
10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT
11,2013-07-25 00:00:00.0,918,PAYMENT_REVIEW
12,2013-07-25 00:00:00.0,1837,CLOSED
13,2013-07-25 00:00:00.0,9149,PENDING_PAYMENT
14,2013-07-25 00:00:00.0,9842,PROCESSING
15,2013-07-25 00:00:00.0,2568,COMPLETE
16,2013-07-25 00:00:00.0,7276,PENDING_PAYMENT
17,2013-07-25 00:00:00.0,2667,COMPLETE
18,2013-07-25 00:00:00.0,1205,CLOSED
19,2013-07-25 00:00:00.0,9488,PENDING_PAYMENT
20,2013-07-25 00:00:00.0,9198,PROCESSING
21,2013-07-25 00:00:00.0,2711,PENDING
22,2013-07-25 00:00:00.0,333,COMPLETE
23,2013-07-25 00:00:00.0,4367,PENDING_PAYMENT
24,2013-07-25 00:00:00.0,11441,CLOSED
25,2013-07-25 00:00:00.0,9503,CLOSED
26,2013-07-25 00:00:00.0,7562,COMPLETE
27,2013-07-25 00:00:00.0,3241,PENDING_PAYMENT
28,2013-07-25 00:00:00.0,656,COMPLETE
29,2013-07-25 00:00:00.0,196,PROCESSING
30,2013-07-25 00:00:00.0,10039,PENDING_PAYMENT
Now that you know what the input file looks like, let us create it.
You can use the following commands to create this input CSV file in the local file system.
vi post49.csv
#####################################
## PASTE THE COPIED CONTENTS HERE ##
#####################################
cat post49.csv
The output of the above command looks as follows.

As you can see from the above screenshot, the input CSV file was successfully created in the local file system.
Now, we will move on to the next step and that is to check the pre-existence of the Hive table.
- CHECKING THE PRE-EXISTENCE OF THE HIVE TABLE
We use the “SHOW TABLES;” command to get the list of the tables in Hive.
show tables;
The output of the above command is as follows.

As you can see, there is no table named post49. This confirms that we can go ahead and create the Hive table with the name post49.
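If you want to narrow the check to just this table, Hive also accepts a pattern with SHOW TABLES. This is a quick optional check, not part of the original screenshots.
show tables like 'post49*';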
- CREATING THE HIVE TABLE WITH A SCHEMA MATCHING THE INPUT CSV FILE
Please have a look at the input CSV file before creating the Hive table. The Hive table schema should exactly match the input CSV file.
After taking a quick look at the input CSV file, you can conclude that the Hive table must contain four columns. Out of these four columns, the first one is an integer, the second one is a date (which we will store as a STRING), the third one is an integer, and the last one is a STRING. Therefore, we are going to use the following command to create the Hive table post49.
I have uploaded this CREATE TABLE statement to my GitHub profile under the HDPCD repository. The file is named “61_hive_create_table_for_subquery.sql” and you can save/download it by clicking here. It looks as follows.
create table post49 (
order_id int,
order_date string,
order_amt int,
order_status string
)
row format delimited
fields terminated by ','
stored as textfile;
The output of the above command is as follows.

As you can see from the above screenshot, the OK statement at the end of the command execution confirms that the table was created without errors.
Now, let us confirm the existence of the Hive table post49.
- CONFIRM THAT THE HIVE TABLE GOT CREATED SUCCESSFULLY
We use the “show tables;” command to check the existence of the Hive table.
show tables;
The output of the above command is as follows.

The output shown in the above screenshot confirms that the Hive table was created successfully.
The next step is to check that this Hive table follows the same schema as the input CSV file.
- CONFIRM THAT THE HIVE TABLE SCHEMA MATCHES THE INPUT CSV FILE
We use the “desc formatted post49;” command to check the formatted schema of the Hive table post49.
desc formatted post49;
The output of the above command is as follows.

The output shown in the above screenshot suggests that the Hive table follows the same schema as the input CSV file.
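If you only need the column names and types rather than the full formatted metadata, the shorter DESCRIBE command is an alternative. This is an optional shortcut, not shown in the original screenshots.
describe post49;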
Now, we are good to load the input file into the Hive table.
- LOAD THE INPUT CSV FILE IN THE HIVE TABLE
We use the “LOAD” command for loading the input CSV file into the Hive table.
load data local inpath 'post49.csv' into table post49;
The output of the above command looks as follows.

From the above output screenshot, it looks like the file was loaded successfully into the Hive table.
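A small note in case you ever re-run this step: the INTO keyword appends data, so loading the same file twice duplicates the records. Hive also supports the OVERWRITE keyword for a clean re-load, shown here only as an optional variant and not used in this tutorial.
load data local inpath 'post49.csv' overwrite into table post49;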
We will go one step further and check the actual records in the Hive table post49.
- CONFIRM THAT DATA GOT LOADED SUCCESSFULLY INTO THE HIVE TABLE
We use the “SELECT” command to check the records stored in the Hive table post49.
select * from post49;
The output of the above command is as follows.

The above output screenshot confirms that the file was loaded successfully and that all 30 records made it into the Hive table.
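If you prefer a quick numeric check instead of scanning the full output, a simple count works as well. This is an optional verification step and should report 30 rows for this input file.
select count(*) from post49;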
This enables us to examine the subquery which we are going to use in the Hive query.
Let us have a look at that subquery.
- CHECK THE QUERY WHICH WILL BE USED AS A SUBQUERY
The query which we are going to run as a subquery within the Hive query is as follows.
select DISTINCT(id) from post41;
The above query returns all the unique IDs present in the Hive table post41.
The output of the above query is shown in the following screenshot to better explain the Hive query result.

As you can see from the above screenshot, the output of this query is a list of numbers, which are the unique values of the id column in the Hive table post41.
We are going to use this query as a part of another Hive query as a subquery.
Let us see how we are going to do that.
- EXECUTE THE SUBQUERY WITHIN A HIVE QUERY
The Hive query which uses the above query as a subquery is as follows.
select * from post49 where order_id IN (select DISTINCT(id) from post41);
The objective of the above Hive query is to return only those records in Hive table post49 for which there exists an ID in the Hive table post41.
To put subqueries in perspective, they can act as replacements for JOIN operations to some extent. In general, though, JOINs perform better than equivalent subqueries, which is why most real-world workloads use JOIN operations instead of subqueries.
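For comparison, here is roughly how the same filter could be written as a join. This is only a sketch, assuming the post41 table has an id column as used above; it is not part of the exam objective itself.
select o.* from post49 o left semi join post41 p on (o.order_id = p.id);
A LEFT SEMI JOIN returns a row from the left table when at least one match exists on the right, which mirrors the IN-subquery semantics.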
Now, if the subquery usage is clear, let us have a look at the output generated by the above subquery.

As you can see from the above screenshot, the above subquery returned only 9 records out of the total of 30 records.
One thing to notice here is that this query internally runs a MapReduce job and the output is generated by that MapReduce job.
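If you are curious about the MapReduce stages Hive generates for this query, you can prefix it with EXPLAIN. This is an optional step, not shown in the screenshots.
explain select * from post49 where order_id IN (select DISTINCT(id) from post41);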
The Hive subquery returns the expected output.
This enables us to safely say that the objective of this tutorial is met and we can conclude this tutorial here.
In the next tutorial, we are going to see how to order the output of a Hive query across multiple reducers.
If you liked the content, share it with your friends. Please hit the Like button.
You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.