Hello, everyone. Welcome to one more tutorial in the HDPCD certification series.
In the last tutorial, we saw how to enable vectorization in Hive.
In this tutorial, we are going to see how to order the output of a Hive query across multiple reducers.
Let us begin, then.
The following infographics show the step-by-step process of performing this operation.

From the above infographics, we can say that this tutorial can be performed using the following steps.
- CREATING INPUT CSV FILE IN THE LOCAL FILE SYSTEM
As already seen in a lot of my tutorials, we are going to use the VI EDITOR to create the input CSV file in the local file system.
I have uploaded this input CSV file to my GitHub profile under the HDPCD repository. The name of this file is “62_input_file_for_ordering_output.csv” and you can save/download this file by clicking here. It looks as follows.
1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
5,2013-07-25 00:00:00.0,11318,COMPLETE
6,2013-07-25 00:00:00.0,7130,COMPLETE
7,2013-07-25 00:00:00.0,4530,COMPLETE
8,2013-07-25 00:00:00.0,2911,PROCESSING
9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT
10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT
11,2013-07-25 00:00:00.0,918,PAYMENT_REVIEW
12,2013-07-25 00:00:00.0,1837,CLOSED
13,2013-07-25 00:00:00.0,9149,PENDING_PAYMENT
14,2013-07-25 00:00:00.0,9842,PROCESSING
15,2013-07-25 00:00:00.0,2568,COMPLETE
16,2013-07-25 00:00:00.0,7276,PENDING_PAYMENT
17,2013-07-25 00:00:00.0,2667,COMPLETE
18,2013-07-25 00:00:00.0,1205,CLOSED
19,2013-07-25 00:00:00.0,9488,PENDING_PAYMENT
20,2013-07-25 00:00:00.0,9198,PROCESSING
21,2013-07-25 00:00:00.0,2711,PENDING
22,2013-07-25 00:00:00.0,333,COMPLETE
23,2013-07-25 00:00:00.0,4367,PENDING_PAYMENT
24,2013-07-25 00:00:00.0,11441,CLOSED
25,2013-07-25 00:00:00.0,9503,CLOSED
26,2013-07-25 00:00:00.0,7562,COMPLETE
27,2013-07-25 00:00:00.0,3241,PENDING_PAYMENT
28,2013-07-25 00:00:00.0,656,COMPLETE
29,2013-07-25 00:00:00.0,196,PROCESSING
30,2013-07-25 00:00:00.0,10039,PENDING_PAYMENT
Now that you know what the input file looks like, let us create it.
You can use the following commands to create this input CSV file in the local file system.
vi post50.csv
#####################################
## PASTE THE COPIED CONTENTS HERE ##
#####################################
cat post50.csv
The output of the above command looks as follows.

As you can see from the above screenshot, the input CSV file was successfully created in the local file system.
Now, we will move on to the next step, which is to check whether the Hive table already exists.
- CHECKING THE PRE-EXISTENCE OF THE HIVE TABLE
We use the “SHOW TABLES;” command to get the list of the tables in Hive.
show tables;
The output of the above command is as follows.

As you can see, there is no table with the name post50. This confirms that we can go ahead and create the Hive table with the name post50.
- CREATING THE HIVE TABLE WITH THE SCHEMA MATCHING THE INPUT CSV FILE
Please have a look at the input CSV file before creating the Hive table. The Hive table schema should exactly match the input CSV file.
After taking a quick look at the input CSV file, you can conclude that the Hive table must contain four columns. Out of these four columns, the first one is an integer, the second one is a timestamp (which we will store as a STRING), the third one is also an integer, and the last one is a STRING. Therefore, we are going to use the following command to create the Hive table post50.
I have uploaded this script to my GitHub profile under the HDPCD repository. The name of this file is “63_hive_create_table_for_order_by.sql” and you can save/download this file by clicking here. It looks as follows.
create table post50 (
  order_id int,
  order_date string,
  order_amt int,
  order_status string
)
row format delimited
fields terminated by ','
stored as textfile;
The output of the above command is as follows.

As you can see from the above screenshot, the OK statement at the end of the command execution indicates that the table was created without any errors.
Now, let us confirm the existence of the Hive table post50.
- CONFIRM THAT THE HIVE TABLE GOT CREATED SUCCESSFULLY
We use the “show tables;” command to check the existence of the Hive table.
show tables;
The output of the above command is as follows.

The output shown in the above screenshot confirms that the Hive table was created successfully.
The next step is to check that this Hive table follows the same schema as the input CSV file.
- CONFIRM THE HIVE TABLE SCHEMA MATCHES THE INPUT CSV FILE
We use the “desc formatted post50;” command to check the formatted schema of the Hive table post50.
desc formatted post50;
The output of the above command is as follows.

The output shown in the above screenshot suggests the Hive table follows the same schema as the input CSV file.
Now, we are good to load the input file into the Hive table.
- LOAD THE INPUT CSV FILE IN THE HIVE TABLE
We use the “LOAD” command for loading the input CSV file into the Hive table.
load data local inpath 'post50.csv' into table post50;
The output of the above command looks as follows.

From the above output screenshot, it looks like the file was loaded successfully into the Hive table.
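One thing worth knowing is that “LOAD DATA … INTO TABLE” appends to whatever the table already contains, so running the command twice would duplicate the records. If you ever need to rerun this step with a clean slate, Hive supports the OVERWRITE variant, shown below as a minimal sketch.
-- replaces the existing contents of post50 instead of appending to them
load data local inpath 'post50.csv' overwrite into table post50;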
We will go one step further and check the actual records in the Hive table post50.
- CONFIRM THAT DATA GOT LOADED SUCCESSFULLY INTO THE HIVE TABLE
We use the “SELECT” command to check the records stored in the Hive table post50.
select * from post50;
The output of the above command is as follows.

The above output screenshot confirms that the file was loaded successfully and that all 30 records from the input CSV file made it into the Hive table.
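If you prefer a numeric check over eyeballing the rows, a simple COUNT query works as a quick sanity test; note that, unlike a plain “SELECT *”, it typically launches a MapReduce job of its own.
-- should return 30 for our input file
select count(*) from post50;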
We can now examine the query that orders the output using the ORDER BY clause.
Let us have a look at that Hive query.
- RUN HIVE QUERY WHICH ORDERS THE OUTPUT ACROSS MULTIPLE REDUCERS
The following Hive query sorts the output in descending order of the order ID.
select * from post50 ORDER BY order_id DESC;
The above query returns all the records in the Hive table post50, ordered by the ORDER ID in DESCENDING order.
The output of the above query is shown in the following screenshot to better explain the Hive query result.

As you can see from the above screenshot, the above Hive query returned the output sorted in the descending order of the ORDER ID.
One thing to notice here is that this query internally runs a MapReduce job, and the output is generated by that MapReduce job.
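It is also worth knowing that ORDER BY guarantees a total ordering, and Hive achieves this by pushing all of the data through a single reducer. If a per-reducer ordering is acceptable, Hive additionally offers the SORT BY clause, which lets multiple reducers sort their portions of the data in parallel. The following is a minimal sketch; the reducer count is set here purely for illustration.
-- ask for two reducers (illustrative value only)
set mapreduce.job.reduces=2;
-- each reducer sorts its own share of the rows; no global ordering is guaranteed
select * from post50 SORT BY order_id DESC;
For the objective of this tutorial, though, the ORDER BY query above does the job.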
The Hive query returns the expected output.
This enables us to safely say that the objective of this tutorial is met and we can conclude this tutorial here.
The next tutorial is going to be the last technical tutorial in this certification series, after which I will conclude this tutorial series with a few tips about the certification. In the next tutorial, we are going to see how to set an environment property from a Hive query.
If you liked the content, share it with your friends. Please hit the Like button.
You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.