Post 50 | HDPCD | Order Hive query output across multiple reducers

Hello, everyone. Welcome to one more tutorial in the HDPCD certification series.

In the last tutorial, we saw how to enable vectorization in Hive.

In this tutorial, we are going to see how to order Hive query output using the ORDER BY clause.

Let us begin, then.

The following infographic shows the step-by-step process of performing this operation.

Apache Hive: Ordering output across multiple reducers

From the above infographics, we can say that this tutorial can be performed using the following steps.

  • CREATING INPUT CSV FILE IN THE LOCAL FILE SYSTEM

As already seen in a lot of my tutorials, we are going to use the VI EDITOR to create the input CSV file in the local file system.

I have uploaded this input CSV file to my GitHub profile under the HDPCD repository. The name of this file is “62_input_file_for_ordering_output.csv” and you can save/download this file by clicking here. It looks as follows.

Now that you know what the input file looks like, let us create it.

You can use the following commands to create this input CSV file in the local file system.

vi post50.csv

#####################################
## PASTE THE COPIED CONTENTS HERE ##
#####################################

cat post50.csv

The output of the above command looks as follows.

Step 1: Creating the input CSV file in the local file system

As you can see from the above screenshot, the input CSV file was successfully created in the local file system.

Now, we will move on to the next step and that is to check the pre-existence of the Hive table.

  • CHECKING THE PRE-EXISTENCE OF THE HIVE TABLE

We use the “SHOW TABLES;” command to get the list of the tables in Hive.

show tables;

The output of the above command is as follows.

Step 2: Checking the pre-existence of the Hive table

As you can see, there is no table with the name post50. This confirms that we can go ahead and create the Hive table with the name post50.

  • CREATING THE HIVE TABLE WITH THE SAME SCHEMA AS THE INPUT CSV FILE

Please have a look at the input CSV file before creating the Hive table. The Hive table schema should exactly match the input CSV file.

After taking a quick look at the input CSV file, you can conclude that the Hive table must contain four columns. Out of these four columns, the first one is an INTEGER, the second one is a DATE, the third one is a number, and the last one is a STRING. Therefore, we are going to use the following command to create the Hive table post50.

I have uploaded this SQL file to my GitHub profile under the HDPCD repository. The name of this file is “63_hive_create_table_for_order_by.sql” and you can save/download this file by clicking here. It looks as follows.
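The exact statement is in the linked SQL file, but based on the four-column description above, it is presumably along these lines (the column names other than order_id, which the ORDER BY query below uses, are my assumptions, not necessarily the ones in the file):

```sql
-- Hypothetical sketch of the CREATE TABLE statement; the authoritative
-- version is in 63_hive_create_table_for_order_by.sql on GitHub.
CREATE TABLE post50 (
  order_id    INT,      -- first column: an integer, used later in ORDER BY
  order_date  DATE,     -- second column: a date
  amount      DOUBLE,   -- third column: a number
  customer    STRING    -- last column: a string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','  -- matches the comma-separated input file
STORED AS TEXTFILE;
```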

The output of the above command is as follows.

Step 3: Creating the Hive table with the same schema as the input CSV file

As you can see from the above screenshot, the OK statement appears at the end of the command execution.

Now, let us confirm the existence of the Hive table post50.

  • CONFIRM THAT THE HIVE TABLE GOT CREATED SUCCESSFULLY

We use the “show tables;” command to check the existence of the Hive table.

show tables;

The output of the above command is as follows.

Step 4: Confirm that the Hive table was successfully created

The output shown in the above screenshot confirms that the Hive table was created successfully.

The next step is to check that this Hive table follows the same schema as the input CSV file.

  • CONFIRM THAT THE HIVE TABLE SCHEMA MATCHES THE INPUT CSV FILE

We use the “desc formatted post50;” command to check the formatted schema of the Hive table post50.

desc formatted post50;

The output of the above command is as follows.

Step 5: Confirm that the Hive table follows the same schema as the input CSV file

The output shown in the above screenshot suggests the Hive table follows the same schema as the input CSV file.

Now, we are good to load the input file into the Hive table.

  • LOAD THE INPUT CSV FILE IN THE HIVE TABLE

We use the “LOAD” command for loading the input CSV file into the Hive table.

load data local inpath 'post50.csv' into table post50;

The output of the above command looks as follows.

Step 6: Load the input CSV file into the Hive table

From the above output screenshot, it looks like the file was loaded successfully into the Hive table.

We will go one step further and check the actual records in the Hive table post50.

  • CONFIRM THAT DATA GOT LOADED SUCCESSFULLY INTO THE HIVE TABLE

We use the “SELECT” command to check the records stored in the Hive table post50.

select * from post50;

The output of the above command is as follows.

Step 7: Check that the data was successfully loaded into the Hive table

The above output screenshot confirms that the file was successfully loaded and we got the exact number of records loaded into the Hive table.
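Besides eyeballing the full listing, a quick extra check (not part of the original steps) is to count the rows and compare the number against the line count of the input file:

```sql
-- The row count should equal the number of lines in post50.csv.
select count(*) from post50;
```

Note that, unlike the plain `select *`, this count query triggers a MapReduce job.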

This enables us to examine the query which we are going to use to order the output using the ORDER BY statement.

Let us have a look at that Hive query.

  • RUN HIVE QUERY WHICH ORDERS THE OUTPUT ACROSS MULTIPLE REDUCERS

The following Hive query sorts the output in descending order.

select * from post50 ORDER BY order_id DESC;

The above query returns all the records in the Hive table post50 and the output is ordered by the ORDER ID in the DESCENDING order.

The output of the above query is shown in the following screenshot to better explain the Hive query result.

Step 8: Run the Hive query to order the output across multiple reducers

As you can see from the above screenshot, the above Hive query returned the output sorted in the descending order of the ORDER ID.

One thing to notice here is that this query internally runs a MapReduce job and the output is generated by that MapReduce job.
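Worth knowing in this context: in Hive, ORDER BY guarantees a single total ordering of the final output, whereas SORT BY only orders rows within each reducer. If you want per-reducer ordering across multiple reducers, the SORT BY variant (optionally combined with DISTRIBUTE BY) looks like this sketch:

```sql
-- Each reducer's output file is sorted, but there is no single global order.
set mapreduce.job.reduces=2;
select * from post50 SORT BY order_id DESC;

-- DISTRIBUTE BY controls which reducer each row goes to before sorting:
select * from post50 DISTRIBUTE BY order_id SORT BY order_id DESC;
```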

The Hive query returns the expected output.

This enables us to safely say that the objective of this tutorial is met and we can conclude this tutorial here.

The next tutorial is going to be the last technical tutorial in this certification series, after which I will conclude this tutorial series with a few tips about the certification. In the next tutorial, we are going to see how to set an environment property from a Hive query.

If you liked the content, share it with your friends. Please hit the Like button.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.
