Post 49 | HDPCD | Using a subquery within a Hive query

Hello, everyone. Welcome to one more tutorial in the HDPCD certification series.

In the last tutorial, we saw how to enable vectorization in Hive.

In this tutorial, we are going to see how to run a subquery within a Hive query.

Let us begin, then.

Apache Hive: Running subquery within a query

As you can see from the above screenshot, the process of achieving this objective involves the following set of steps.

We will cover each of these steps as follows.

  • CREATING INPUT CSV FILE IN THE LOCAL FILE SYSTEM

As already seen in many of my tutorials, we are going to use the VI EDITOR to create the input CSV file in the local file system.

I have uploaded this input CSV file to my GitHub profile under the HDPCD repository. The name of this file is “60_input_file_for_subquery.csv” and you can save/download this file by clicking here. It looks as follows.

Now that you know what the input file looks like, let us create it.

You can use the following commands to create this input CSV file in the local file system.

vi post49.csv

#####################################
## PASTE THE COPIED CONTENTS HERE ##
#####################################

cat post49.csv

The output of the above command looks as follows.

Step 1: Creating input CSV file in the local file system

As you can see from the above screenshot, the input CSV file was successfully created in the local file system.
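Before moving on, you can optionally run a couple of quick checks on the file; the commands below assume the file sits in your current working directory.

# Optional sanity checks on the newly created file
wc -l post49.csv    # number of lines (records) in the file
head -5 post49.csv  # preview the first five lines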

Now, we will move on to the next step, which is to check whether the Hive table already exists.

  • CHECKING THE PRE-EXISTENCE OF THE HIVE TABLE

We use the “SHOW TABLES;” command to get the list of the tables in Hive.

show tables;

The output of the above command is as follows.

Step 2: Checking the pre-existence of the Hive table

As you can see, there is no table with the name post49. This confirms that we can go ahead and create the Hive table with that name.

  • CREATING THE HIVE TABLE WITH THE SCHEMA MATCHING THE INPUT CSV FILE

Please have a look at the input CSV file before creating the Hive table. The Hive table schema should exactly match the input CSV file.

After taking a quick look at the input CSV file, you can conclude that the Hive table must contain four columns. Out of these four columns, the first one is an INTEGER, the second one is a DATE, the third one is a number, and the last one is a STRING. Therefore, we are going to use the following command to create the Hive table post49.

I have uploaded this HiveQL script to my GitHub profile under the HDPCD repository. The name of this file is “61_hive_create_table_for_subquery.sql” and you can save/download this file by clicking here. It looks as follows.
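In case you prefer to see it inline, here is a minimal sketch of what such a CREATE TABLE statement could look like. The column name order_id comes from the query used later in this post; the remaining column names and the exact types are my assumptions, so please refer to the actual SQL file for the real definition.

-- Minimal sketch only; column names other than order_id and the exact types are assumptions.
-- The row format must match the comma-separated input file.
CREATE TABLE post49 (
  order_id     INT,
  order_date   DATE,
  order_amount DOUBLE,
  order_status STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;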

The output of the above command is as follows.

Step 3: Creating Hive table with matching schema

As you can see from the above screenshot, the OK statement at the end of the command execution indicates that the table was created successfully.

Now, let us confirm the existence of the Hive table post49.

  • CONFIRM THAT THE HIVE TABLE GOT CREATED SUCCESSFULLY

We use the “show tables;” command to check the existence of the Hive table.

show tables;

The output of the above command is as follows.

Step 4: Confirming the existence of newly created Hive table

The output shown in the above screenshot confirms that the Hive table was created successfully.

The next step is to check that this Hive table follows the same schema as the input CSV file.

  • CONFIRM THAT THE HIVE TABLE SCHEMA MATCHES THE INPUT CSV FILE

We use the “desc formatted post49;” command to check the formatted schema of the Hive table post49.

desc formatted post49;

The output of the above command is as follows.

Step 5: Checking the schema of the Hive table

The output shown in the above screenshot suggests the Hive table follows the same schema as the input CSV file.

Now, we are good to load the input file into the Hive table.

  • LOAD THE INPUT CSV FILE INTO THE HIVE TABLE

We use the “LOAD” command for loading the input CSV file into the Hive table.

load data local inpath 'post49.csv' into table post49;

The output of the above command looks as follows.

Step 6: Loading the input CSV file in the Hive table

From the above output screenshot, it looks like the file was loaded successfully into the Hive table.
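If you want an extra sanity check, the Hive CLI can list the table's directory in HDFS with the dfs command; the path below assumes the default HDP warehouse location and the default database, so adjust it for your environment.

-- Assumes the default HDP warehouse path and the default database.
dfs -ls /apps/hive/warehouse/post49;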

We will go one step further and check the actual records in the Hive table post49.

  • CONFIRM THAT DATA GOT LOADED SUCCESSFULLY INTO THE HIVE TABLE

We use the “SELECT” command to check the records stored in the Hive table post49.

select * from post49;

The output of the above command is as follows.

Step 7: Confirming that data got loaded into Hive table successfully

The above output screenshot confirms that the file was successfully loaded and that the expected number of records was loaded into the Hive table.
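As an additional check, you can compare the table's row count with the line count of the input file; the sketch below assumes the CSV has no header row.

-- Row count in the table; this should match the output of `wc -l post49.csv`,
-- assuming the CSV file has no header row.
select count(*) from post49;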

This enables us to examine the subquery which we are going to use in the Hive query.

Let us have a look at that subquery.

  • CHECK THE QUERY WHICH WILL BE USED AS A SUBQUERY

The query which we are going to run as a subquery within a Hive query is as follows.

select DISTINCT(id) from post41;

The above query returns all the unique IDs present in the Hive table post41.

The output of the above query is shown in the following screenshot.

Step 8: Checking the inner query execution output

As you can see from the above screenshot, the output of this query is a set of numbers, which are the distinct values of the id field in the Hive table post41.

We are going to use this query as a subquery within another Hive query.

Let us see how we are going to do that.

  • EXECUTE THE SUBQUERY WITHIN A HIVE QUERY

The Hive query which uses the above query as a subquery is as follows.

select * from post49 where order_id IN (select DISTINCT(id) from post41);

The objective of the above Hive query is to return only those records in Hive table post49 for which there exists an ID in the Hive table post41.

To put subqueries in context, they can act as replacements for JOIN operations to some extent. When compared, JOINs often perform better than subqueries, which is why most real-world scenarios use JOIN operations instead of subqueries; a JOIN-based rewrite is sketched below.
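For illustration, an IN subquery like the one above can usually be rewritten as a LEFT SEMI JOIN in Hive. The sketch below is one possible equivalent form and not the query used in this tutorial; it reuses the order_id and id columns referenced above.

-- A possible LEFT SEMI JOIN rewrite of the IN subquery (sketch, not the original query).
select p49.*
from post49 p49
left semi join post41 p41
  on (p49.order_id = p41.id);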

Now that the subquery usage is clear, let us have a look at the output generated by the above query.

Step 9: Running the Hive subquery within a query

As you can see from the above screenshot, the above query returned only 9 records out of the total of 30 records.

One thing to notice here is that this query internally runs a MapReduce job and the output is generated by that MapReduce job.
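If you are curious about that execution plan, Hive's EXPLAIN command shows the stages (including the MapReduce stages) the query compiles into; the sketch below simply prefixes the same query with EXPLAIN.

-- Inspect the execution plan of the subquery-based query.
explain select * from post49 where order_id IN (select DISTINCT(id) from post41);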

The Hive subquery returns the expected output.

This enables us to safely say that the objective of this tutorial is met, and we can conclude the tutorial here.

In the next tutorial, we are going to see how to order the output of a Hive query across multiple reducers.

If you liked the content, share it with your friends. Please hit the Like button.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.
