Post 51 | HDPCD | Set Hadoop or Hive Configuration property

Hello, everyone. Welcome to the last technical tutorial in the HDPCD certification series.

It’s funny! This beautiful journey is coming to an end.

In the last tutorial, we saw how to sort the output of a Hive query across multiple reducers.

In this tutorial, we are going to see how to set a Hadoop or Hive configuration property.

Let us begin, then.

This is one of the easiest tutorials in this certification series. Maybe that is why it was saved for last (just kidding).

The following commands demonstrate the usage of the SET command, which is used for setting Hadoop or Hive configuration properties.

Let us see each of these commands one by one.

  • SETTING THE VALUE OF HIVE EXECUTION ENGINE

To do this, we first check the current value of the Hive execution engine using the SET command, as follows.

set hive.execution.engine;

The above command gives us the default value of the Hive execution engine.

You can see from the above screenshot that the default value of the Hive execution engine is “tez”.

Now, if you want to change the Hive execution engine property to MapReduce, you can use the following command.

set hive.execution.engine=mr;

To check whether the above command executed successfully or not, you can run the following command and check the output.

set hive.execution.engine;

The output of the above command shows that MapReduce is the Hive execution engine, as expected.

Few examples showing usage of SET command

The above screenshot shows that the Hive execution engine property was successfully changed to MapReduce from its default value of Tez.

Similarly, we can see that I have changed other properties, such as the scratch directory location and the Hive header printing flag.
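As a sketch of how those two changes would look (the property names below, hive.exec.scratchdir and hive.cli.print.header, are standard Hive settings, but the values are purely illustrative, not the ones from my screenshot):

```sql
-- Check, then change, the Hive scratch directory location
-- (the path below is an example value, not a recommendation).
SET hive.exec.scratchdir;
SET hive.exec.scratchdir=/tmp/my_hive_scratch;

-- Check, then enable, printing of column headers in query output.
SET hive.cli.print.header;
SET hive.cli.print.header=true;
```

The pattern is the same for any property: `SET property.name;` prints the current value, and `SET property.name=value;` changes it for the current session.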

To give you the gist of this tutorial, I would say it enables you to correctly use the “SET” command.

This concludes the last technical tutorial in the HDPCD certification series.

I hope you guys find these tutorials useful, and I would feel privileged if even one person benefits from them.

Do not forget to let me know if these tutorials came in handy and were useful for clearing the HDPCD certification. You can contact me using the below form.

The next tutorial is going to be the last tutorial in this certification series, in which I will mention how to register for the exam and what precautions to take on the exam day.

If you liked the content, share it with your friends. Please hit the Like button.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

Post 50 | HDPCD | Order Hive query output across multiple reducers

Hello, everyone. Welcome to one more tutorial in the HDPCD certification series.

In the last tutorial, we saw how to run a subquery within a Hive query.

In this tutorial, we are going to see how to order the output of a Hive query across multiple reducers.

Let us begin, then.

The following infographics show the step-by-step process of performing this operation.

Apache Hive: Ordering output across multiple reducers

From the above infographics, we can say that this tutorial can be performed using the following steps.

  • CREATING INPUT CSV FILE IN THE LOCAL FILE SYSTEM

As already seen in a lot of my tutorials, we are going to use the VI EDITOR to create the input CSV file in the local file system.

I have uploaded this input CSV file to my GitHub profile under the HDPCD repository. The name of this file is “62_input_file_for_ordering_output.csv” and you can save/download it by clicking here. It looks as follows.

Now that you know what the input file looks like, let us create it.

You can use the following commands to create this input CSV file in the local file system.

vi post50.csv

#####################################
## PASTE THE COPIED CONTENTS HERE ##
#####################################

cat post50.csv

The output of the above command looks as follows.

Step 1: Creating input CSV file in the local file system

As you can see from the above screenshot, the input CSV file was successfully created in the local file system.

Now, we will move on to the next step and that is to check the pre-existence of the Hive table.

  • CHECKING THE PRE-EXISTENCE OF THE HIVE TABLE

We use the “SHOW TABLES;” command to get the list of the tables in Hive.

show tables;

The output of the above command is as follows.

Step 2: Checking the pre-existence of the Hive table

As you can see, there is no table with the name post50. This confirms that we can go ahead and create the Hive table with the name post50.

  • CREATING THE HIVE TABLE WITH THE SAME SCHEMA AS THE INPUT CSV FILE

Please have a look at the input CSV file before creating the Hive table. The Hive table schema should exactly match the input CSV file.

After taking a quick look at the input CSV file, you can conclude that the Hive table must contain four columns. Out of these four columns, the first one is an integer, the second one is a DATE, the third one contains a number, and the last one is a STRING. Therefore, we are going to use the following command to create the Hive table post50.
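Purely as a sketch of such a statement (the column names and exact types below are my assumptions based on the description above; the actual DDL is in the SQL file linked below):

```sql
-- Hypothetical sketch only; the real DDL file is linked in this post.
CREATE TABLE post50 (
  order_id     INT,     -- the integer column
  order_date   DATE,    -- the date column
  order_amount DOUBLE,  -- the numeric column
  order_status STRING   -- the string column
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```

The ROW FORMAT clause matters here: without FIELDS TERMINATED BY ',', Hive would not split the CSV columns correctly at load time.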

I have uploaded this SQL file to my GitHub profile under the HDPCD repository. The name of this file is “63_hive_create_table_for_order_by.sql” and you can save/download it by clicking here. It looks as follows.

The output of the above command is as follows.

Step 3: Creating the Hive table with the same schema as the input CSV file

As you can see from the above screenshot, the OK statement appears at the end of the command execution.

Now, let us confirm the existence of the Hive table post50.

  • CONFIRM THAT THE HIVE TABLE GOT CREATED SUCCESSFULLY

We use the “show tables;” command to check the existence of the Hive table.

show tables;

The output of the above command is as follows.

Step 4: Confirm that the Hive table was successfully created

The output shown in the above screenshot confirms that the Hive table was created successfully.

The next step is to check that this Hive table follows the same schema as the input CSV file.

  • CONFIRM THAT THE HIVE TABLE SCHEMA MATCHES THE INPUT CSV FILE

We use the “desc formatted post50;” command to check the formatted schema of the Hive table post50.

desc formatted post50;

The output of the above command is as follows.

Step 5: Confirm that the Hive table follows the same schema as input CSV file

The output shown in the above screenshot suggests the Hive table follows the same schema as the input CSV file.

Now, we are good to load the input file into the Hive table.

  • LOAD THE INPUT CSV FILE INTO THE HIVE TABLE

We use the “LOAD” command for loading the input CSV file into the Hive table.

load data local inpath 'post50.csv' into table post50;

The output of the above command looks as follows.

Step 6: Load the input CSV file in the Hive table

From the above output screenshot, it looks like the file was loaded successfully into the Hive table.

We will go one step further and check the actual records in the Hive table post50.

  • CONFIRM THAT DATA GOT LOADED SUCCESSFULLY INTO THE HIVE TABLE

We use the “SELECT” command to check the records stored in the Hive table post50.

select * from post50;

The output of the above command is as follows.

Step 7: Check that the data was successfully loaded into the Hive table

The above output screenshot confirms that the file was successfully loaded and we got the exact number of records loaded into the Hive table.

This enables us to examine the query which we are going to use to order the output using the ORDER BY statement.

Let us have a look at that Hive query.

  • RUN HIVE QUERY WHICH ORDERS THE OUTPUT ACROSS MULTIPLE REDUCERS

The Hive query which we are going to use sorts the output in descending order. It is as follows.

select * from post50 ORDER BY order_id DESC;

The above query returns all the records in the Hive table post50 and the output is ordered by the ORDER ID in the DESCENDING order.

The output of the above query is shown in the following screenshot to better explain the Hive query result.

Step 8: Run the Hive query to order the output across multiple reducers

As you can see from the above screenshot, the above Hive query returned the output sorted in the descending order of the ORDER ID.

One thing to notice here is that this query internally runs a MapReduce job and the output is generated by that MapReduce job.

The Hive query returns the expected output.
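One hedged note worth adding here, since it is easy to mix the two up: Hive's ORDER BY guarantees a total ordering of the final result, while the related SORT BY clause only sorts rows within each reducer. A quick side-by-side sketch using the same table:

```sql
-- Total ordering: the final output is fully sorted end to end,
-- regardless of how many reducers ran.
SELECT * FROM post50 ORDER BY order_id DESC;

-- Per-reducer ordering: each reducer's output is sorted independently,
-- so with multiple reducers the overall output is only partially ordered.
SELECT * FROM post50 SORT BY order_id DESC;
```

In practice, ORDER BY is the one to use when the entire result must be sorted, even though it can be slower on large tables.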

This enables us to safely say that the objective of this tutorial is met and we can conclude this tutorial here.

The next tutorial is going to be the last technical tutorial in this certification series, after which I will conclude this tutorial series with a few tips about the certification. In the next tutorial, we are going to see how to set an environment property from a Hive query.

If you liked the content, share it with your friends. Please hit the Like button.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

Post 49 | HDPCD | Using a subquery within a Hive query

Hello, everyone. Welcome to one more tutorial in the HDPCD certification series.

In the last tutorial, we saw how to print the execution plan of a Hive query.

In this tutorial, we are going to see how to run a subquery within a Hive query.

Let us begin, then.

Apache Hive: Running subquery within a query

As you can see from the above screenshot, the process of performing this objective follows the following set of steps.

We will cover each of these steps as follows.

  • CREATING INPUT CSV FILE IN THE LOCAL FILE SYSTEM

As already seen in a lot of my tutorials, we are going to use the VI EDITOR to create the input CSV file in the local file system.

I have uploaded this input CSV file to my GitHub profile under the HDPCD repository. The name of this file is “60_input_file_for_subquery.csv” and you can save/download it by clicking here. It looks as follows.

Now that you know what the input file looks like, let us create it.

You can use the following commands to create this input CSV file in the local file system.

vi post49.csv

#####################################
## PASTE THE COPIED CONTENTS HERE ##
#####################################

cat post49.csv

The output of the above command looks as follows.

Step 1: Creating input CSV file in the local file system

As you can see from the above screenshot, the input CSV file was successfully created in the local file system.

Now, we will move on to the next step and that is to check the pre-existence of the Hive table.

  • CHECKING THE PRE-EXISTENCE OF THE HIVE TABLE

We use the “SHOW TABLES;” command to get the list of the tables in Hive.

show tables;

The output of the above command is as follows.

Step 2: Checking the pre-existence of the Hive table

As you can see, there is no table with the name post49. This confirms that we can go ahead and create the Hive table with the name post49.

  • CREATING THE HIVE TABLE WITH THE SAME SCHEMA AS THE INPUT CSV FILE

Please have a look at the input CSV file before creating the Hive table. The Hive table schema should exactly match the input CSV file.

After taking a quick look at the input CSV file, you can conclude that the Hive table must contain four columns. Out of these four columns, the first one is an integer, the second one is a DATE, the third one contains a number, and the last one is a STRING. Therefore, we are going to use the following command to create the Hive table post49.
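As with the other tables in this series, a purely hypothetical sketch of the DDL (column names and types below are my assumptions; the actual statement is in the SQL file linked below):

```sql
-- Hypothetical sketch only; the real DDL file is linked in this post.
CREATE TABLE post49 (
  order_id     INT,     -- the integer column
  order_date   DATE,    -- the date column
  order_amount DOUBLE,  -- the numeric column
  order_status STRING   -- the string column
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```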

I have uploaded this SQL file to my GitHub profile under the HDPCD repository. The name of this file is “61_hive_create_table_for_subquery.sql” and you can save/download it by clicking here. It looks as follows.

The output of the above command is as follows.

Step 3: Creating Hive table with matching schema

As you can see from the above screenshot, the OK statement appears at the end of the command execution.

Now, let us confirm the existence of the Hive table post49.

  • CONFIRM THAT THE HIVE TABLE GOT CREATED SUCCESSFULLY

We use the “show tables;” command to check the existence of the Hive table.

show tables;

The output of the above command is as follows.

Step 4: Confirming the existence of newly created Hive table

The output shown in the above screenshot confirms that the Hive table was created successfully.

The next step is to check that this Hive table follows the same schema as the input CSV file.

  • CONFIRM THAT THE HIVE TABLE SCHEMA MATCHES THE INPUT CSV FILE

We use the “desc formatted post49;” command to check the formatted schema of the Hive table post49.

desc formatted post49;

The output of the above command is as follows.

Step 5: Checking the schema of the Hive table

The output shown in the above screenshot suggests the Hive table follows the same schema as the input CSV file.

Now, we are good to load the input file into the Hive table.

  • LOAD THE INPUT CSV FILE INTO THE HIVE TABLE

We use the “LOAD” command for loading the input CSV file into the Hive table.

load data local inpath 'post49.csv' into table post49;

The output of the above command looks as follows.

Step 6: Loading the input CSV file in the Hive table

From the above output screenshot, it looks like the file was loaded successfully into the Hive table.

We will go one step further and check the actual records in the Hive table post49.

  • CONFIRM THAT DATA GOT LOADED SUCCESSFULLY INTO THE HIVE TABLE

We use the “SELECT” command to check the records stored in the Hive table post49.

select * from post49;

The output of the above command is as follows.

Step 7: Confirming that data got loaded into Hive table successfully

The above output screenshot confirms that the file was successfully loaded and we got the exact number of records loaded into the Hive table.

This enables us to examine the subquery which we are going to use in the Hive query.

Let us have a look at that subquery.

  • CHECK THE QUERY WHICH WILL BE USED AS A SUBQUERY

The subquery which we are going to use as a part of the Hive query is as follows.

select DISTINCT(id) from post41;

The above query returns all the unique IDs present in the Hive table post41.

The output of the above query is shown in the following screenshot to better explain the Hive query result.

Step 8: Checking the inner query execution output

As you can see from the above screenshot, the output of this query returns a bunch of numbers, which are the ID fields in the Hive table post41.

We are going to use this query as a part of another Hive query as a subquery.

Let us see how we are going to do that.

  • EXECUTE THE SUBQUERY WITHIN A HIVE QUERY

The Hive query which uses the above query as a subquery is as follows.

select * from post49 where order_id IN (select DISTINCT(id) from post41);

The objective of the above Hive query is to return only those records in Hive table post49 for which there exists an ID in the Hive table post41.

To say more about subqueries, they work as replacements for JOIN operations to some extent. In general, JOINs perform better than subqueries, which is why most real-world scenarios use JOIN operations instead of subqueries.
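As a sketch of that point, the IN-subquery above can be rewritten using LEFT SEMI JOIN, the classic Hive idiom for this kind of existence check (same tables and columns as above):

```sql
-- Equivalent existence check expressed as a join instead of a subquery.
-- LEFT SEMI JOIN returns the rows of post49 that have at least one
-- matching id in post41, without duplicating rows for multiple matches.
SELECT p49.*
FROM post49 p49
LEFT SEMI JOIN post41 p41
  ON (p49.order_id = p41.id);
```

Note that in a LEFT SEMI JOIN, columns of the right-hand table (post41 here) may only appear in the ON clause, not in the SELECT list.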

Now, if the subquery usage is clear, let us have a look at the output generated by the above subquery.

Step 9: Running the Hive subquery within a query

As you can see from the above screenshot, the above subquery returned only 9 records out of the total of 30 records.

One thing to notice here is that this query internally runs a MapReduce job and the output is generated by that MapReduce job.

The Hive subquery returns the expected output.

This enables us to safely say that the objective of this tutorial is met and we can conclude this tutorial here.

In the next tutorial, we are going to see how to order the output of a Hive query across multiple reducers.

If you liked the content, share it with your friends. Please hit the Like button.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

Post 48 | HDPCD | Printing the execution plan of a Hive query

Hello, everyone. Welcome to one more tutorial in the HDPCD certification series.

In the last tutorial, we saw how to enable vectorization in Hive.

In this tutorial, we are going to see how to print the execution plan of a Hive query.

Let us begin, then.

This is one of the simplest tutorials in this certification series. In this tutorial, we are going to see the EXPLAIN command which is used for printing out the execution plan of a Hive query.

The execution plan of a Hive query is useful while debugging that query. If the Hive query does not return the expected output, then you can run the EXPLAIN command followed by the Hive query to print the execution plan of that query.

The output of the EXPLAIN command is a step-by-step process of what operations are performed during the Hive query execution. These operations are shown in the reverse order starting from the last operation till the first one.

One of the examples of the EXPLAIN command is as follows.

explain select count(*) from post41;

The output of the above command is shown in the following screenshot.

EXPLAIN command to print the execution plan of a query

The output shown in the above screenshot shows that the query execution starts from the bottom, shown as Arrow 1, and ends at the GROUP BY operation, shown as Arrow 5.

The same type of output is shown for all the queries depending on the operations performed for executing that Hive query.

This enables us to safely say that the objective of this tutorial is met and we can conclude this tutorial here.

In the next tutorial, we are going to see how to run a subquery inside a Hive query.

If you liked the content, share it with your friends. Please hit the Like button.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

Post 47 | HDPCD | Run a Hive query using Vectorization

Hello, everyone. Welcome to one more tutorial in the HDPCD certification series.

In the last tutorial, we saw how to run a Hive query using the Tez execution engine.

In this tutorial, we are going to see how to run a Hive Query using Vectorization.

Let us begin, then.

Before starting off with the objective of this tutorial, let us discuss why we use the Vectorization in Hive.

One of the most important reasons we enable Vectorization in Hive is to reduce the time that a complex query takes to produce its output. In the HDP Sandbox, like the Tez execution engine, it is enabled by default.

We will now discuss the step-by-step process to perform this objective.

  • CHECKING THE DEFAULT VALUE OF THE HIVE VECTORIZATION

As soon as you log into the Hive session, you can see that Hive Vectorization is ENABLED by default.

We use the following command to check the value of the Hive Vectorization.

set hive.vectorized.execution.enabled;

The output of the above query is as follows.

Step 1: Checking the default value of the Hive Vectorization Mode

As you can see from the above screenshot, Hive Vectorization is enabled by default in the Hive session.

Now, let us try to DISABLE it by setting the Hive Vectorization flag as FALSE.

  • SETTING THE HIVE VECTORIZATION TO FALSE

We use the following command to set the Hive Vectorization flag to FALSE.

set hive.vectorized.execution.enabled=false;

Once the Hive Vectorization flag is set to FALSE, we can use the above-mentioned command to check the new value of the Hive Vectorization.

set hive.vectorized.execution.enabled;

The output of the above commands is as follows.

Step 2: Setting the Hive Vectorization to FALSE

The above screenshot indicates that the Hive Vectorization was set to FALSE.

Let us check what happens by disabling Hive Vectorization.

  • CHECKING THE EFFECT OF THE DISABLED HIVE VECTORIZATION

We are going to run a complex Hive query to check how much time it takes with the Hive Vectorization flag DISABLED.

The complex Hive query is as follows.

select * from post41 order by id desc;

The output of the above query is as follows.

Step 3: Checking the effect of disabled Hive Vectorization

The above screenshot indicates that the query took around 70 seconds with Hive Vectorization flag set to FALSE.

Let us change this Hive Vectorization flag to TRUE, now.

  • SETTING THE HIVE VECTORIZATION TO TRUE

We use the following command to set the Hive Vectorization flag to TRUE.

set hive.vectorized.execution.enabled=true;

Once this command is set, we can check the Hive Vectorization flag value by running the following command.

set hive.vectorized.execution.enabled;

The output of the above commands is as follows.

Step 4: Setting the Hive Vectorization to TRUE

The above screenshot shows that the value of Hive Vectorization flag was set to TRUE.

Let us run the same Hive complex query to check how much time it takes to produce the same output.

  • CHECKING THE EFFECT OF THE ENABLED HIVE VECTORIZATION

We are going to run the following Hive complex query, as we have already run with the Vectorization flag set to FALSE.

select * from post41 order by id desc;

The output of the above command is as follows.

Step 5: Checking the effect of enabled Hive Vectorization

As you can see from the above screenshot, the time it takes to produce the same output is around 25 seconds. By enabling the Hive Vectorization flag, we saved more than 50% of the time to run a complex Hive query.

This is one of a few examples that convey the importance of Hive Vectorization.

We can conclude this tutorial here. In the next tutorial, we are going to see how to print the execution plan of a Hive query.

Till then, stay tuned and keep on sharing the contents.

We are now just four more posts away from completing all the tutorials.

I hope you guys like the content.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

Post 46 | HDPCD | Run a Hive Query using Tez Execution Engine

Hello, everyone. Welcome to one more tutorial in the HDPCD certification series.

In the last tutorial, we saw how to perform a JOIN operation between two Hive tables.

In this tutorial, we are going to see how to run a Hive query using the Tez execution engine.

Let us begin, then.

Apache Hive: Setting the execution engine to Tez

As you can see from the above infographics, this process includes some housekeeping activities, which might not always be necessary, but it is good to check the output of those commands before moving forward.

Having said that, we will follow the below steps to execute this objective.

  • CHECKING THE DEFAULT EXECUTION ENGINE FOR HDP SANDBOX

As soon as you log into Hive with the help of the “hive” command, you can check the default value of the execution engine for the HDP Sandbox.

We use the following command for checking the value of the execution engine.

set hive.execution.engine;

The output of the above command is as follows.

Step 1: Checking the default execution engine for HDP Sandbox

As can be seen, the default value of the execution engine is tez in the HDP Sandbox. This indicates that we do not need to do anything to perform this objective, but we will not stop there.

We will first change the default value of the execution engine to MapReduce and then change it back to Tez.

  • SETTING THE EXECUTION ENGINE TO MAPREDUCE

We use the “set” command to set the execution engine in Hive.

set hive.execution.engine=mr;

Notice that “mr” in the above command indicates the MapReduce execution engine.

We use the following command to check the effect of the above command.

set hive.execution.engine;

The output of the above two commands is as follows.

Step 2: Changing the execution engine to MapReduce

As you can see, the above screenshot indicates that the value of the execution engine was changed to MapReduce.

Let us confirm this change.

  • CONFIRMING THAT THE EXECUTION ENGINE CHANGED TO MAPREDUCE

We are going to run a complex command to check which execution engine is being used by Hive.

Whenever we run a complex query in Hive, that query is executed by the execution engine and hence we can figure out which execution engine is used by Hive to run that command.

We are going to run the following command in Hive.

select * from post41 order by id desc;

The output of the above command is as follows.

Step 3: Confirming the execution engine was changed to MapReduce

As you can see, the above command triggers a MapReduce job, evident from the prompts given by the command output.

This confirms that we were able to change the execution engine in Hive.

Now, let us change it back to Tez.

  • SETTING THE EXECUTION ENGINE TO TEZ

We use the following command to set the Hive execution engine to Tez.

set hive.execution.engine=tez;

And the following command is used for checking the effect of the above command.

set hive.execution.engine;

The output of the above commands is as follows.

Step 4: Changing the execution engine to Tez

The above screenshot confirms that the Hive execution engine value was changed to Tez from the existing value of MapReduce.

Let us confirm this change.

  • CONFIRMING THAT THE EXECUTION ENGINE CHANGED TO TEZ

As already stated, we use the same complex Hive query to check the effect of the execution engine.

select * from post41 order by id desc;

The output of the above command is as follows.

Step 5: Confirming the execution engine was changed to Tez

You can clearly see the difference between the outputs of the two execution engines. The above screenshot shows that Hive used the Tez components to execute the above-mentioned query, which completes the objective of this tutorial.

We can conclude this tutorial here. In the next tutorial, we are going to see how to execute a Hive query using Vectorization.

Till then, stay tuned and keep on sharing the contents.

We are now just five more posts away from completing all the tutorials.

I hope you guys like the content.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

Post 45 | HDPCD | Join two Hive tables

Hello, everyone. Welcome to one more tutorial in the HDPCD certification series.

In the last tutorial, we saw how to insert a new row into a Hive table.

In this tutorial, we are going to see how to join two Hive tables.

Let us begin, then.

Apache Hive: Performing join operation

The above infographics show the step-by-step process of performing the JOIN operation in Apache Hive.

Let’s start doing the JOIN, then.

  • CREATING INPUT CSV FILES IN THE LOCAL FILE SYSTEM

We use the vi editor to create input CSV files in the local file system.

Here, we are going to create two input files, since we are joining two tables in this tutorial.

The first file, named post45_a.csv, contains the customer information, such as the customer ID, customer name, and customer gender.

I have uploaded this input CSV file to my GitHub profile under the HDPCD repository. The name of this file is “56_first_input_file_for_join.csv”. You can download/save this file by clicking here and it looks as follows.

You can use the following commands for creating this file.

vi post45_a.csv

###############################
# PASTE COPIED CONTENTS HERE #
###############################

cat post45_a.csv

The output of the above commands looks as follows.

Step 1: Creating first input file in the local file system

As you can see from the above screenshot, input CSV file post45_a.csv was created successfully in the local file system.

We will follow the same process for creating the second input CSV file.

The name of this file is post45_b.csv and it contains the order information by the customers. It contains fields such as customer ID, timestamp, order amount, and the order status.

I have uploaded this input CSV file to my GitHub profile under the HDPCD repository. The name of this file is “57_second_input_file_for_join.csv”. You can download/save this file by clicking here and it looks as follows.

You can use the following commands for creating this file.

vi post45_b.csv

###############################
# PASTE COPIED CONTENTS HERE #
###############################

cat post45_b.csv

The output of the above commands looks as follows.

Step 2: Creating second input file in the local file system

As you can see from the above screenshot, input CSV file post45_b.csv was created successfully in the local file system.

Now, we will move on to the Hive operations.

  • CHECKING PRE-EXISTENCE OF THE HIVE TABLES

We use the “show tables;” command to get the list of tables in the Hive database. This list reveals whether our tables already exist or not.

show tables;

The output of the above command is as follows.

Step 3: Checking pre-existence of the Hive table

As you can see from the above screenshot, Hive does not contain tables with the names post45_a and post45_b. Therefore, we can go ahead and create those tables with the same schema as the input CSV files.

  • CREATING TABLES IN ACCORDANCE WITH THE INPUT CSV FILES

We are going to create the two tables as we want to perform the JOIN operation.

The first table is post45_a, containing customer information and having three columns.

I have uploaded this SQL file to my GitHub profile under the HDPCD repository. The name of this file is “58_first_hive_table_for_join.sql”. You can download/save this file by clicking here and it looks as follows.

The output of the above commands looks as follows.

Step 4: Creating the first Hive table with customer information

As you can see from the above screenshot, the Hive table post45_a was created successfully. We will follow the same process for creating the other table post45_b.

The second table is post45_b, containing order information and having four columns.

I have uploaded this SQL file to my GitHub profile under the HDPCD repository. The name of this file is “59_second_hive_table_for_join.sql”. You can download/save this file by clicking here, and it looks as follows.

The output of the above commands looks as follows.

Step 5: Creating second table with order information

As you can see from the above screenshot, the Hive table post45_b was created successfully. Now, it is time to confirm that the Hive tables were created, and that too with the correct schema.
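The two DDL scripts are linked above rather than inlined, so here is a rough sketch only of what the schemas map to. The column names are assumptions inferred from the field descriptions and the join query later in this post; Hive-specific clauses such as ROW FORMAT DELIMITED and STORED AS are omitted, and SQLite stands in for Hive purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# post45_a: customer information (three columns)
conn.execute("create table post45_a (id integer, name text, gender text)")

# post45_b: order information (four columns); the first column holds the
# customer ID of the ordering customer, matching the join condition used later
conn.execute("""create table post45_b (
    order_id integer,
    order_date text,
    order_amount real,
    order_status text
)""")

# Mimic "show tables;"
tables = [r[0] for r in conn.execute(
    "select name from sqlite_master where type='table'")]
print(tables)  # → ['post45_a', 'post45_b']
```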

  • CONFIRM TABLE EXISTENCE AND SCHEMA

We use the same “show tables;” command to check the existence of the Hive tables.

show tables;

The output of the above command is as follows.

Step 6: Confirming that the Hive tables got created successfully

As you can see from the above screenshot, both post45_a and post45_b tables were created successfully.

Let us check their schema as well. We use the following commands for doing this.

desc formatted post45_a;
desc formatted post45_b;

The output of the above commands is as follows.
Step 7: Checking the schema of the customer information table

The above screenshot shows that the Hive table post45_a was created with the correct schema and it is safe to do the data load operation.

 

Step 8: Checking the schema of the order information table

The above screenshot shows that the Hive table post45_b was created with the correct schema and it is safe to do the data load operation.

Let us load the data in these Hive tables, now.

  • LOAD INPUT CSV FILES IN HIVE TABLES

We use the “LOAD” command for loading the data into the Hive tables.

load data local inpath 'post45_a.csv' into table post45_a;

The output of the above command is as follows.
Step 9: Loading the data into the customer information table

The above screenshot shows that the data was loaded successfully into the Hive table post45_a.

load data local inpath 'post45_b.csv' into table post45_b;

The output of the above command is as follows.
Step 10: Loading the data into the order information table

The above screenshot shows that the data was loaded successfully into the Hive table post45_b.

Let us check records stored in these Hive tables.

  • CONFIRM DATA GOT LOADED SUCCESSFULLY

We use the “SELECT” command to check the records in the Hive tables.

select * from post45_a;

The output of the above command is as follows.
Step 11: Checking the data got loaded successfully into the customer information table

The above screenshot indicates that the data was loaded successfully into the Hive table post45_a and there was no data loss.

And for the second table,

select * from post45_b;

The output of the above command is as follows.
Step 12: Checking the data got loaded successfully into the order information table

The above screenshot indicates that the data was loaded successfully into the Hive table post45_b and there was no data loss.

This brings us to the last step in this tutorial and that is to perform the JOIN operation.

  • PERFORM THE JOIN OPERATION

We use the “JOIN” keyword for performing the JOIN operation. Based on the join type, you can use INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, and FULL OUTER JOIN keywords. For more information, please click here.

In this case, we are performing the JOIN operation based on the CUSTOMER ID, which is the first column in both tables.

Our objective is to extract order details of only those customers whose data is present in the customer (post45_a) table. Therefore, we should get order details of customers with Customer ID from 1 to 9 (both inclusive).

The command to do this is as follows.

select a.name,a.gender,b.order_date,b.order_status from post45_a a join post45_b b on (a.id = b.order_id);

The output of the above command is as follows.
Step 13: Performing the JOIN operation between Hive tables - customer and order

As you can see, the above “JOIN” query triggers a MapReduce/Tez job.

We get the result once the query executes successfully; the output is shown in the above screenshot.

The customers with customer IDs from 1 to 9 are shown in the screenshot, as expected, along with their order details.
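The filtering behaviour of this inner join can be reproduced in miniature with Python's built-in sqlite3 module, standing in for Hive. The sample rows are assumptions; the point is that only orders whose customer ID matches an id in post45_a survive the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table post45_a (id integer, name text, gender text)")
conn.execute("create table post45_b (order_id integer, order_date text, "
             "order_amount real, order_status text)")

conn.executemany("insert into post45_a values (?,?,?)",
                 [(1, "Tom", "M"), (2, "Jerry", "M")])
conn.executemany("insert into post45_b values (?,?,?,?)",
                 [(1, "2013-07-25", 99.99, "CLOSED"),
                  (2, "2013-07-26", 49.99, "PENDING"),
                  (42, "2013-07-27", 10.00, "CANCELED")])  # no matching customer

# Same shape as the Hive query: inner join on the customer ID
result = conn.execute(
    "select a.name, a.gender, b.order_date, b.order_status "
    "from post45_a a join post45_b b on a.id = b.order_id").fetchall()
print(result)
# Customer 42 has no row in post45_a, so its order is dropped from the result
```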

This enables us to safely say that the objective of this tutorial is met and we can conclude this tutorial here.

In the next tutorial, we are going to see how to run a Hive query using TEZ as the execution engine.

If you liked the content, share it with your friends. Please hit the Like button.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

 

 

Post 44 | HDPCD | Insert a row in the Hive table

Hello, everyone. Welcome to one more tutorial in the HDPCD certification series.

In the last tutorial, we saw how to delete a row in a Hive table.

In this tutorial, we are going to see how to insert a row in the Hive table.

It is quite interesting to see that Hive supports ACID operations now, though the data is stored in HDFS. This is what excites me.

Let us begin, then.

Apache Hive: Insert a new row in Hive table

The above infographic shows the step-by-step process of implementing the objective of this tutorial.

We will perform these steps in the following way.

  • CHECK EXISTENCE OF THE HIVE TABLE

We use the “show tables;” command to check the existence of the Hive table.

show tables;

The output of the above command looks as follows.

Step 1: Checking the pre-existence of hive table post44

As you can see, the Hive table post44 does not exist in the “default” database. This confirms that we can go ahead and create the Hive table with name “post44”.

  • CREATE HIVE TABLE WITH TRANSACTIONAL PROPERTY = TRUE

We use the “CREATE” command to create the Hive table. Here, in this tutorial, we are looking to insert a new row into the Hive table. Therefore, we have to take an extra measure of setting a table property to make this Hive table a transactional table. This Hive table is created with the help of the following command.

create table post44 (
id int,
name string,
gender string
)
clustered by (id) into 4 buckets
row format delimited
fields terminated by ','
stored as orc
tblproperties('transactional' = 'true');

As you can see from the above code snippet, we have taken the following extra measures while creating this transactional Hive table.

  • The table is bucketed into 4 buckets on the “id” column
  • The table is made to store the data in ORC file format
  • The table property transactional is set to TRUE to make it a transactional table

If you miss any of the above properties, you won’t be able to insert a new row in the Hive table.

The output of the above command looks as follows.

Step 2: Creating hive table post44

As you can see, the “OK” message indicates that the Hive table was created successfully.

Now, we can confirm the existence of the Hive table post44.

  • CONFIRM EXISTENCE AND SCHEMA OF THE HIVE TABLE

We use the same “show tables;” command to confirm the existence of the Hive table.

show tables;

The output of the above command looks as follows.

Step 3: Confirming the existence of hive table post44

As you can see from the above screenshot, the Hive table post44 was created successfully. We will go one step further and check the schema and other properties of the Hive table post44.

We use the following command for doing this.

desc formatted post44;

The output of the above command looks as follows.

Step 4: Confirming the schema of hive table post44

As you can see from the above screenshot, the table contains three columns as defined. It has the ORCInputFormat as the file input format and the transactional property is set to TRUE.

As all these properties are satisfied, it is time to populate the Hive table post44 with some sample records.

  • POPULATE HIVE TABLE WITH SAMPLE RECORDS

We use the “INSERT” command to insert the records into the Hive transactional table.

insert into post44 select * from post41;

As you can see from the above command, we are going to load the records from the Hive table post41 into the Hive table post44.

The output of the above command looks like this.

Step 5: Inserting records into table post44

As you can see from the above screenshot, the Hive command triggers a MapReduce job to load the data into the Hive table post44.

In the end, the “OK” message is shown, which indicates that the records got inserted successfully into the Hive table post44. Apart from this, basic stats about the size of the data loaded into the Hive table post44 are shown.

This enables us to confirm the existence of the inserted records.

  • CONFIRM RECORDS GOT INSERTED SUCCESSFULLY

We use the “SELECT” command to check the records in the Hive table post44.

select * from post44;

The output of the above command looks as follows.

Step 6: Confirming records got inserted into table post44 successfully

As can be seen from the above screenshot, a total of 9 records were successfully inserted into the Hive table post44.

Now, let us add one new row in the Hive table post44.

  • INSERT A NEW ROW IN THE HIVE TABLE

We use the “INSERT” command to insert a new record into the Hive table.

insert into POST44 values (10,'Rahul','M');

In the above command, we are trying to add a new row in the Hive table with id=10, name=“Rahul”, and gender=“M”.

Therefore, to set the expectations right, once we execute the above command, the record with id=10 should be an additional record in the Hive table post44 and we should have a total of 10 records in the Hive table post44.
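This expectation can be sanity-checked with a quick sketch against Python's sqlite3 module, which stands in for Hive here; the nine seed rows and their names are assumptions, standing in for the rows copied from post41.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table post44 (id integer, name text, gender text)")

# Seed nine hypothetical sample records (ids 1 through 9)
conn.executemany("insert into post44 values (?,?,?)",
                 [(i, f"student_{i}", "M") for i in range(1, 10)])

# Insert the tenth row, matching the Hive command above
conn.execute("insert into post44 values (10, 'Rahul', 'M')")

count = conn.execute("select count(*) from post44").fetchone()[0]
print(count)  # → 10
```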

Let us see what the output of the above command looks like.

Step 7: Performing the INSERT operation on hive table post44

From the screenshot above, you can see that the “INSERT” command triggers a MapReduce operation. The “OK” message shows that this operation was successful.

The last thing that we would like to do is to check the records again to see whether the changes are reflected.

  • CONFIRM THAT RECORDS GOT INSERTED SUCCESSFULLY

We use the “SELECT” command to check the records in the Hive table post44 once more.

select * from post44;

The output of the above command is as follows.

Step 8: Confirming the INSERT command execution output

A total of 10 records can be seen from the above screenshot. Please pay special attention to the record with id=10. As you can see, the record with id=10 is inserted into the table after executing the “INSERT” command.

This meets our expectations and therefore, we can conclude this tutorial here.

Hope you guys like it.

Please stay tuned for more updates.

In the next tutorial, we are going to see how to join two Hive tables.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

 

Post 43 | HDPCD | Delete a row in a Hive table

Hello, everyone. Welcome to one more tutorial in the HDPCD certification series.

In the last tutorial, we saw how to update a row in a Hive table.

In this tutorial, we are going to see how to delete a row in the Hive table.

It is quite interesting to see that Hive supports ACID operations now, though the data is stored in HDFS. This is what excites me.

Let us begin, then.

Apache Hive: Delete a row in Hive table

The above infographic shows the step-by-step process of implementing the objective of this tutorial.

We will perform these steps in the following way.

  • CHECK EXISTENCE OF THE HIVE TABLE

We use the “show tables;” command to check the existence of the Hive table.

show tables;

The output of the above command looks as follows.

Step 1: Checking the pre-existence of hive table post43

As you can see, the Hive table post43 does not exist in the “default” database. This confirms that we can go ahead and create the Hive table with name “post43”.

  • CREATE HIVE TABLE WITH TRANSACTIONAL PROPERTY = TRUE

We use the “CREATE” command to create the Hive table. Here, in this tutorial, we are looking to delete the records stored in the Hive table. Therefore, we have to take an extra measure of setting a table property to make this Hive table as a transactional table. This Hive table is created with the help of the following command.

create table post43 (
id int,
name string,
gender string
)
clustered by (id) into 4 buckets
row format delimited
fields terminated by ','
stored as orc
tblproperties('transactional' = 'true');

As you can see from the above code snippet, we have taken the following extra measures while creating this transactional Hive table.

  • The table is bucketed into 4 buckets on the “id” column
  • The table is made to store the data in ORC file format
  • The table property transactional is set to TRUE to make it a transactional table

If you miss any of the above properties, you won’t be able to delete a row in the Hive table.

The output of the above command looks as follows.

Step 2: Creating hive table post43

As you can see, the “OK” message indicates that the Hive table was created successfully.

Now, we can confirm the existence of the Hive table post43.

  • CONFIRM EXISTENCE AND SCHEMA OF THE HIVE TABLE

We use the same “show tables;” command to confirm the existence of the Hive table.

show tables;

The output of the above command looks as follows.

Step 3: Confirming the existence of hive table post43

As you can see from the above screenshot, the Hive table post43 was created successfully. We will go one step further and check the schema and other properties of the Hive table post43.

We use the following command for doing this.

desc formatted post43;

The output of the above command looks as follows.

Step 4: Confirming the schema of hive table post43

As you can see from the above screenshot, the table contains three columns as defined. It has the ORCInputFormat as the file input format and the transactional property is set to TRUE.

As all these properties are satisfied, it is time to populate the Hive table post43 with some sample records.

  • POPULATE HIVE TABLE WITH SAMPLE RECORDS

We use the “INSERT” command to insert the records into the Hive transactional table.

insert into post43 select * from post41;

As you can see from the above command, we are going to load the records from the Hive table post41 into the Hive table post43.

The output of the above command looks like this.

Step 5: Inserting records into table post43

As you can see from the above screenshot, the Hive command triggers a MapReduce job to load the data into the Hive table post43.

In the end, the “OK” message is shown, which indicates that the records got inserted successfully into the Hive table post43. Apart from this, basic stats about the size of the data loaded into the Hive table post43 are shown.

This enables us to confirm the existence of the inserted records.

  • CONFIRM RECORDS GOT INSERTED SUCCESSFULLY

We use the “SELECT” command to check the records in the Hive table post43.

select * from post43;

The output of the above command looks as follows.

Step 6: Confirming records got inserted into table post43 successfully

As can be seen from the above screenshot, a total of 9 records were successfully inserted into the Hive table post43. Please pay special attention to the record with id=3, because we are going to delete that record.

Now, let us delete one of the inserted records from the Hive table.

  • DELETE RECORDS IN THE HIVE TABLE USING WHERE CLAUSE

We use the “DELETE” command to delete records stored in the Hive table.

delete from post43 where id=3;

In the above command, we are trying to delete the record of the student with id=3.

Therefore, to set the expectations right, once we execute the above command, the record with id=3 should be gone from the existing Hive table post43 and we should have only 8 records in the Hive table post43.
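To make this expectation concrete, here is the same operation sketched against Python's sqlite3 module, standing in for Hive; the nine seed rows are assumptions, standing in for the rows copied from post41.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table post43 (id integer, name text, gender text)")

# Seed nine hypothetical sample records (ids 1 through 9)
conn.executemany("insert into post43 values (?,?,?)",
                 [(i, f"student_{i}", "M") for i in range(1, 10)])

# Delete the record with id=3, matching the Hive command above
conn.execute("delete from post43 where id=3")

remaining = [r[0] for r in conn.execute("select id from post43 order by id")]
print(remaining)  # → [1, 2, 4, 5, 6, 7, 8, 9]
```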

Let us see what the output of the above command looks like.

Step 7: Performing the DELETE operation on hive table post43

From the screenshot above, you can see that the “DELETE” command triggers a MapReduce operation. The “OK” message shows that this operation was successful.

The last thing that we would like to do is to check the records again to see whether the changes are reflected.

  • CONFIRM THAT RECORDS GOT DELETED SUCCESSFULLY

We use the “SELECT” command to check the remaining records in the Hive table post43.

select * from post43;

The output of the above command is as follows.

Step 8: Confirming the DELETE command execution output

A total of 8 records can be seen from the above screenshot. Please pay special attention to the record with id=3. As you can see, the record with id=3 is deleted from the table after executing the “DELETE” command.

This meets our expectations and therefore, we can conclude this tutorial here.

Hope you guys like it.

Please stay tuned for more updates.

In the next tutorial, we are going to see how to insert a new row in the Hive table.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

Post 42 | HDPCD | Update a row in a Hive table

Hello, everyone. Welcome to one more tutorial in the HDPCD certification series.

In the last tutorial, we saw how to load the compressed data into a Hive table.

In this tutorial, we are going to see how to update a row in the Hive table.

It is quite interesting to see that Hive supports ACID operations now, though the data is stored in HDFS. This is what excites me.

Let us begin, then.

Apache Hive: Update a row in Hive table

The above infographic shows the step-by-step process of implementing the objective of this tutorial.

We will perform these steps in the following way.

  • CHECK EXISTENCE OF THE HIVE TABLE

We use the “show tables;” command to check the existence of the Hive table.

show tables;

The output of the above command looks as follows.

Step 1: Checking the pre-existence of hive table post42

As you can see, the Hive table post42 does not exist in the “default” database. This confirms that we can go ahead and create the Hive table with name “post42”.

  • CREATE HIVE TABLE WITH TRANSACTIONAL PROPERTY = TRUE

We use the “CREATE” command to create the Hive table. Here, in this tutorial, we are looking to update the records stored in the Hive table. Therefore, we have to take an extra measure of setting a table property to make this Hive table as a transactional table. This Hive table is created with the help of the following command.

create table post42 (
id int,
name string,
gender string
)
clustered by (id) into 4 buckets
row format delimited
fields terminated by ','
stored as orc
tblproperties('transactional' = 'true');

As you can see from the above code snippet, we have taken the following extra measures while creating this transactional Hive table.

  • The table is bucketed into 4 buckets on the “id” column
  • The table is made to store the data in ORC file format
  • The table property transactional is set to TRUE to make it a transactional table

If you miss any of the above properties, you won’t be able to update a row in the Hive table.

The output of the above command looks as follows.

Step 2: Creating hive table post42

As you can see, the “OK” message indicates that the Hive table was created successfully.

Now, we can confirm the existence of the Hive table post42.

  • CONFIRM EXISTENCE AND SCHEMA OF THE HIVE TABLE

We use the same “show tables;” command to confirm the existence of the Hive table.

show tables;

The output of the above command looks as follows.

Step 3: Confirming the existence of hive table post42

As you can see from the above screenshot, the Hive table post42 was created successfully. We will go one step further and check the schema and other properties of the Hive table post42.

We use the following command for doing this.

desc formatted post42;

The output of the above command looks as follows.

Step 4: Confirming the schema of hive table post42

As you can see from the above screenshot, the table contains three columns as defined. It has the ORCInputFormat as the file input format and the transactional property is set to TRUE.

As all these properties are satisfied, it is time to populate the Hive table post42 with some sample records.

  • POPULATE HIVE TABLE WITH SAMPLE RECORDS

We use the “INSERT” command to insert the records into the Hive transactional table.

insert into post42 select * from post41;

As you can see from the above command, we are going to load the records from the Hive table post41 into the Hive table post42.

The output of the above command looks like this.

Step 5: Inserting records into table post42

As you can see from the above screenshot, the Hive command triggers a MapReduce job to load the data into the Hive table post42.

In the end, the “OK” message is shown, which indicates that the records got inserted successfully into the Hive table post42. Apart from this, basic stats about the size of the data loaded into the Hive table post42 are shown.

This enables us to confirm the existence of the inserted records.

  • CONFIRM RECORDS GOT INSERTED SUCCESSFULLY

We use the “SELECT” command to check the records in the Hive table post42.

select * from post42;

The output of the above command looks as follows.

Step 6: Confirming records got inserted into table post42 successfully

As can be seen from the above screenshot, a total of 9 records were successfully inserted into the Hive table post42. Please pay special attention to the record with id=2, because we are going to update that record.

Now, let us update the already inserted records in the Hive table.

  • UPDATE RECORDS IN THE HIVE TABLE USING WHERE CLAUSE

We use the “UPDATE” command to update the records stored in the Hive table.

update post42 set name='Milind' where id=2;

In the above command, we are trying to update the name of the student with id=2. The old name is “Jerry” and the new name is “Milind”.

Therefore, to set the expectations right, once we execute the above command, the record with id=2 should have name=”Milind”.
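To make this expectation concrete, here is the same operation sketched against Python's sqlite3 module, standing in for Hive; the seed rows are assumptions, except that id=2 is named “Jerry” as described in this post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table post42 (id integer, name text, gender text)")

# Seed hypothetical sample records; only id=2 matters for this example
conn.executemany("insert into post42 values (?,?,?)",
                 [(1, "Tom", "M"), (2, "Jerry", "M"), (3, "Spike", "M")])

# Update the name of the record with id=2, matching the Hive command above
conn.execute("update post42 set name='Milind' where id=2")

name = conn.execute("select name from post42 where id=2").fetchone()[0]
print(name)  # → Milind
```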

Let us see what the output of the above command looks like.

Step 7: Performing the UPDATE operation on hive table post42

From the screenshot above, you can see that the “UPDATE” command triggers a MapReduce operation. The “OK” message shows that this operation was successful.

The last thing that we would like to do is to check the records again to see whether the changes are reflected.

  • CONFIRM THAT RECORDS GOT UPDATED SUCCESSFULLY

We use the “SELECT” command to check the updated records in the Hive table post42.

select * from post42;

The output of the above command is as follows.

Step 8: Confirming the UPDATE command execution output

A total of 9 records can be seen from the above screenshot. Please pay special attention to the record with id=2. As you can see, the old name “Jerry” gets overwritten by the new name “Milind”.

This meets our expectations and therefore, we can conclude this tutorial here.

Hope you guys like it.

Please stay tuned for more updates.

In the next tutorial, we are going to see how to delete a row in the Hive table.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.