Hello, everyone. Thanks for coming back! I hope the tutorials are inspiring you to take each task seriously and to perform each operation with an understanding of why we are doing it.
In the last tutorial, we saw how to create a Pig Relation with a defined schema. This tutorial is also about creating a Pig Relation, but instead of loading data from a flat file, the data will be loaded from an existing Hive table. So, let us take a quick look at the steps involved in performing this operation.
The picture below gives you a clear idea of the steps we are going to follow to load data from Apache Hive into Apache Pig.
The first thing we are going to do is check the Hive table and its schema. Based on the schema, we will have an idea of the structure of the imported Pig Relation. We are going to log into the Hive terminal and look for the schema of the products table, as it is the table whose data we would like to import into the Pig Relation.
We use the following command to log into Hive and look for the products table.
hive
Once you are in the Hive terminal, you can run the following command to get the list of tables.
show tables;
For your reference, the following screenshot shows the output of the commands above.
Once we confirm that the products table exists in the Hive database, let us look at its structure. For this, we can use the describe and select commands shown below.
describe products;
select * from products limit 10;
The following screenshot shows how the above two commands run.
Once we know the data structure and have seen some sample data, it is time to write a Pig script which will import this Hive data into a Pig Relation.
The script file to load data from Apache Hive into Apache Pig is uploaded to my GitHub profile, and it looks as follows.
--execute this file by using the -useHCatalog flag
--the flag "-useHCatalog" enables Pig to pick up the jars for HCatalog
--HCatalog is used for loading data from Apache Hive into Pig

--loading the data in the "products" hive table into a Pig Relation
hive_data = LOAD 'products' USING org.apache.hive.hcatalog.pig.HCatLoader();

--looking at the structure of the data
DESCRIBE hive_data;

--dumping the data on the terminal window
DUMP hive_data;
Now, let us go through each line to understand what is going on here.
hive_data = LOAD 'products' USING org.apache.hive.hcatalog.pig.HCatLoader();
EXPLANATION: The line above contains the meat of this tutorial's objective. It loads the data from the products table in Hive into a Pig Relation called hive_data. As you can see, there is a class involved in this import operation; its fully qualified name is "org.apache.hive.hcatalog.pig.HCatLoader".
The above-mentioned class resides in one of the jar files in the HCatalog directory, and when you run the above command, that jar file is used to execute the operation. This is the sole reason we run the post12.pig file with the -useHCatalog flag.
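The same flag applies when working interactively: if you start the Grunt shell with -useHCatalog, the HCatalog jars are on the classpath and you can try the load statement by hand before putting it in a script. A minimal sketch, assuming the products table from this tutorial:

```shell
# start the interactive Grunt shell with the HCatalog jars picked up
pig -useHCatalog

# then, at the grunt> prompt, the same statements from post12.pig work interactively:
#   hive_data = LOAD 'products' USING org.apache.hive.hcatalog.pig.HCatLoader();
#   DESCRIBE hive_data;
```

This is handy for checking that the import works before running the full script file.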
DESCRIBE hive_data;
EXPLANATION: As you might be aware by now, the DESCRIBE command is used for viewing the datatypes and columns of the Pig Relation hive_data.
DUMP hive_data;
EXPLANATION: The DUMP command is used for printing the contents of the hive_data Pig Relation. This command is not required as part of the objective, but we are still executing it to confirm that the Hive table data was loaded into the Pig Relation successfully.
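As a side note, HCatalog also covers the reverse direction: writing a Pig Relation back into a Hive table uses the companion HCatStorer class in exactly the same way. A minimal sketch, assuming a hypothetical target table named processed_products that already exists in Hive:

```pig
-- load from Hive, exactly as in this tutorial
hive_data = LOAD 'products' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- write the relation back into an existing Hive table (table name is hypothetical)
STORE hive_data INTO 'processed_products' USING org.apache.hive.hcatalog.pig.HCatStorer();
```

Like the loader, this only works when the script is run with the -useHCatalog flag.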
We use the vi command to create this file in the terminal window. Once the contents of the Pig script file are in place, we run the cat command to verify that the file was created successfully. The following screenshot gives you a clear idea about this.
Once the Pig script is ready, we can run it. Let us see what happens if we use the traditional pig command to run this script.
pig -f post12.pig
As you can see from the above screenshot, if you don't use the -useHCatalog flag with the pig command, the command fails with the error "Could not resolve org.apache.hive.hcatalog.pig.HCatLoader using imports". This error clearly indicates that Pig was not able to find the jar files required to kick off the HCatalog functionality.
To resolve this issue, we should run the above command with the -useHCatalog flag. Once we use this flag, Pig will pick up the jar files required to run the HCatalog services needed to import the Hive data into the Pig Relation. For your reference, the following is the correct command used for this tutorial.
pig -useHCatalog -f post12.pig
The following screenshot shows that the file ran successfully and we got to see the output as well.
The following is the output of the DUMP command.
And the structure of the pig relation looks as follows.
The above screenshot shows that we got the output as expected.
I hope all the tutorials are helping you understand what is required to clear the certification. In the next tutorial, we are going to see how to format data into a specified format using a Pig Relation.
Please stay tuned for further updates.
Thanks for reading.