Post 41 | HDPCD | Loading compressed data into a Hive table

Hello, everyone. Thanks for returning for the next tutorial in the HDPCD certification series. In the last tutorial, we saw how to load data into a Hive table from a SELECT query. In this tutorial, we are going to see how to load compressed data into a Hive table.

Let us begin then.

Apache Hive: Loading compressed data

The above infographic shows the step-by-step process to perform this activity.

We will perform these tasks as shown below.

  • CREATING INPUT CSV FILE IN THE LOCAL FILE SYSTEM

We use the vi editor for creating the input CSV file in the local file system.

I have uploaded this input CSV file to my GitHub profile under the HDPCD repository with the name “54_input_file_for_compressed_data.csv”. You can download this file by clicking here, and it looks as follows.

The following commands are used for creating this input CSV file.

vi post41.csv

################################

PASTE THE COPIED CONTENTS HERE

################################

cat post41.csv

The output of the above commands looks as follows.

Step 1: Creating input CSV file in local file system

The above screenshot shows that the file post41.csv was created successfully and it is present in the local file system.
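
If you prefer not to download the file from GitHub, any small comma-separated file with three columns and nine rows will do for practice. The sample below is purely illustrative data of my own and is not the contents of the original file; it simply matches the shape of the data used in this post.

# Purely illustrative sample data (NOT the original GitHub file):
# nine rows, three comma-separated columns.
cat > post41.csv <<'EOF'
1,alice,85
2,bob,90
3,carol,78
4,dave,92
5,erin,67
6,frank,74
7,grace,88
8,heidi,95
9,ivan,81
EOF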

Let us compress this file.

  • COMPRESSING THE INPUT CSV FILE

We use the “gzip” command for compressing the input CSV file. Once compressed, this file will be used for loading into the Hive table.

We use the following commands to check the original file, create the compressed file, and confirm that the compressed file exists.

ls
gzip post41.csv
ls

The output of the above commands looks as follows.

Step 2: Creating the compressed file from the input local file

The above screenshot confirms that the compressed file was successfully created.
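
Before moving on, you can optionally inspect the compressed file. These checks are not part of the original steps, but they are handy for confirming that the data survived compression intact.

# Optional sanity checks on the compressed file.
gzip -l post41.csv.gz              # compressed vs. uncompressed size
gzip -cd post41.csv.gz | head -3   # peek at the first rows without unzipping the file on disk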

Let us check whether table post41 exists or not.

  • CHECKING THE PRE-EXISTENCE OF THE HIVE TABLE POST41

We use the “show tables;” command to get the list of tables present in Hive’s “default” database.

show tables;

The output of the above command is as follows.

Step 3: Checking the pre-existence of Hive table post41

As you can see from the above screenshot, the list of tables does not contain a table named post41. This means we can go ahead and create the Hive table post41.
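
As a quick aside (this is not part of the original steps), you can narrow the check to just the table you care about instead of scanning the full list:

-- Optional: list only the tables whose names match the pattern.
show tables like 'post41*';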

  • CREATING HIVE TABLE WITH NAME POST41

We use the “CREATE” command to create the Hive table into which we will load the data from the local file system.

I have uploaded this CREATE statement to my GitHub profile under the HDPCD repository with the name “55_hive_table_for_compressed_data.sql”. You can download this file by clicking here, and it looks as follows.

The execution of the above command looks as follows.

Step 4: Creating Hive table post41

The above screenshot shows that we get an “OK” message after executing this CREATE statement. This indicates that the table post41 was created successfully.
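
For reference, a CREATE statement along the following lines would work for this exercise. The column names and types here are purely illustrative assumptions of mine; the actual statement is the one in the linked 55_hive_table_for_compressed_data.sql file.

-- Illustrative sketch only: column names and types are assumptions;
-- the actual definition is in 55_hive_table_for_compressed_data.sql.
create table post41 (
  id int,
  name string,
  score int
)
row format delimited
fields terminated by ','
stored as textfile;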

Let us confirm the existence of the Hive table post41.

  • CONFIRMING THE EXISTENCE AND SCHEMA OF THE HIVE TABLE POST41

We use the same “show tables;” command to check the existence of the Hive table.

show tables;

The output of the above command is as follows.

Step 5: Confirming the existence of Hive table post41

As you can see, the Hive table post41 was created successfully. This allows us to continue with the next check: confirming the schema.

We use the “desc formatted post41;” command to check the schema of the Hive table post41.

desc formatted post41;

The output of the above command is as follows.

Step 6: Confirming the schema of Hive table post41

As you can see from the above screenshot, the table post41 contains three columns with the expected names and data types. The input format of the Hive table post41 is also as expected: TextInputFormat. This matters for this exercise, because Hadoop’s TextInputFormat detects the gzip codec from the .gz file extension and decompresses the data transparently at read time, which is what lets us load a compressed file into an otherwise plain text table.

This confirms that the Hive table post41 got created successfully with correct schema.

It is time to check how many records exist in the table post41 before actually loading the data.

  • RECORDS BEFORE DATA LOADING FROM COMPRESSED FILE

We use the “SELECT” command to get the records stored in the Hive table.

select * from post41;

The output of the above command is as follows.

Step 7: Checking records in Hive table post41 before data loading

As you can see, we got zero records after executing the above command. This confirms that the newly created Hive table post41 is empty, as expected.

From this, we can conclude that after loading the data from the post41.csv.gz file, the Hive table post41 should have 9 records with 3 columns.

Let us load the data into the Hive table post41 from the compressed file.

  • LOADING DATA INTO HIVE TABLE FROM COMPRESSED FILE

We use the “LOAD” command to load the data from the compressed file into the Hive table. The Hive command is as follows.

load data local inpath 'post41.csv.gz' into table post41;

The output of the above command is as follows.

Step 8: Loading compressed records in Hive table post41

The above screenshot shows that the input compressed file post41.csv.gz was successfully loaded into the Hive table post41. The screenshot also shows stats such as the number of files loaded and the total size of the input file(s).
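
It is worth noting that “load data local inpath” simply copies the file into the table’s directory in HDFS; the gzipped file is stored as-is and only decompressed when it is read. If you are curious, you can verify this with the command below. The warehouse path is an assumption based on the HDP default (/apps/hive/warehouse); your cluster may use a different location.

# Assumes the HDP default warehouse path; adjust if your cluster differs.
# The listing should show post41.csv.gz copied unchanged into the table directory.
hdfs dfs -ls /apps/hive/warehouse/post41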

This enables us to look for the records stored in the Hive table post41.

  • CHECKING THE RECORDS IN HIVE TABLE POST41

We use the same “SELECT” command to check the records that got stored in the Hive table.

select * from post41;

The output of the above command is as follows.

Step 9: Checking records in Hive table post41 after data loading

The above screenshot shows that there are a total of 9 records in the Hive table post41, each with 3 columns. This matches our expectation and corresponds exactly to the input compressed file post41.csv.gz.
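
As one more optional check (not shown in the post’s screenshots), a count query confirms the row count directly:

-- Optional verification of the row count; it should return 9.
select count(*) from post41;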

We can conclude this tutorial here.

I hope you guys like the contents. Keep sharing it with others.

In the next tutorial, we are going to see how to update a row in a Hive table.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.