Post 41 | HDPCD | Loading compressed data into a Hive table

Hello, everyone. Thanks for returning for the next tutorial in the HDPCD certification series. In the last tutorial, we saw how to load data into a Hive table from a SELECT query. In this tutorial, we are going to see how to load compressed data into a Hive table.

Let us begin then.

Apache Hive: Loading compressed data

The above infographic shows the step-by-step process to perform this activity.

We will perform these tasks as shown below.

  • CREATING INPUT CSV FILE IN THE LOCAL FILE SYSTEM

We use the vi editor for creating the input CSV file in the local file system.

I have uploaded this input CSV file to my GitHub profile under the HDPCD repository with the name “54_input_file_for_compressed_data.csv“. You can download this file by clicking here, and it looks as follows.


1,tina,F
2,jerry,M
3,tom,M
4,wonder woman,F
5,scoobydoo,M
6,donald duck,M
7,pink panther,F
8,oggy,M
9,shinchan,M


The following commands are used for creating this input CSV file.

vi post41.csv

################################

PASTE THE COPIED CONTENTS HERE

################################

cat post41.csv

The output of the above commands looks as follows.

Step 1: Creating input CSV file in local file system

The above screenshot shows that the file post41.csv was created successfully and it is present in the local file system.

Let us compress this file.

  • COMPRESSING THE INPUT CSV FILE

We use the “gzip” command for compressing the input CSV file. Once compressed, this file will be used for loading into the Hive table.

We use the following commands to check, create, and confirm the existence of the compressed input file.

ls
gzip post41.csv
ls

The output of the above commands looks as follows.

Step 2: Creating the compressed file from the input local file

The above screenshot confirms that the compressed file was successfully created.
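
A quick note on gzip’s behaviour: it replaces the original file, which is why the second “ls” shows only post41.csv.gz. If you want to keep the uncompressed copy around as well, you can write the archive to a separate file instead, as in the sketch below (newer gzip releases also offer a -k flag for the same effect).

# keeps post41.csv and writes the compressed copy alongside it
gzip -c post41.csv > post41.csv.gz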

Let us check whether table post41 exists or not.

  • CHECKING THE PRE-EXISTENCE OF THE HIVE TABLE POST41

We use the “show tables;” command to get the list of tables present in the “default” database of Hive.

show tables;

The output of the above command is as follows.

Step 3: Checking the pre-existence of Hive table post41

As you can see from the above screenshot, the list of tables does not contain a table named post41. This indicates that we can go ahead and create the Hive table post41.

  • CREATING HIVE TABLE WITH NAME POST41

We use the “CREATE” command to create the Hive table into which we will load the data from the local file system.

I have uploaded this SQL file to my GitHub profile under the HDPCD repository with the name “55_hive_table_for_compressed_data.sql“. You can download this file by clicking here, and it looks as follows.

create table post41 (
id int,
name string,
gender string
)
row format delimited
fields terminated by ','
stored as textfile;


The execution of the above command looks as follows.

Step 4: Creating Hive table post41

The above screenshot shows that we get an “OK” message after executing this CREATE statement. This indicates that the table post41 was created successfully.

Let us confirm the existence of the Hive table post41.

  • CONFIRMING THE EXISTENCE AND SCHEMA OF THE HIVE TABLE POST41

We use the same command “show tables;” to check the existence of the Hive table.

show tables;

The output of the above command is as follows.

Step 5: Confirming the existence of Hive table post41

As you can see, the Hive table post41 was created successfully. This allows us to continue with the further check of schema confirmation.

We use the “desc formatted post41;” command to check the schema of the Hive table post41.

desc formatted post41;

The output of the above command is as follows.

Step 6: Confirming the schema of Hive table post41

As you can see from the above screenshot, the table post41 contains three columns with the defined names and data types. The input format of the Hive table post41 is also as expected, namely TextInputFormat.

This confirms that the Hive table post41 got created successfully with correct schema.

It is time to check how many records exist in the table post41 before actually loading the data.

  • RECORDS BEFORE DATA LOADING FROM COMPRESSED FILE

We use the “SELECT” command to get the records stored in the Hive table.

select * from post41;

The output of the above command is as follows.

Step 7: Checking records in Hive table post41 before data loading

As you can see, we got zero records after executing the above command. This indicates that, by default, there are no records in the Hive table post41.

From this, we can conclude that after loading the data from the post41.csv.gz file, the Hive table post41 should have 9 records with 3 columns.

Let us load the data into the Hive table post41 from the compressed file.

  • LOADING DATA INTO HIVE TABLE FROM COMPRESSED FILE

We use the “LOAD” command to load the data from the compressed file into the Hive table. The Hive command is as follows.

load data local inpath 'post41.csv.gz' into table post41;

The output of the above command is as follows.

Step 8: Loading compressed records in Hive table post41

The above screenshot shows that the input compressed file post41.csv.gz was successfully loaded into the Hive table post41. The screenshot shows the stats like the number of files and the size of the input file(s).
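
It is worth knowing what happened under the hood: Hive stores the .gz file as-is in the table’s directory and decompresses it on the fly at query time, picking the codec from the file extension. You can verify this with something along the following lines (the warehouse path is the usual HDP default, so treat it as an assumption; your setup may differ).

hadoop fs -ls /apps/hive/warehouse/post41
# post41.csv.gz should be listed here, still compressed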

This enables us to look for the records stored in the Hive table post41.

  • CHECKING THE RECORDS IN HIVE TABLE POST41

We use the same “SELECT” command to check the records that got stored in the Hive table.

select * from post41;

The output of the above command is as follows.

Step 9: Checking records in Hive table post41 after data loading

The above screenshot shows that there is a total of 9 records in the Hive table post41 with 3 columns. This meets our expectations and matches exactly with the input compressed file post41.csv.gz.
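
If you would rather verify the count than eyeball the rows, a simple aggregate query works too; it runs as a MapReduce job and should return 9.

select count(*) from post41;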

We can conclude this tutorial here.

I hope you guys like the contents. Keep sharing it with others.

In the next tutorial, we are going to see how to update a row in a Hive table.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

Post 40 | HDPCD | Load data into a Hive table as a result of a query

Hi, everyone. Thanks for coming back for one more tutorial in this HDPCD certification series.

In the last tutorial, we saw how to load data into a Hive table from an HDFS directory. In this tutorial, we are going to see how to load the data into a Hive table as a result of a Hive query.

Let us begin then.

Apache Hive: Loading data from a query

The above infographic shows the step-by-step process to perform this activity.

We will perform these tasks as shown below.

  • CHECKING THE PRE-EXISTENCE OF THE HIVE TABLE POST40

We use the “show tables;” command to get the list of tables present in the “default” database of Hive.

show tables;

The output of the above command is as follows.

Step 1: Checking the pre-existence of the Hive table

As you can see from the above screenshot, the list of tables does not contain a table named post40. This indicates that we can go ahead and create the Hive table post40.

  • SELECT QUERY TO POPULATE THE HIVE TABLE

We need to carefully choose the “SELECT” query that we are going to use for loading the data into the Hive table post40.

As we have seen from the above screenshot, the Hive table post39 already exists, and as seen in the last tutorial, it contains a sample of 9 records with 3 columns. To refresh your memory, we will use the following query to load the data into the Hive table post40.

select * from post39;

The output of the above command looks as follows.

Step 2: Checking the SELECT query to populate the new Hive table

Now that we are good with the SELECT query, let us create the Hive table post40 which follows the schema of the SELECT query.

  • CREATING HIVE TABLE WITH NAME POST40

We use the “CREATE” command to create the Hive table that will hold the output of the SELECT query.

I have uploaded this SQL file to my GitHub profile under the HDPCD repository with the name “53_create_hive_table_for_SELECT_load.sql“. You can download this file by clicking here, and it looks as follows.

create table post40 (
id int,
name string,
gender string
)
row format delimited
fields terminated by ','
stored as textfile;


The execution of the above command looks as follows.

Step 3: Creating the new Hive table post40 with the same schema as post39

The above screenshot shows that we get an “OK” message after executing this CREATE statement. This indicates that the table post40 was created successfully.

Let us confirm the existence of the Hive table post40.

  • CONFIRMING THE EXISTENCE OF THE HIVE TABLE POST40

We use the same command “show tables;” to check the existence of the Hive table.

show tables;

The output of the above command is as follows.

Step 4: Confirming the existence of the newly created Hive table

As you can see, the Hive table post40 was created successfully. This allows us to continue with the rest of the process.

Let us check the records in the Hive table post40 before doing the actual data load.

  • CHECKING THE RECORDS IN THE HIVE TABLE POST40 BEFORE DATA LOAD

We use the “SELECT” command to check the records of the Hive table post40 before loading the data.

select * from post40;

The output of the above command is as follows.

Step 5: Checking the records in Hive table post40 before data load

As you can see, we got zero records after executing the above command. This indicates that, by default, there are no records in the Hive table post40.

From this, we can conclude that after loading the data from the above-mentioned SELECT query, the Hive table post40 should have 9 records with 3 columns.

Let us load the data into the Hive table post40 from the SELECT query.

  • LOADING DATA INTO HIVE TABLE FROM SELECT QUERY

We use the “INSERT” command to load the data from the SELECT query into the Hive table. The Hive command is as follows.

insert into post40 select * from post39;

The output of the above command is as follows.

Step 6: Loading the data using the INSERT command

The above screenshot shows that the output of the SELECT query was successfully loaded into the Hive table post40. The screenshot shows the stats like the number of files and the size of the input file(s).
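
A side note here: “insert into” appends to whatever the table already holds, so running the same statement a second time would leave 18 records. If you want each run to replace the table’s contents instead, Hive offers the overwrite variant shown below (not used in this tutorial).

insert overwrite table post40 select * from post39;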

This enables us to look for the records stored in the Hive table post40.

  • CHECKING THE RECORDS IN HIVE TABLE POST40

We use the same “SELECT” command to check the records that got stored in the Hive table.

select * from post40;

The output of the above command is as follows.

Step 7: Confirming the records stored in Hive table post40

The above screenshot shows that there is a total of 9 records in the Hive table post40 with 3 columns. This meets our expectations and matches exactly with the SELECT query output.

We can conclude this tutorial here.

I hope you guys like the contents. Keep sharing it with others.

In the next tutorial, we are going to see how to load compressed data into a Hive table.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

Post 39 | HDPCD | Load data into a Hive table from an HDFS directory

Hello, everyone. Thanks for returning for the next tutorial in the HDPCD certification series. In the last tutorial, we saw how to load data into a Hive table from a local directory. In this tutorial, we are going to see how to load the data from an HDFS directory into the Hive table.

Let us begin then.

Apache Hive: Loading data from an HDFS file

The above infographic shows the step-by-step process to perform this activity.

We will perform these tasks as shown below.

  • CREATING INPUT CSV FILE IN THE LOCAL FILE SYSTEM

We use the vi editor for creating the input CSV file in the local file system.

I have uploaded this input CSV file to my GitHub profile under the HDPCD repository with the name “51_input_to_load_from_hdfs.csv“. You can download this file by clicking here, and it looks as follows.


1,tina,F
2,jerry,M
3,tom,M
4,wonder woman,F
5,scoobydoo,M
6,donald duck,M
7,pink panther,F
8,oggy,M
9,shinchan,M


The following commands are used for creating this input CSV file.

vi post39.csv

################################

PASTE THE COPIED CONTENTS HERE

################################

cat post39.csv

The output of the above commands looks as follows.

Step 1: Creating input CSV file in local file system

The above screenshot shows that the file post39.csv was created successfully and it is present in the local file system.

Let us push this file to HDFS now.

  • PUSHING THE INPUT CSV FILE FROM LOCAL FILE SYSTEM TO HDFS

We use the “put” command for transferring the above-mentioned input CSV file from the local file system to HDFS.

hadoop fs -mkdir /hdpcd/input/post39
hadoop fs -put post39.csv /hdpcd/input/post39
hadoop fs -cat /hdpcd/input/post39/post39.csv

The output of the above commands looks as follows.

Step 2: Pushing the input file from local file system to HDFS

The above screenshot confirms that the input CSV file was successfully pushed to HDFS.

Let us check whether table post39 exists or not.

  • CHECKING THE PRE-EXISTENCE OF THE HIVE TABLE POST39

We use the “show tables;” command to get the list of tables present in the “default” database of Hive.

show tables;

The output of the above command is as follows.

Step 3: Checking the pre-existence of the Hive table

As you can see from the above screenshot, the list of tables does not contain a table named post39. This indicates that we can go ahead and create the Hive table post39.

  • CREATING HIVE TABLE WITH NAME POST39

We use the “CREATE” command to create the Hive table into which we will load the data from HDFS.

I have uploaded this SQL file to my GitHub profile under the HDPCD repository with the name “52_create_hive_table_for_hdfs_load.sql“. You can download this file by clicking here, and it looks as follows.

create table post39 (
id int,
name string,
gender string
)
row format delimited
fields terminated by ','
stored as textfile;


The execution of the above command looks as follows.

Step 4: Creating Hive table for loading HDFS data

The above screenshot shows that we get an “OK” message after executing this CREATE statement. This indicates that the table post39 was created successfully.

Let us confirm the existence of the Hive table post39.

  • CONFIRMING THE EXISTENCE OF THE HIVE TABLE POST39

We use the same command “show tables;” to check the existence of the Hive table.

show tables;

The output of the above command is as follows.

Step 5: Confirming the existence of the Hive table

As you can see, the Hive table post39 was created successfully. This allows us to continue with the further checks.

Let us check the schema of the Hive table post39.

  • CHECKING THE SCHEMA OF THE HIVE TABLE POST39

We use the “desc” command to check the detailed schema of the Hive table post39.

desc formatted post39;

The output of the above command is as follows.

Step 6: Checking the schema of the Hive table

As you can see from the above screenshot, the table post39 contains three columns with the defined names and data types. The input format of the Hive table post39 is also as expected, namely TextInputFormat.

This confirms that the Hive table post39 got created successfully with the correct schema.

It is time to check how many records exist in the table post39 before actually loading the data.

  • RECORDS BEFORE DATA LOADING FROM HDFS

We use the “SELECT” command to get the records stored in the Hive table.

select * from post39;

The output of the above command is as follows.

Step 7: Checking the records before the HDFS data load

As you can see, we got zero records after executing the above command. This indicates that, by default, there are no records in the Hive table post39.

From this, we can conclude that after loading the data from the post39.csv file, the Hive table post39 should have 9 records with 3 columns.

Let us load the data into the Hive table post39 from HDFS.

  • LOADING DATA INTO HIVE TABLE FROM HDFS

We use the “LOAD” command to load the data from HDFS into the Hive table. The Hive command is as follows.

load data inpath '/hdpcd/input/post39/post39.csv' into table post39;

The output of the above command is as follows.

Step 8: Loading the HDFS data into the Hive table

The above screenshot shows that the input HDFS file post39.csv was successfully loaded into the Hive table post39. The screenshot shows the stats like the number of files and the size of the input file(s).
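
Note that “load data inpath” without the LOCAL keyword moves the file rather than copying it: after the load, post39.csv is no longer under /hdpcd/input/post39 and sits in the table’s directory instead. You can confirm this along the following lines (the warehouse path is the usual HDP default, so treat it as an assumption).

hadoop fs -ls /hdpcd/input/post39
# the directory should now be empty
hadoop fs -ls /apps/hive/warehouse/post39
# post39.csv should show up here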

This enables us to look for the records stored in the Hive table post39.

  • CHECKING THE RECORDS IN HIVE TABLE POST39

We use the same “SELECT” command to check the records that got stored in the Hive table.

select * from post39;

The output of the above command is as follows.

Step 9: Checking the records after the HDFS data load

The above screenshot shows that there is a total of 9 records in the Hive table post39 with 3 columns. This meets our expectations and matches exactly with the input file post39.csv.

We can conclude this tutorial here.

I hope you guys like the contents. Keep sharing it with others.

In the next tutorial, we are going to see how to load data into a Hive table as a result of a query.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

Post 38 | HDPCD | Load data into Hive table from a Local Directory

Hello, everyone. Thanks for returning for the next tutorial in the HDPCD certification series. In the last tutorial, we saw how to specify the delimiter of the Hive table. In this tutorial, we are going to see how to load the data from a local directory into the Hive table.

Let us begin then.

Apache Hive: Loading data from a local file

The above infographic shows the step-by-step process to perform this activity.

We will perform these tasks as shown below.

  • CREATING INPUT CSV FILE IN THE LOCAL FILE SYSTEM

We use the vi editor for creating the input CSV file in the local file system.

I have uploaded this input CSV file to my GitHub profile under the HDPCD repository with the name “49_input_to_load_from_local.csv“. You can download this file by clicking here, and it looks as follows.


1,tina,F
2,jerry,M
3,tom,M
4,wonder woman,F
5,scoobydoo,M
6,donald duck,M
7,pink panther,F
8,oggy,M
9,shinchan,M


The following commands are used for creating this input CSV file.

vi post38.csv

################################

PASTE THE COPIED CONTENTS HERE

################################

cat post38.csv

The output of the above commands looks as follows.

Step 1: Creating input CSV file in local file system

The above screenshot shows that the file post38.csv was created successfully and it is present in the local file system.

Let us check whether table post38 exists or not.

  • CHECKING THE PRE-EXISTENCE OF THE HIVE TABLE POST38

We use the “show tables;” command to get the list of tables present in the “default” database of Hive.

show tables;

The output of the above command is as follows.

Step 2: Checking the pre-existence of the Hive table

As you can see from the above screenshot, the list of tables does not contain a table named post38. This indicates that we can go ahead and create the Hive table post38.

  • CREATING HIVE TABLE WITH NAME POST38

We use the “CREATE” command to create the Hive table into which we will load the data from the local file system.

I have uploaded this SQL file to my GitHub profile under the HDPCD repository with the name “50_create_hive_table_for_local_load.sql“. You can download this file by clicking here, and it looks as follows.

create table post38 (
id int,
name string,
gender string
)
row format delimited
fields terminated by ','
stored as textfile;


The execution of the above command looks as follows.

Step 3: Creating Hive table for loading local data

The above screenshot shows that we get an “OK” message after executing this CREATE statement. This indicates that the table post38 was created successfully.

Let us confirm the existence of the Hive table post38.

  • CONFIRMING THE EXISTENCE OF THE HIVE TABLE POST38

We use the same “show tables;” command to check the existence of the Hive table.

show tables;

The output of the above command is as follows.

Step 4: Confirming the existence of the Hive table

As you can see, the Hive table post38 was created successfully. This allows us to continue with the further checks.

Let us check the schema of the Hive table post38.

  • CHECKING THE SCHEMA OF THE HIVE TABLE POST38

We use the “desc” command to check the detailed schema of the Hive table post38.

desc formatted post38;

The output of the above command is as follows.

Step 5: Checking the schema of the Hive table

As you can see from the above screenshot, the table post38 contains three columns with the defined names and data types. The input format of the Hive table post38 is also as expected, namely TextInputFormat.

This confirms that the Hive table post38 got created successfully with the correct schema.

It is time to check how many records exist in the table post38 before actually loading the data.

  • RECORDS BEFORE DATA LOADING FROM LOCAL FILE SYSTEM

We use the “SELECT” command to get the records stored in the Hive table.

select * from post38;

The output of the above command is as follows.

Step 6: Checking the records before the local data load

As you can see, we got zero records after executing the above command. This indicates that, by default, there are no records in the Hive table post38.

From this, we can conclude that after loading the data from the post38.csv file, the Hive table post38 should have 9 records with 3 columns.

Let us load the data into the Hive table post38 from the local file system.

  • LOADING DATA INTO HIVE TABLE FROM LOCAL FILE SYSTEM

We use the “LOAD” command to load the data from the local file system into the Hive table. The command is as follows.

load data local inpath 'post38.csv' into table post38;

The output of the above command is as follows.

Step 7: Loading the local data into the Hive table

The above screenshot shows that the input local file post38.csv was successfully loaded into the Hive table post38. The screenshot shows the stats like the number of files and the size of the input file(s).
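
In contrast with the HDFS-based load covered in the next tutorial, “load data local inpath” copies the file, so post38.csv stays in the local file system and a copy lands in the table’s directory. A quick way to verify this is sketched below (the warehouse path is the usual HDP default, so treat it as an assumption).

ls post38.csv
# the local copy is still present
hadoop fs -ls /apps/hive/warehouse/post38
# the table directory now holds its own copy of post38.csv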

This enables us to look for the records stored in the Hive table post38.

  • CHECKING THE RECORDS IN HIVE TABLE POST38

We use the same “SELECT” command to check the records that got stored in the Hive table.

select * from post38;

The output of the above command is as follows.

Step 8: Checking the records after the local data load

The above screenshot shows that there are a total of 9 records in the Hive table post38 with 3 columns. This meets our expectations and matches exactly with the input file post38.csv.

We can conclude this tutorial here.

I hope you guys like the contents. Keep sharing it with others.

In the next tutorial, we are going to see how to load data into a Hive table from an HDFS directory.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

Post 37 | HDPCD | Specifying delimiter of a Hive table

Hello, everyone. Thanks for coming back for one more tutorial in this HDPCD certification series.

In the last tutorial, we saw how to specify the storage format of a Hive table. In this tutorial, we are going to see how to specify the delimiter of a Hive table.

We are going to follow the process mentioned in the following infographic.

Apache Hive: Specifying delimiter

This process is similar to most of the Hive tutorials we have seen in this certification series. Let us start with this process.

  • CREATING INPUT FILE WITH TAB DELIMITER IN LOCAL FILE SYSTEM AND PUSHING IT INTO HDFS

We can create the input file in the local file system with the help of the vi editor.

I have uploaded this input TSV file to my GitHub profile under the HDPCD repository with the name “47_input_delimiter_hive.tsv“. You can download this file by clicking here. This file looks as follows.


1 tina F
2 jerry M
3 tom M
4 wonder woman F
5 scoobydoo M
6 donald duck M
7 pink panther F
9 shinchan M
8 oggy M


We are going to use the following commands to create the input file in the local file system.

vi post37.tsv

################################

PASTE THE COPIED CONTENTS HERE

################################

cat post37.tsv

The above commands create the input file post37.tsv in the local file system.

Once the file is created in the local file system, you can push it to HDFS with the help of the put command.

The commands look as follows.

hadoop fs -mkdir /hdpcd/input/post37
hadoop fs -put post37.tsv /hdpcd/input/post37
hadoop fs -cat /hdpcd/input/post37/post37.tsv

The output of the above commands is shown in the following screenshot.

Step 1: Creating input TAB-delimited file

As you can see from the above screenshot, the file post37.tsv was successfully created in HDFS. This file will serve as the input for the post37 Hive table.

Now, let us check the existence of the Hive table post37.

  • CHECKING THE PRE-EXISTENCE OF THE HIVE TABLE

You can use the “show tables;” command to check the existence of the Hive table post37.

show tables;

The output of the above command looks like this.

Step 2: Checking pre-existence of Hive table

As you can see from the above screenshot, the table post37 does not exist. This indicates that we can go ahead and create the Hive table post37.

Let us create the Hive table post37 now.

  • CREATING THE HIVE TABLE WITH TAB DELIMITER

We use the CREATE command for creating a Hive table with a TAB delimiter for the records in post37.tsv.

I have uploaded the SQL file to my GitHub profile under the HDPCD repository with the name “48_hive_table_tab_delimiter.sql“. You can download this file by clicking here. This file looks as follows.

create external table post37 (
id int,
name string,
gender string
)
row format delimited
fields terminated by '\t'
stored as textfile
location '/hdpcd/input/post37/';


The output of the above command is shown in the following screenshot.

Step 3: Creating Hive table with TAB-delimited records

As shown in the above screenshot, the “OK” message indicates that the Hive table post37 was created successfully. Having said that, let us confirm the existence of the Hive table post37.

You can use the following command to check the existence of the Hive table post37.

show tables;

The output of the above command is as follows.

Step 4: Confirming the existence of newly created Hive table

The above screenshot confirms the existence of the Hive table post37. This enables us to go for the schema check of the Hive table post37.

  • CHECK THE SCHEMA OF THE NEWLY CREATED HIVE TABLE

You can use the “desc” command to check the schema of the Hive table post37.

desc formatted post37;

The output of the above command is shown in the following screenshot.

Step 5: Confirming the schema and record delimiter of Hive table

The above screenshot confirms the schema of the Hive table post37. At the bottom of the screenshot, as you can see, the FIELD DELIMITER is the TAB character (\t).

Now, the only thing remaining is the records check.

  • CHECK THE RECORDS STORED IN THE HIVE TABLE

You can use the “SELECT” command to check the records stored in the Hive table post37.

The command is as follows.

select * from post37;

The output of the above command is as follows.

Step 6: Checking the records in the Hive table

The above screenshot confirms that we were successful in loading the TAB delimited records into the Hive table post37.
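
One closing remark on this table: since post37 was created as an EXTERNAL table with an explicit LOCATION, Hive never copied the data; it reads post37.tsv in place. A minimal sketch of what that implies is below (only run the drop once you are done with the table).

-- dropping an external table removes only the metadata;
-- /hdpcd/input/post37/post37.tsv stays in HDFS
drop table post37;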

This completes the tutorial here.

I hope you guys like the content.

In the next tutorial, we are going to see how to load data into a Hive table from a Local Directory.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

Post 36 | HDPCD | Specifying storage format of a Hive table

Hello, everyone. Thanks for coming back to one more tutorial in this certification series.

In the last tutorial, we saw how to insert records into an ORC table from a NON-ORC table. In this tutorial, we are going to see how to specify the storage format of a Hive table.

Let us begin, then.

As usual, we are going to start off by checking whether the table POST36 already exists or not.

  • CHECKING THE PRE-EXISTENCE OF THE HIVE TABLE POST36

We use the “show tables;” command to see the list of tables in Hive’s “default” database.

show tables;

The output of the above command looks like this.

Step 1: Checking the pre-existence of Hive table with defined storage format

As you can see from the above screenshot, the table POST36 does not exist in Hive. This indicates that we can go ahead and create the Hive table POST36.

Now, let us create the Hive table with the correct schema.

  • CREATING HIVE TABLE WITH SEQUENCEFILE STORAGE FORMAT 

We are using the SequenceFileInputFormat as the input format for the Hive table POST36. You can use different storage formats as well.

Hive supports the following storage formats.

  1. textfile
  2. sequencefile
  3. orc
  4. parquet
  5. avro
  6. rcfile

You can use any of the above six storage formats. I am using sequencefile to get the SequenceFileInputFormat.
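
For instance, switching formats is just a change to the “stored as” clause. Here is a hedged sketch with parquet (post36_parquet is a hypothetical table name used only for illustration).

create table post36_parquet (
id int,
name string,
gender string
)
row format delimited
fields terminated by ','
stored as parquet;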

For creating this table, we are using the following “CREATE” command.

create table post36 (
id int,
name string,
gender string
)
row format delimited
fields terminated by ','
stored as sequencefile;


I have uploaded this SQL file to my GitHub profile under the HDPCD repository with the name “46_sequence_file_hive.sql“. You can download this file by clicking here.

The following screenshot shows the execution of the above command.

Step 2: Creating Hive table with defined storage type

The “OK” message shown in the above screenshot indicates that the table POST36 was created successfully.

Now, we will confirm the existence of the Hive table POST36.

  • CONFIRMING THE EXISTENCE OF HIVE TABLE POST36

We use the same “show tables;” command to check whether POST36 was created successfully or not.

show tables;

The execution of the above command is as follows.

Step 3: Confirming the existence of the newly created Hive table

As can be seen from the above screenshot, the table POST36 was created successfully.

Now, the last thing we want to check is the schema of this table.

  • CONFIRMING THE SCHEMA OF HIVE TABLE POST36

We use the “desc” command to check the schema of the Hive table POST36.

desc formatted post36;

The output of the above command is as follows.

Step 4: Checking the schema of newly created Hive table

As you can see from the above screenshot, the column names and datatypes are as expected. The storage information shows that the InputFormat of the Hive table is SequenceFileInputFormat, as expected.
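
One caveat with such tables: “LOAD DATA” does not convert file formats, so loading a plain-text CSV into POST36 would appear to succeed but later queries would fail, because the file is not a real SequenceFile. The usual route is to stage the data in a text-format table and rewrite it with an INSERT…SELECT, roughly as follows (post36_staging is a hypothetical staging table).

create table post36_staging (
id int,
name string,
gender string
)
row format delimited
fields terminated by ','
stored as textfile;

insert into post36 select * from post36_staging;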

This completes our tutorial of creating a Hive table with defined storage format.

I hope you guys like the content.

In the next tutorial, we are going to see how to define the delimiter of a Hive table.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

Post 35 | HDPCD | Insert records from NON-ORC table into ORC table

Hello, everyone. Thanks for going through the tutorials. The increasing views and visitors act as a motivation for me.

In the last tutorial, we saw how to define a Hive ORC table. In this tutorial, we are going to load data into that ORC table from a NON-ORC table.

For doing this, we are going to follow the below process.

Loading records into ORC table from NON-ORC table

As you can see from the above infographic, this process follows the below steps.

  • CHECK NON-EXISTENCE OF ORC TABLE and EXISTENCE OF NON-ORC TABLE

We use the “show tables;” command to check the list of tables present in the Hive database.

show tables;

The output of the above command is as follows.

Step 1: Checking the pre-existence of Hive table with ORC format

As you can see from the above screenshot, the table post33, which is a NON-ORC table, exists. Also, the target ORC table post35 does not exist, which indicates that we can go ahead and create the table post35.

But, before doing that, let us check the records present in the table post33.

  • CHECKING THE RECORDS STORED IN NON-ORC TABLE POST33

We use the “SELECT” command to see the records stored in the post33 table.

select * from post33;

The output of the above command looks as follows.

Step 2: Checking the input records from NON-ORC Hive table

As you can see from the above screenshot, the table post33 contains a total of 9 records with 3 columns.

Therefore, our expectation is that at the end of this tutorial, we should get 9 records with 3 columns in ORC table POST35.

Let us continue with the further steps, then.

  • CREATING HIVE ORC TABLE POST35

We use the “CREATE” command shown in this tutorial to create the Hive table with ORC as the storage format. For further explanation, you can refer to this tutorial. The “CREATE” command looks like this.

create table post35 (
id int,
name string,
gender string
)
row format delimited
fields terminated by ','
stored as orc;

The output of the above command looks as follows.

Step 3: Creating Hive table with ORC format

As you can see from the above screenshot, the “OK” message tells us that the table “POST35” was created successfully.

Now, it is time to compare the schemas of the two tables – POST33 and POST35.

  • CHECKING THE SCHEMA OF TABLES POST33 AND POST35

We use the “desc” and “desc formatted” commands to check the schemas of tables POST33 and POST35.

desc post33;

desc formatted post35;

The output of the above commands looks like this.

Step 4: Checking schema of newly created Hive table

As you can see, the number of columns and the column datatypes match exactly with each other.

The only difference between these two tables is their storage format. The post33 table uses the default storage format, i.e. TextInputFormat, whereas the newly created table POST35 uses the ORC input format.

Once we confirm this, it is time to load some records in this ORC table.

  • INSERTING DATA INTO ORC TABLE FROM NON-ORC TABLE

We use the “INSERT” command to load records in the ORC table. The command is as follows.

insert into post35 select * from post33;

The output of the above command looks as follows.

Step 5: Inserting data into ORC table from NON-ORC table

As you can see from the above screenshot, Hive triggers a MapReduce job in the backend to perform this activity. The “SUCCEEDED” message shows that the command worked fine and the intended operation was successful.

Apart from the message, it gives various stats such as the number of files, the number of records, the total size of the records, and the data size in bytes. These stats help us make an informed decision about whether the operation executed successfully or not.
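
Out of curiosity, you can also inspect what landed on disk. Unlike post33’s readable text files, the ORC output is binary, but Hive ships a dump utility for it. A hedged sketch follows (the warehouse path and the 000000_0 file name are assumptions; list the directory first to get the real name).

hadoop fs -ls /apps/hive/warehouse/post35
hive --orcfiledump /apps/hive/warehouse/post35/000000_0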

Now, the last step is to check the records in the Hive table POST35.

  • CHECKING THE RECORDS STORED IN THE TABLE POST35

We use the “SELECT” command to get a sense of the data stored in the table POST35.

select * from post35;

The output of the above command is as follows.

Step 6: Checking the records in the ORC table

As you can see from the above screenshot, the records inserted into table POST35 match those in table POST33. This confirms that the records got loaded into the ORC table from a NON-ORC table.

This completes the tutorial here.

In the next tutorial, we are going to see how to define the storage format of a Hive table.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

Post 34 | HDPCD | Defining Hive Table using an ORC File Format

Hi, everyone. Thanks for joining me today for this tutorial.

In the last tutorial, we saw how to create a Hive table using a SELECT query. In this tutorial, we are going to see how to create a Hive table which stores the data in the ORC File Format.

The process of creating this table is similar to the internal table creation process in this tutorial, with only one change: the storage format of the Hive table.

Let us get started then.

The process is shown in the following infographic.

Apache Hive: ORC File Format table

As can be seen from the above infographic, it is a 4-step process.

Let us start the process by executing each step one by one.

  • CHECK IF THE HIVE TABLE NAME ALREADY EXISTS

We have done this check in almost all the Apache Hive tutorials so far.

show tables;

The output of the above command looks as follows.

Step 1: Checking the pre-existence of the Hive table with ORC format

As you can see from the above screenshot, a table named post34 does not exist. It means that we can go ahead and create this table with the name post34.

  • CREATING AND EXECUTING HIVE INTERNAL TABLE WITH ORC FILE FORMAT

The command to create this Hive internal table is similar to the one in this tutorial. Therefore, if you want a detailed explanation of the schema and the design process, you can refer to that tutorial.

I have uploaded the SQL file to create this Hive table to my GitHub profile under the HDPCD repository with the name “45_hive_table_with_ORC.sql“. You can download this file by clicking here, and it looks as follows.

create table post34 (
id int,
name string,
gender string
)
row format delimited
fields terminated by ','
stored as orc;


The only difference that you can see is the last line, “stored as orc;”, which indicates that the storage format should be ORC.
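
As an aside, ORC tables can also take table properties that tune the format, such as the compression codec. A hedged sketch, not used in this tutorial (post34_snappy is a hypothetical table name; “orc.compress” is a standard ORC table property):

create table post34_snappy (
id int,
name string,
gender string
)
stored as orc tblproperties ("orc.compress"="SNAPPY");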

If the above schema is clear, let us create the Hive table with the defined schema.

Step 2: Creating the Hive table with ORC format

The above screenshot shows that the schema worked perfectly and we got the expected output, i.e. an “OK” message, after executing this command.

Now, let us check whether this OK message is correct or not.

  • VERIFY THAT THE NEW TABLE GOT CREATED SUCCESSFULLY

The updated list of tables in the “default” database confirms whether the Hive table post34 got created or not.

show tables;

The output of the above command looks as follows.

Step 3: Confirming the existence of Hive table with ORC format

As you can see from the above screenshot, the Hive table post34 was created successfully, and we can see it in the updated list of tables in the “default” database.

Now, it is time to check the schema of this post34 Hive table.

  • CHECKING THE SCHEMA OF THE HIVE ORC TABLE

We are going to use the following command to check the schema of this newly created Hive table.

desc formatted post34;

The output of the above command looks like this.

Step 4: Checking the schema and file format of newly created Hive table

As you can see from the above screenshot, the input and output file formats are ORC, unlike the Text formats in this tutorial.

This completes the objective of the tutorial. Hope you guys like the content.

In the next tutorial, we are going to extend this one and load the data into an ORC table from a NON-ORC table.

Till then, stay tuned for the further updates.

You can visit my website at www.milindjagre.com.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

Cheers!

Post 33 | HDPCD | Define a Table from a SELECT Query

Hello everyone and welcome to one more tutorial in the HDPCD certification series.

In the last tutorial, we saw how to define a BUCKETED Hive table. In this tutorial, we are going to see how to create a Hive table from a SELECT query.

Let us begin then.

We are going to follow the below process for creating a brand new table from a SELECT query. This is also called CTAS, which stands for Create Table As Select.

Apache Hive: CTAS

As you can see from the above infographic, it is a 5-step process. We will see each step one by one. So, here it goes.

  • CHECKING IF THE NEW TABLE NAME i.e. POST33 ALREADY EXISTS

You are probably familiar with this step by now, since we follow it in each tutorial. We use the “show tables;” command to get the list of tables and check whether the POST33 table already exists.

show tables;

The output of the above command looks something like this.

Step 1: Checking the pre-existence of the Hive table

As you can see from the above screenshot, the table POST33 does not already exist in the “default” database in Hive.

This means we can go ahead and construct the Hive table with the name “POST33“.

  • CONSTRUCTING THE SELECT QUERY FOR CREATING TABLE POST33

This is the second step in this process. In this step, we have to construct the SELECT query which we are going to use for creating the table POST33. Now, this SELECT query can be as complicated as needed, but for demonstration purposes, I am using the simplest SELECT query possible. This SELECT query extracts the entire data from the already existing table POST30, which we created in this tutorial. The SELECT query looks like this.

select * from post30;

The output of the above command is as follows.

Step 2: Checking the SELECT query used for population

As you can see from the above screenshot, the SELECT query returned 9 records having 3 columns. Therefore, after creation, our output table POST33 should also contain 9 records with 3 columns, and the column names of table POST33 should match those of POST30.
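
As a side note, nothing restricts CTAS to a plain “select *”: the query may project, filter, or join, and the new table takes its schema from the query’s output columns. A hedged sketch of a filtered variant (post33_f is a hypothetical table name, not part of this tutorial):

create table post33_f as select id, name from post30 where gender = 'F';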

Now that we have constructed the SELECT query, it is time to build the new table POST33 with the help of the CTAS query.

  • USING CTAS TO BUILD NEW HIVE TABLE POST33

We are going to use the following query to create the Hive table POST33.

create table post33 as select * from post30;

The output of this command looks like this.

Step 3: Creating table from a SELECT query

As can be seen from the screenshot, the Hive server launches a MapReduce job in the back-end to create the table POST33 and load the resultant data from the SELECT query into it.

The command returns an “OK” message at the end, along with the STATUS shown as “SUCCEEDED“. These two things indicate that our command worked fine and the expected output was created.

But we had better verify it before commenting on it.

  • CONFIRMING THE EXISTENCE OF THE NEW TABLE POST33

We can use the same “show tables;” command to check whether the Hive table POST33 was created successfully or not.

show tables;

The output of the above command looks as follows.

Step 4: Confirming the existence of newly created Hive table

As you can see from the above screenshot, the Hive table POST33 was successfully created in the Hive “default” database.

Now, let us check the schema of this newly created Hive table POST33.

  • CHECKING THE SCHEMA OF NEWLY CREATED HIVE TABLE POST33

We can use the formatted description of the Hive table POST33 to see the column names and their respective data types. This formatted description can be printed with the help of the following Hive command.

desc formatted post33;

The output of the above command looks something like this.

Step 5: Checking schema and other properties of newly created Hive table

The above screenshot suggests the following things:

  • Table POST33 contains 3 columns
  • The names of these columns match exactly with those of the input Hive table POST30
  • The POST33 table is an internal/Hive-managed table, as is visible from the Table Type parameter
  • The input format of this POST33 table is TextInputFormat (the default input format)
  • The table POST33 is NOT PARTITIONED
  • The table POST33 is NOT BUCKETED

This information is very important when it comes to getting a sense of the table and its column data types.

If this information is clear, it is time to check the records that got stored in the table POST33.

  • CHECKING THE OUTPUT RECORDS IN THE TABLE POST33

We are going to use a simple SELECT command to fetch all the data residing in the output table POST33.

The command looks as follows.

select * from post33;

The above Hive command generates the following output.

Step 6: Checking the records in newly created Hive table

As can be seen from the above screenshot, the table POST33 contains a total of 9 records with 3 columns, as expected and as mentioned at the start of this tutorial.

This completes the objective of this tutorial.

In the next tutorial, we are going to see how to define a Hive table that stores the data in the ORC File Format instead of the TextInputFormat.

Hope you guys like the contents.

Please check out my website at www.milindjagre.com.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.

Post 32 | HDPCD | Defining a Bucketed Hive Table

Hello everyone, and welcome to the next tutorial in the HDPCD certification series.

In the last tutorial, we saw how to create a Partitioned Hive Table. In this tutorial, we are going to see how to create a Bucketed Hive table.

The process is depicted in the following infographic.

Apache Hive: Creating a BUCKETED table

As you can see from the above picture, it follows the same process as the previous tutorial, with one additional step.

Let us get started, then.

  • CHECKING WHETHER TABLE ALREADY EXISTS IN HIVE

We have used and will continue to use the “show tables;” command to check the pre-existence of the Hive table.

show tables;

If you run the above command in the Hive terminal window, you will get the list of tables that already exist in the “default” database. The following picture shows the execution of the above command.

Step 1: Checking the pre-existence of Hive table

As you can see from the above screenshot, the table post32 does not exist. This indicates that we can go ahead and create the table with the name post32.

  • DEFINING NUMBER OF BUCKETS FOR THE HIVE TABLE

It is the most important step while creating the BUCKETED Hive table. Just as we partitioned the data based on a column in the previous tutorial, in this tutorial we follow a similar concept.

The main difference between PARTITIONING and BUCKETING is that PARTITIONING can be DYNAMIC in nature whereas BUCKETING is always STATIC.

The above sentence indicates that once you define the NUMBER OF BUCKETS, you cannot change it. If you want to change it, then you will have to drop the table and redo the entire process.

Also, defining the number of buckets and the column to be bucketed depends on the data. Let us consider that we want to do bucketing on the column ID with values from 1 to 1000, and that we decide to create 4 buckets. Hive assigns each record to a bucket by hashing the bucketed column and taking the result modulo the number of buckets; for an integer column the hash is the value itself, so the records are distributed as follows.

Bucket 1: IDs with id % 4 == 0 (4, 8, …, 1000)
Bucket 2: IDs with id % 4 == 1 (1, 5, …, 997)
Bucket 3: IDs with id % 4 == 2 (2, 6, …, 998)
Bucket 4: IDs with id % 4 == 3 (3, 7, …, 999)

Once these buckets are formed, a query that filters on the bucketed column can be directed to the specific bucket, saving time and resources.
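
Bucketing also enables efficient sampling: Hive’s TABLESAMPLE clause can read a single bucket instead of scanning the whole table. A hedged sketch of what that looks like once the table holds data:

select * from post32 tablesample(bucket 1 out of 4 on id);
-- reads only the first of the four buckets, roughly a quarter of the data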

I hope this explains how the buckets work in Hive.

Now, let us start designing the schema for this BUCKETED Hive table.

  • DESIGN SCHEMA FOR THE BUCKETED HIVE TABLE

We are going to store the same student information in this BUCKETED Hive table. Therefore, we are going to follow the same schema.

Column 1: ID (Int)
Column 2: Name (String)
Column 3: Gender (String)

Bucketed Column: ID

I have uploaded this schema to my GitHub profile under the HDPCD repository with the name “44_hive_bucketed_table.sql“. You can download this schema by clicking here, and it looks as follows.

create table post32 (
id int,
name string,
gender string
)
clustered by (id) into 4 buckets
row format delimited
fields terminated by ','
stored as textfile;


As you can see from the above schema design, there are a total of 3 columns: ID, Name, and Gender. The number of buckets is defined with the clause “clustered by (id) into 4 buckets”, which indicates that 4 buckets should be created on the column ID. The rest of the things are similar to the usual Hive tables.

Now, if the schema looks good to you, let us go ahead and execute this schema design.

  • CREATE THE BUCKETED HIVE TABLE USING SCHEMA DEFINED ABOVE

You can copy and paste the above schema on the Hive terminal window.

The execution of the above command looks like this.

Step 2: Creating Hive Bucketed Table

As you can see from the above screenshot, we are getting an “OK” message, which indicates that the table post32 was created successfully and there was no error while doing so.

Let us confirm the existence of the Hive BUCKETED table post32.

  • CONFIRMING EXISTENCE OF HIVE BUCKETED TABLE – POST32

We are going to use the “show tables;” command to confirm the existence of the Hive Bucketed table.

show tables;

The execution of the above command looks like this.

Step 3: Confirming the existence of Bucketed Hive Table

As you can see from the above screenshot, the table post32 was created successfully. This confirms that our command worked fine and the intended “post32” table was created.

Now, it is time to check the schema and see the number of buckets for table post32.

  • CHECKING SCHEMA AND NUMBER OF BUCKETS OF BUCKETED HIVE TABLE

We use an extended version of the “desc” command to check the full schema along with the number of buckets and other properties of the Hive table.

This command goes something like this.

desc formatted post32;

The keyword “formatted” tells Hive to show the detailed information of the table post32 in a formatted, readable manner.

The output of the above command looks like this.

Step 4: Checking schema and Number of Buckets of Bucketed Hive table

As can be seen from the above screenshot, the schema of the post32 Hive table matches the schema which we defined while creating this table. Also, as you can see at the bottom, the number of buckets is shown to be 4 and the column on which bucketing is applied is the ID column.

This confirms that Hive was able to process the information that we passed while creating the table. This concludes the process of creating a BUCKETED Hive table.

Hope you guys like the contents. Keep supporting my website and blog like this.

I am graduating in a week’s time and looking for full-time opportunities in the field of Hadoop and Big Data Analytics.

You can check out my website at www.milindjagre.com.

You can check out my LinkedIn profile here. Please like my Facebook page here. Follow me on Twitter here and subscribe to my YouTube channel here for the video tutorials.