Hi everyone, I hope you are finding the tutorials useful. Starting with this tutorial, we are going to work through the objectives for the HDPCD certification.
In this tutorial, we are going to cover the first objective in the Data Ingestion category. If you go to Hortonworks’ objectives page here, you will find it worded as “Import data from a table in a relational database into HDFS.”, as shown in the screenshot below.
So, let’s get started.
For performing Sqoop Import, we will need some data in MySQL Database Server.
For populating MySQL with some data, we are going to use the retail DB SQL file provided by Cloudera.
You can download this file by clicking here.
The preferred way of downloading this file from the command prompt is to use the wget command, as shown in the screenshot below.
Once you have this file downloaded, please follow below steps to populate your MySQL database server with retail DB.
- Login to MySQL Database Server.
- Create Database with name retail_db.
- Create User with name retail_dba and password hadoop.
- Grant all permissions to retail_dba user.
- Flush privileges.
- Use the retail_db database.
- Load the freshly downloaded retail_db SQL file into the MySQL Database Server.
All of the above steps are shown in the screenshot below.
For your reference, these commands are typed below.
- create database retail_db;
- create user retail_dba identified by 'hadoop';
- grant all on retail_db.* to retail_dba;
- flush privileges;
- use retail_db;
- show tables;
- source /home/horton/retail_db.sql
- show tables;
As you can see from the above screenshot, there are currently no tables in the retail_db database prior to loading the SQL file.
Once the “source /home/horton/retail_db.sql” command runs successfully, your MySQL window will look something like this, showing the six newly created tables in the retail_db database.
Now, it is time to perform the import operation using Apache Sqoop.
We will first look at the data that we want to import from MySQL Database.
We are going to import the categories table from the retail_db database. The screenshot below shows the total number of records in this table along with two sample records from the same table.
As you can see, it contains 58 records and three columns, namely, category_id, category_department_id and category_name.
MySQL Sample Records and Count
We now know what kind of data we want to import. Let us dive into the Sqoop import command.
The command that we are going to use for importing data from the MySQL database looks like this.
sqoop import \
--connect jdbc:mysql://sandbox.hortonworks.com:3306/retail_db \
--username retail_dba \
--password hadoop \
--table categories \
-m 6
Now, let us examine each line.
Line 1: sqoop import \
Explanation: Both sqoop and import are keywords and this command is used when we want to import data from MySQL Database Server into Hadoop. The backslash (\) at the end indicates that the command is continued on the next line.
Line 2: --connect jdbc:mysql://sandbox.hortonworks.com:3306/retail_db \
Explanation: --connect is a keyword which should be followed by a string known as the connection string. The syntax of this connection string is as follows.
jdbc:RDBMS_DATABASE_TYPE://HOSTNAME:PORT_NUMBER/RDBMS_DATABASE_NAME
Therefore, you can see that,
RDBMS_DATABASE_TYPE = mysql
HOSTNAME = sandbox.hortonworks.com
PORT_NUMBER = 3306
RDBMS_DATABASE_NAME = retail_db
The backslash (\) works the same as in Line 1.
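As a quick sanity check, the connection string can be assembled from the four components above. This is a minimal Python sketch; the variable names simply mirror the placeholders used in this tutorial:

```python
# Components of the JDBC connection string, named after the
# placeholders in the syntax shown above.
rdbms_database_type = "mysql"
hostname = "sandbox.hortonworks.com"
port_number = 3306
rdbms_database_name = "retail_db"

connection_string = (
    f"jdbc:{rdbms_database_type}://{hostname}:{port_number}/{rdbms_database_name}"
)
print(connection_string)  # jdbc:mysql://sandbox.hortonworks.com:3306/retail_db
```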
Line 3: --username retail_dba \
Explanation: the --username flag should be followed by a MySQL username which has access to the retail_db database.
Line 4: --password hadoop \
Explanation: --password should be followed by the password for the retail_dba user (the username mentioned in Line 3).
For security purposes, some users do not pass --password in the command. Instead of --password, -P can also be used; Sqoop then prompts for the password during execution, so it does not appear in the command itself.
Line 5: --table categories \
Explanation: --table should be followed by the name of the table in the retail_db MySQL database.
Line 6: -m 6
Explanation: -m (short for --num-mappers) indicates how many mapper tasks run in parallel to speed up the Sqoop import. If the -m option is not specified, then by default, 4 mapper tasks are run, creating 4 output files, as we will see shortly.
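To build some intuition for what -m 6 does, here is a simplified Python model of how Sqoop divides the primary-key range among mappers. The real splitter (Sqoop's IntegerSplitter) handles more edge cases; this sketch assumes evenly spaced integer keys like our category_id column:

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive key range [lo, hi] into num_mappers
    roughly equal sub-ranges, one per mapper (simplified model)."""
    span = hi - lo + 1
    bounds = [lo + round(i * span / num_mappers) for i in range(num_mappers)]
    bounds.append(hi + 1)
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_mappers)]

# 58 category_id values (1..58) split across 6 mappers,
# one sub-range per mapper and hence per part file.
ranges = split_ranges(1, 58, 6)
print(ranges)
```

Each mapper then issues its own SELECT restricted to its sub-range, which is why the import produces one part file per mapper.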
SQOOP Import Command
Once you run the above command, it generates the following output.
SQOOP Import Command Output
The interesting thing about the Sqoop import output is the last line it prints on the command line.
It indicates the number of records retrieved.
In our case, it shows Retrieved 58 Records, which matches the table's row count. This means our import job executed successfully.
Now it is time to see the output of the Sqoop Import.
If you look at the sqoop import command carefully, you will notice that I have not passed a target HDFS output directory (the --target-dir option), which means Sqoop will put the imported data in its default HDFS directory. This default directory has the following syntax: /user/USERNAME/TABLE_NAME.
Therefore, our output directory becomes /user/root/categories.
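The default-path rule is simple enough to express directly. This is a trivial sketch of the rule described above, not Sqoop's actual code:

```python
def default_target_dir(username, table_name):
    # Default HDFS output location when --target-dir is not passed:
    # the table name under the invoking user's HDFS home directory.
    return f"/user/{username}/{table_name}"

print(default_target_dir("root", "categories"))  # /user/root/categories
```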
Now, let us see the contents of this HDFS directory.
hadoop fs -ls /user/root/categories
As you can see in the screenshot below, this directory contains 7 files. Of these 7 files, the _SUCCESS file is empty and is an indicator that our Sqoop import operation executed successfully. The remaining 6 files, also called part files, are the actual output data files (one per mapper task).
Let us examine one of these 6 files. We will use the tail command to see the content of a part file.
hadoop fs -tail /user/root/categories/part-m-00005
As you can see in the screenshot, the output file contains three comma-separated columns, which matches the input MySQL data exactly.
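Because each line in a part file is a plain comma-separated record, it can be split back into the three columns of the categories table. A minimal sketch, using a hypothetical sample line in the same shape as the imported data:

```python
# A hypothetical part-file line in the shape of the categories table:
# category_id, category_department_id, category_name.
line = "1,2,Football"

category_id, category_department_id, category_name = line.split(",")
print(category_id, category_department_id, category_name)
```

Note that a naive split like this assumes no column value itself contains a comma, which holds for this table.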
This completes the first objective of the HDPCD certification. I hope this tutorial helps you learn Sqoop import.
Stay tuned for the next objective: importing the results of a query into HDFS.
You can subscribe to my YouTube channel here for video description of these objectives.