The last tutorial covered transforming data by reducing the number of columns from input to output records. This tutorial is similar, but takes the data transformation process one step further.
This tutorial focuses on matching your input records with the Hive table schema. This includes splitting the input records into more columns and excluding columns that are not required for the Hive table.
Let us get started then. The following infographic shows us the step-by-step process to achieve this.
As can be seen from the above picture, the process is similar to the last tutorial. The difference lies in the transformation itself. If you remember, in the last tutorial we only removed columns; this tutorial performs a more intensive transformation of the input columns.
Let us look at the input data, and then it will be clear what kind of transformation I am talking about.
Milind Jagre,04/23/1991,Hartford CT US
Roushan Kumar,10/22/1989,Stanford CT US
Suvir Gupta,07/25/1989,Houston TX US
Shlok Srivastav,06/23/1989,SF CA US
Raghavendra Singh Raghav,07/06/1988,Boston MA US
As you can see from the above snippet, our input file post14.csv contains only three columns: Name, Birth Date, and Location. Our goal is to transform this input file into the Hive schema shown below.
If you compare the input post14.csv with the output Hive table post14, you will see that the three input columns need to be transformed into eight output columns. The transformation plan is as follows.
- Input column 1: Name => Output column 1 & 2: fname + lname
- Input column 2: Birth Date => Output column 3, 4 & 5: month + day + year
- Input column 3: Location => Output column 6, 7 & 8: city + state + country
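The per-record mapping above can be sketched in Python. This is a plain-Python illustration of the splitting logic, not the actual Pig script; the field names are taken from the target table, and the `transform` helper is hypothetical.

```python
def transform(record):
    """Split one post14.csv record into the eight target columns."""
    name, dob, location = record.split(",")
    fname, lname = name.split(" ", 1)           # "Milind Jagre" -> ("Milind", "Jagre")
    month, day, year = dob.split("/")           # "04/23/1991" -> ("04", "23", "1991")
    city, state, country = location.split(" ")  # "Hartford CT US" -> three fields
    return [fname, lname, month, day, year, city, state, country]

print(transform("Milind Jagre,04/23/1991,Hartford CT US"))
```

Note that splitting the name on the first space only keeps multi-word last names (such as "Singh Raghav") intact, which mirrors what the Pig script does with SUBSTRING() and INDEXOF().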
I hope the above information makes it clear which input columns are transformed into which output columns. With that settled, let us first create the file in the local file system and then put it into the HDFS location that will serve as the input for the Pig script.
The following commands are used for creating the file in the local file system.
vi post14.csv
#paste the contents of post14.csv in the terminal window
#save the file by pressing (esc):wq(enter)
The following screenshot might help you to follow the above steps.
Once the file is created in the local file system, we need to load it into HDFS, since the Pig script picks up the file from HDFS. The following commands are used for loading this file into the HDFS location.
hadoop fs -mkdir -p /hdpcd/input/post14/
hadoop fs -put post14.csv /hdpcd/input/post14/
hadoop fs -cat /hdpcd/input/post14/post14.csv
The following screenshot might come in handy to get a sense of the execution of the above commands.
Now that we have the file in HDFS, let us take a look at the Pig script we are going to run to transform the input data into the output Hive table format. I have uploaded this Pig script to my GitHub profile under the HDPCD repository, and it looks like this.
-- data transformation for matching schema with hive table
-- LOAD command is used to load the data present in the post14.csv file into the input_data pig relation
-- three columns are created while loading the data i.e. name, dob, and location
input_data = LOAD '/hdpcd/input/post14/post14.csv' USING PigStorage(',') AS (name:chararray, dob:chararray, location:chararray);

-- actual data transformation operation starts now
-- INPUT DATA: Milind Jagre,04/23/1991,Hartford CT US
-- EXPECTED OUTPUT DATA: Milind,Jagre,04,23,1991,Hartford,CT,US
-- name field is split into First Name and Last Name
-- dob field is split into Month, Day, and Year for a clear segregation of the date fields
-- location field is split into city, state, and country for a clear segregation of the geographical fields
hive_data = FOREACH input_data GENERATE
    SUBSTRING(name, 0, INDEXOF(name, ' ', 0)) AS fname,
    TRIM(SUBSTRING(name, INDEXOF(name, ' ', 0), 100)) AS lname,
    SUBSTRING(dob, 0, 2) AS month,
    GetDay(ToDate(dob, 'MM/dd/yyyy')) AS day,
    GetYear(ToDate(dob, 'MM/dd/yyyy')) AS year,
    SUBSTRING(location, 0, INDEXOF(location, ' ', 0)) AS city,
    TRIM(SUBSTRING(location, INDEXOF(location, ' ', 0), INDEXOF(location, ' ', 0) + 3)) AS state,
    TRIM(SUBSTRING(location, INDEXOF(location, ' ', 0) + 3, INDEXOF(location, ' ', 0) + 6)) AS country;

-- once we create the schema required for the hive table, we must store it in HDFS
-- the data is delimited by the tab character and the directory is /hdpcd/output/post14
STORE hive_data INTO '/hdpcd/output/post14' USING PigStorage('\t');
If you take a look at the above script, you will notice that the transformation is done in the FOREACH ... GENERATE statement. The pig relation hive_data contains data that matches the Hive schema exactly. To do this, we use the following functions.
- SUBSTRING(): This function returns a substring of a chararray. It takes the name of the column, the start index, and the end index of the string.
- TRIM(): This function removes leading and trailing spaces from a string.
- INDEXOF(): This function returns the index of a character in a string. It is mainly used alongside the SUBSTRING() function, so that the start and end indexes can be calculated dynamically.
- GetDay(): This is a DATE function. As the name indicates, it extracts the day as an integer from the input date.
- GetYear(): This is also a DATE function. It gives us the year from the input date.
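One caveat worth remembering: the date pattern passed to ToDate() is case-sensitive, so 'MM' means month while 'mm' means minutes. The day/year extraction can be illustrated in Python with the equivalent strptime pattern '%m/%d/%Y' (this is an illustration of the semantics, not Pig code):

```python
from datetime import datetime

# Parse a birth date the way ToDate(dob, 'MM/dd/yyyy') would,
# then pull out the day and year like GetDay()/GetYear().
dob = datetime.strptime("04/23/1991", "%m/%d/%Y")
day = dob.day    # analogous to GetDay()
year = dob.year  # analogous to GetYear()
print(day, year)
```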
Once the hive_data pig relation is constructed, we store it in the HDFS location ‘/hdpcd/output/post14’ with the help of the STORE command. The delimiter for the generated output file is specified via PigStorage(‘\t’), indicating that the output file is TAB DELIMITED.
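To make the tab-delimited layout concrete, here is how one transformed record ends up in the output file. The actual serialization is done by PigStorage; this plain-Python snippet only illustrates the resulting line format:

```python
# The eight fields produced by the FOREACH ... GENERATE statement
fields = ["Milind", "Jagre", "04", "23", "1991", "Hartford", "CT", "US"]

# PigStorage('\t') joins the fields with tab characters, one record per line
line = "\t".join(fields)
print(line)
```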
Now that we have the pig script ready, let us run this script with the help of the following command.
pig -f post14.pig
The following screenshot might be helpful to observe the output of the above command.
The above command gives us the following output.
The above screenshot shows that a total of five records were written to ‘/hdpcd/output/post14’.
The next logical step is to observe the output records. For doing this, we will execute the following commands.
hadoop fs -ls /hdpcd/output/post14
hadoop fs -cat /hdpcd/output/post14/part-v000-o000-r-00000
The output of these commands can be seen in the below screenshot.
The output file is TAB DELIMITED and contains eight columns as expected. Now is the time to create the hive external table pointing to the location ‘/hdpcd/output/post14’ in HDFS.
We can use the following hive command to create the hive external table – post14.
create external table post14(
fname string,
lname string,
month string,
day int,
year int,
city string,
state string,
country string)
row format delimited
fields terminated by '\t'
location '/hdpcd/output/post14';
The above command creates the hive table, after which we can run a “select” query to view the output in the hive table. The query looks as follows.
select * from post14;
The output of the above command looks as follows.
The above screenshot shows that the pig script ran successfully and we got the expected output. We can conclude this tutorial here.
To recap, we transformed the input data to match the hive schema using different functions supported by pig. To do this, we loaded the input data from the local file system into HDFS, created the pig script, ran the pig script, created the hive external table, and then ran a sample query to check the output created by the pig script.
I hope that the screenshots and commands are helping you to understand the concepts. In the next tutorial, we are going to see the GROUP operation in pig. So please stay tuned for the updates.
Have fun. Cheers!