Post 14 | HDPCD | Data Transformation to match Hive Schema using Apache Pig

The last tutorial talked about transforming data by reducing the number of columns from the input to the output records. This tutorial is similar, but we are going to take the data transformation process one step further.

This tutorial focuses on matching your input records with the Hive table schema. This includes splitting the input columns into a larger number of output columns and excluding any columns that are not required for the hive table.

Let us get started then. The following infographic shows us the step-by-step process to achieve this.

data transformation using apache pig to match hive schema

As you can see from the above picture, the process is similar to the last tutorial. The difference lies in the transformation itself. If you remember, in the last tutorial we only removed columns, whereas this tutorial performs a more involved transformation of the input columns.

Let us look at the input data, then it will be clear to us what kind of transformation I am talking about.


Milind Jagre,04/23/1991,Hartford CT US
Roushan Kumar,10/22/1989,Stanford CT US
Suvir Gupta,07/25/1989,Houston TX US
Shlok Srivastav,06/23/1989,SF CA US
Raghavendra Singh Raghav,07/06/1988,Boston MA US


As you can see from the above snippet, our input file post14.csv contains only three columns, namely Name, Birth Date, and Location. Our goal now is to transform this input file into the hive table schema shown below.

hive table schema

If you compare the input file post14.csv with the output hive table post14, you will see that the three input columns need to be transformed into eight output columns. The transformation plan is as follows, with a worked example after the list.

  • Input column 1: Name => Output columns 1 & 2: fname + lname
  • Input column 2: Birth Date => Output columns 3, 4 & 5: month + day + year
  • Input column 3: Location => Output columns 6, 7 & 8: city + state + country
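For example, the first input record maps to the output record shown below; the same example also appears as a comment in the pig script later in this post. The output is actually written tab delimited, commas are used here only for readability.

Input : Milind Jagre,04/23/1991,Hartford CT US
Output: Milind, Jagre, 04, 23, 1991, Hartford, CT, US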

I hope the above information makes it clear which input columns are transformed into which output columns. With that settled, let us first create the file in the local file system and then put it into the HDFS location that will serve as the input location for the pig script.

The following commands are used for creating the file in the local file system.

vi post14.csv

#paste the contents of post14.csv in the terminal window

#save the file by pressing (esc):wq(enter)

cat post14.csv
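If you prefer not to use vi, the same file can be created in one shot with a heredoc. This is just an optional alternative sketch and assumes you paste the exact records shown above.

cat > post14.csv << 'EOF'
Milind Jagre,04/23/1991,Hartford CT US
Roushan Kumar,10/22/1989,Stanford CT US
Suvir Gupta,07/25/1989,Houston TX US
Shlok Srivastav,06/23/1989,SF CA US
Raghavendra Singh Raghav,07/06/1988,Boston MA US
EOF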

The following screenshot might help you follow the above steps.

input file content

Once the file is created in the local file system, we need to load it into HDFS, since the pig script reads its input from HDFS. The following commands are used for loading this file into the HDFS location.

hadoop fs -mkdir -p /hdpcd/input/post14/
hadoop fs -put post14.csv /hdpcd/input/post14/
hadoop fs -cat /hdpcd/input/post14/post14.csv

The following screenshot might come in handy to get a sense of the execution of the above commands.

pushing the file to HDFS

Now that we have the file in HDFS, let us take a look at the pig script we are going to run to transform the input data into the output hive table format. I have uploaded this pig script to my GitHub profile under the HDPCD repository, and it looks something like this.

-- data transformation for matching the schema with the hive table
-- the LOAD command loads the data present in the post14.csv file into the input_data pig relation
-- three columns are created while loading the data i.e. name, dob, and location
input_data = LOAD '/hdpcd/input/post14/post14.csv' USING PigStorage(',') AS (name:chararray, dob:chararray, location:chararray);
-- the actual data transformation operation starts now
-- INPUT DATA: Milind Jagre,04/23/1991,Hartford CT US
-- EXPECTED OUTPUT DATA: Milind,Jagre,04,23,1991,Hartford,CT,US
-- the name field is split into First Name and Last Name
-- the dob field is split into Month, Day, and Year to create a clear segregation of the date fields
-- the location field is split into City, State, and Country to create a clear segregation of the geographical fields
hive_data = FOREACH input_data GENERATE
    SUBSTRING(name, 0, INDEXOF(name, ' ', 0)) AS fname,
    TRIM(SUBSTRING(name, INDEXOF(name, ' ', 0), 100)) AS lname,
    SUBSTRING(dob, 0, 2) AS month,
    GetDay(ToDate(dob, 'MM/dd/yyyy')) AS day,
    GetYear(ToDate(dob, 'MM/dd/yyyy')) AS year,
    SUBSTRING(location, 0, INDEXOF(location, ' ', 0)) AS city,
    TRIM(SUBSTRING(location, INDEXOF(location, ' ', 0), INDEXOF(location, ' ', 0) + 3)) AS state,
    TRIM(SUBSTRING(location, INDEXOF(location, ' ', 0) + 3, INDEXOF(location, ' ', 0) + 6)) AS country;
-- once the schema required for the hive table is created, we must store the data in HDFS
-- the data is delimited by the tab character and the output directory is /hdpcd/output/post14
STORE hive_data INTO '/hdpcd/output/post14' USING PigStorage('\t');


If you take a look at the above script, you will notice that the transformation is done in the FOREACH ... GENERATE statement. The pig relation hive_data contains the data that matches the hive schema exactly. To achieve this, we are using the following functions.

  • SUBSTRING(): This function returns a substring of a chararray field. To use it, you pass the field, the start index, and the end index of the substring you want.
  • TRIM(): This function removes the leading and trailing whitespace from a string. It is needed here because the substrings we extract begin with a space character.
  • INDEXOF(): This function returns the index of the first occurrence of a character (or substring) in a string, starting the search from a given index. It is mainly used alongside the SUBSTRING() function, so that the start and end indexes are calculated with the help of INDEXOF().
  • GetDay(): This is a DATE function. As the name itself indicates, it extracts the day of the month in integer format from the input date.
  • GetYear(): This is also a DATE function. It extracts the year in integer format from the input date.
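To see how these functions work together, here is a quick walkthrough for the location field of the first record, 'Hartford CT US'. This is only an illustrative sketch of the index arithmetic used in the script above.

-- location = 'Hartford CT US'
-- INDEXOF(location, ' ', 0)            returns 8 (the position of the first space)
-- SUBSTRING(location, 0, 8)            returns 'Hartford'            --> city
-- TRIM(SUBSTRING(location, 8, 11))     returns TRIM(' CT') = 'CT'    --> state
-- TRIM(SUBSTRING(location, 11, 14))    returns TRIM(' US') = 'US'    --> country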

Once the hive_data pig relation is constructed, we store it in the HDFS location '/hdpcd/output/post14' with the help of the STORE command. The delimiter of the generated output file is specified with PigStorage('\t'), indicating that the output file is TAB DELIMITED.
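By the way, if you want to inspect the transformed records on the console before writing anything to HDFS, you can also paste the LOAD and FOREACH statements from the script into the interactive Grunt shell and DUMP the relation. This is just an optional sketch of that approach, not something you need for the exercise.

pig
grunt> -- paste the LOAD and FOREACH ... GENERATE statements from post14.pig here
grunt> DUMP hive_data;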

Now that we have the pig script ready, let us run this script with the help of the following command.

pig -f post14.pig

The following screenshot might be helpful to observe the output of the above command.

running pig script

The above command gives us the following output.

pig script output terminal window

The above screenshot shows that a total of five records were written to '/hdpcd/output/post14'.

The next logical step is to observe the output records. For doing this, we will execute the following commands.

hadoop fs -ls /hdpcd/output/post14
hadoop fs -cat /hdpcd/output/post14/part-v000-o000-r-00000

The output of these commands can be seen in the below screenshot.

pig script output HDFS file
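A small side note: the exact name of the part file (part-v000-o000-r-00000 above) can vary from run to run, so instead of copying the file name from the ls output, you can simply use a wildcard. An optional sketch:

hadoop fs -cat /hdpcd/output/post14/part*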

The output file is TAB DELIMITED and contains eight columns, as expected. Now it is time to create the hive external table pointing to the location '/hdpcd/output/post14' in HDFS.

We can use the following hive command to create the hive external table post14.

create external table post14(
fname string,
lname string,
month int,
day int,
year int,
city string,
state string,
country string)
row format delimited
fields terminated by '\t'
location '/hdpcd/output/post14';


The above command creates the hive external table, and then we can run a "select" query to view the output in the hive table. This hive command looks as follows.

select * from post14;

The output of the above command looks as follows.

hive table output

The above screenshot shows that the pig script ran successfully and we got the output as expected. With that, we can conclude this tutorial.

To recap, we transformed the input data to match the hive schema using different functions supported by pig. To do this, we loaded the input data from the local file system to HDFS, created the pig script, ran the pig script, created the hive external table, and then ran a sample query to check the output created by the pig script.

I hope the screenshots and commands are helping you understand the concepts. In the next tutorial, we are going to look at the GROUP operation in pig, so please stay tuned for updates.

Have fun. Cheers!

