In the previous tutorial, we saw how to load data from Apache Hive into Apache Pig. If you remember, we used HCatalog for that operation. In this tutorial, we are going to see how to do data transformation using Apache Pig. Data transformation itself can be quite involved and can differ from one requirement to another.
For the certification, in simple words, data transformation is the process of changing the values present in some of the fields and/or filtering out some of the columns from the input data. Basically, it is the process of altering the input data content in some way.
Now that the definition is out of the way, let us dive into what we are going to do in this tutorial. We are going to extract the FIRST NAME, LOCATION, and PROGRAM NAME from the student information file, which is our input file.
The infographic for this tutorial summarizes the process, which consists of the following steps.
- Create input file in Local File System
- Push this file in HDFS
- Create Pig Script for doing Data Transformation
- Run this Pig Script
- Observe the output
So let us perform each operation one step at a time.
- INPUT FILE CREATION
I have uploaded the input file for this operation to my GitHub profile and it looks something like this.
4,piyush,hore,new york city,1980,ms
As you can see from the above snippet, the file is comma delimited (CSV) and there is a total of six fields, which I can describe as follows.
- a numeric identifier (4 in the row above)
- first name
- last name
- location
- birth year
- program name
As already mentioned in the objective of this tutorial, we want to extract the first name, location, and program name of each student. Therefore, we will have to extract field numbers 2, 4, and 6. Pig numbers fields starting from zero, so in terms of indexes these become 1, 3, and 5. Take note of these indexes as they are going to be useful while creating the Pig Script.
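Before moving on, the field numbers can be sanity-checked outside Pig. The following is only a small sketch, assuming the single sample row shown earlier; the variable names `row` and `selected` are mine and are not part of the tutorial files. Note that cut counts fields from 1, so it uses 2, 4, and 6, whereas Pig will use 1, 3, and 5.

```shell
# Sample row taken from the input snippet shown earlier.
row='4,piyush,hore,new york city,1980,ms'

# cut numbers fields from 1, so first name, location, and program
# are fields 2, 4, and 6 (Pig refers to the same fields as $1, $3, $5).
selected=$(printf '%s\n' "$row" | cut -d',' -f2,4,6)

echo "$selected"
```

If the positions are right, only the first name, the location, and the program name of that row are printed.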
Now, let us see how to create this input file in the Local File System.
We use the vi command to create this file in the Local File System. Once you run this command, a new screen pops up, where you can paste the content of the post13.txt file shown in the above snippet. Once you are done pasting the contents, press Esc and type :wq to save the contents to that file. You can use the cat command to print the contents back to the terminal window, to make sure you did the job right.
The commands are as follows.
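The exact commands are not reproduced in this text, so here is a minimal sketch of what they could look like. I am assuming the file is called post13.txt and lives in the current directory, and I am writing only the one row shown in the snippet above. The printf line is my own non-interactive substitute for pasting into vi, not the original command.

```shell
# Interactive route described above: open the editor, paste the rows,
# then press Esc and type :wq to save.
#   vi post13.txt

# Non-interactive alternative (an assumption, not the author's command):
# write the single sample row from the snippet into post13.txt.
printf '%s\n' '4,piyush,hore,new york city,1980,ms' > post13.txt

# Print the file back to confirm the contents.
cat post13.txt
```

Either route should leave you with a post13.txt file whose rows have six comma-separated fields.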
This is how we create the file in the Local File System. Let us move on to step 2 and load this file to HDFS.
- LOADING FILE TO HDFS
This step is as simple as it gets. You use the put command to load the post13.txt file to HDFS.
The commands are as follows.
hadoop fs -mkdir -p /hdpcd/input/post13
hadoop fs -put post13.txt /hdpcd/input/post13
hadoop fs -cat /hdpcd/input/post13/post13.txt
As you can see from the above commands, we first create the directory post13 in HDFS. Once the directory is created successfully, we load post13.txt into this directory. The third command prints the file back from HDFS to confirm the upload. Once that is done, we can move ahead with the Pig Script creation part of the tutorial.
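If you prefer a yes/no answer over reading the cat output, the upload can also be checked with the `-test` option of `hadoop fs`. This is just a sketch under the assumption that the file was put at the path used above; the `status` variable and the fallback for machines without a Hadoop client are my own additions.

```shell
# Path used by the put command above.
hdfs_file=/hdpcd/input/post13/post13.txt

if ! command -v hadoop >/dev/null 2>&1; then
  # Assumption: skip the check on machines without a Hadoop client.
  status="hadoop client not found"
elif hadoop fs -test -e "$hdfs_file"; then
  # hadoop fs -test -e exits with 0 when the path exists.
  status="present"
else
  status="missing"
fi

echo "$hdfs_file: $status"
```

On the cluster, a status of present means the file is in place for the Pig Script to load.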
You can refer to the screenshot below, which shows the expected output of each command that we have run so far.
The above screenshot confirms that all the commands ran successfully and we can create the Pig Script now.
- PIG SCRIPT CREATION
The Pig Script follows the steps needed for the data transformation. In the Pig Script, we have to follow the below steps.
- Load the HDFS file in a Pig Relation
- Extract required columns from above Pig Relation and create a new Pig Relation
- Print the contents of the new Pig Relation
Now that we know the steps we have to include in our Pig Script, let us see what it looks like. I have uploaded this Pig Script to my GitHub profile and it looks as follows.
--this file is used for data transformation using Apache Pig
--we will load the data from post13.txt file and transform it

--LOAD command is used for loading the data in input_data pig relation
input_data = LOAD '/hdpcd/input/post13/post13.txt' USING PigStorage(',');

--input_data pig relation is transformed and we extract only
--first name, location and the program from the original input
flat_data = FOREACH input_data GENERATE $1 as fname, $3 as location, $5 as program;

--at last, we print the flat_data pig relation for confirmation
--DUMP command is used for printing the pig relation contents to the terminal window
dump flat_data;
Let us look at each line one by one to get a sense of what is going on.
Line 5: input_data = LOAD '/hdpcd/input/post13/post13.txt' USING PigStorage(',');
Explanation: The LOAD command is used for loading the data from a text file in HDFS into a Pig Relation. In the above line, the contents of the post13.txt file are loaded into a Pig Relation called input_data. USING is a keyword followed by the load function, which here is PigStorage, and the argument of PigStorage is the field delimiter of post13.txt. In this case, post13.txt is a comma delimited file, therefore we have written PigStorage(',') in the above command.
Line 9: flat_data = FOREACH input_data GENERATE $1 as fname, $3 as location, $5 as program;
Explanation: This command iterates through every tuple present in the input_data Pig Relation. The FOREACH keyword makes sure that this iteration takes place over input_data, and GENERATE produces the output consisting of FIRST NAME, LOCATION, and PROGRAM NAME, which are the fields at positions $1, $3, and $5. The output of this operation is stored in a new Pig Relation called flat_data.
Line 13: dump flat_data;
Explanation: This command is used for printing the contents of the flat_data Pig Relation on the command prompt. By looking at the contents of the flat_data Pig Relation, we can conclude whether it worked successfully or not.
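To get a feel for what the dump should print for the sample row, the FOREACH projection can be imitated outside Pig. This is only a rough sketch in shell, assuming the single sample row from the input snippet; awk stands in for Pig here and is not part of the original script, and the variable names are mine. Pig prints each tuple of a relation wrapped in parentheses, so the sketch formats its output the same way.

```shell
# Sample row from the input file.
row='4,piyush,hore,new york city,1980,ms'

# awk numbers fields from 1, so Pig's $1, $3, and $5 correspond
# to awk's $2, $4, and $6. The parentheses mimic Pig's tuple format.
tuple=$(printf '%s\n' "$row" | awk -F',' '{ printf "(%s,%s,%s)\n", $2, $4, $6 }')

echo "$tuple"
```

If the output of the real Pig Script shows a last name or a birth year for this row instead, the positions in the GENERATE clause are the first thing to check.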
- RUNNING PIG SCRIPT
Now that our Pig Script is ready, it is time to execute it. We are going to use the default MapReduce mode as mentioned in this post. We use the pig command to run the Pig Script and it looks as follows.
pig -f post13.pig
Let us look at the command prompt once we run the Pig Script using the above command. It looks as follows.
- OBSERVE THE OUTPUT
Let us take a look at the output.
As you can see from the above screenshot, the Pig Script gives the FIRST NAME, LOCATION, and PROGRAM NAME separated by commas, which was our expectation. From the screenshot, we can conclude that the Pig Script worked successfully and that the data transformation worked according to our requirements.
This concludes our tutorial on Data Transformation using Apache Pig. I hope the tutorial enables you to understand what we are trying to do here. In the next tutorial, we are going to see the Data Transformation to match Hive schema. So, please stay tuned and follow my blog for further updates.
Have fun, everyone.