Hello everyone, thanks for coming back for one more tutorial in this HDPCD certification series. In the last tutorial, we saw how to transform input data to match a Hive schema. This tutorial focuses on the next operation provided by Apache Pig: the GROUP operation on a Pig relation.
The GROUP operation in Apache Pig is similar to SQL's GROUP BY clause, with one important difference: instead of collapsing each group into aggregate values, it emits one record per distinct key, pairing that key with a bag of all the tuples that share that key value. Grouping can be performed on one or more columns. To explain this with an example, we are going to GROUP the weather data based on the station name. Let us start, and you will see what I am talking about.
The CSV file shown above can be created with the help of the vi editor, using the following commands.
PASTE post15.csv contents here
save this by pressing (esc):wq(enter)
The following screenshot might help you understand the above commands.
Now, since we are going to run the Pig script in Tez mode, we must put this file in HDFS. We are going to use the following set of commands to achieve this.
hadoop fs -mkdir /hdpcd/input/post15
hadoop fs -put post15.csv /hdpcd/input/post15
hadoop fs -cat /hdpcd/input/post15/post15.csv
Please have a look at the following screenshot to get an idea of how the above commands execute.
This completes loading the input CSV file.
As you can see, this input file contains seven columns, which are explained as follows.
- Column 1: Station Name
- Column 2: Year
- Column 3: Month
- Column 4: Day
- Column 5: Precipitation
- Column 6: Maximum Temperature
- Column 7: Minimum Temperature
From the above list of columns, we will perform the GROUP operation on the first column: Station Name. If you observe the stations in the input data file, LAX and DEN each appear twice, whereas SFO appears only once. So, to set the output expectations right, we should get three records in the output file, one each for SFO, LAX, and DEN.
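To make the expected shape of the result concrete, here is a small Python sketch that mimics what Pig's GROUP does conceptually: one output record per distinct key, holding a bag of all matching tuples. The station names match the tutorial, but the dates and weather values below are made up purely for illustration, since the actual post15.csv contents are pasted separately.

```python
from collections import defaultdict

# Hypothetical rows mirroring post15.csv's seven-column layout:
# (station, year, month, day, precipitation, max_temp, min_temp)
rows = [
    ("LAX", "2016", "01", "01", "0.00", "68", "52"),
    ("DEN", "2016", "01", "01", "0.10", "35", "12"),
    ("SFO", "2016", "01", "01", "0.25", "59", "48"),
    ("LAX", "2016", "01", "02", "0.00", "70", "54"),
    ("DEN", "2016", "01", "02", "0.05", "38", "15"),
]

# Equivalent of "GROUP weather BY $0": collect every tuple
# under its station name; each value is the "bag" for that key.
grouped = defaultdict(list)
for row in rows:
    grouped[row[0]].append(row)

# Five input rows yield exactly three groups, as expected.
for station in sorted(grouped):
    print(station, len(grouped[station]))
# prints: DEN 2, LAX 2, SFO 1 (one line each)
```

This mirrors why the output should contain three records: the group key is unique per output record, and duplicates land inside the bag rather than producing extra rows.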
Now that the expectations for the output are set, let us start writing the Pig script for the GROUP operation. I have uploaded this script to my GitHub profile under the HDPCD repository with the name 13_group_in_pig.pig. You can click here to download it. It looks something like this.
--GROUP OPERATION IN APACHE PIG

--loading weather data in weather relation
weather = LOAD '/hdpcd/input/post15/post15.csv' USING PigStorage(',');

--performing group operation based on station name
--station name is the first column in weather relation, therefore $0
grouped_data = GROUP weather BY $0;

--generating output data with FOREACH ... GENERATE command
--output contains station name as the group and rest of the columns in weather relation
output_data = FOREACH grouped_data GENERATE group, weather;

--storing the final output in HDFS
STORE output_data INTO '/hdpcd/output/post15/';
The above snippet shows that a total of four commands carry out this operation. Let us go through them one by one.
weather = LOAD '/hdpcd/input/post15/post15.csv' USING PigStorage(',');
The above command loads the input CSV file from HDFS into the weather Pig relation. Since it is a comma-separated file, we use PigStorage(',') to parse it correctly.
grouped_data = GROUP weather BY $0;
This is the command that performs the grouping. The GROUP operator groups the tuples of the input relation by the value of some column. In this case we group by the first column, which we reference positionally as $0 because the data was loaded without a schema.
output_data = FOREACH grouped_data GENERATE group,weather;
The above command creates the output Pig relation. Each output record contains the group key, i.e. the station name, together with the bag of weather tuples belonging to that station.
STORE output_data INTO '/hdpcd/output/post15/';
The STORE command is responsible for writing the contents of the output_data relation to HDFS under the /hdpcd/output/post15/ directory.
This means our output files should appear in /hdpcd/output/post15/.
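To give an idea of what the stored records look like, here is a hedged Python sketch that renders one grouped record the way PigStorage typically serializes a (group, bag) pair: the key, a tab, then the bag written as tuples inside braces. The field values are hypothetical, chosen only to match the seven-column layout described earlier.

```python
# A hypothetical grouped record: the group key plus its bag of tuples,
# mirroring one row of the output_data relation from the script above.
group_key = "DEN"
bag = [
    ("DEN", "2016", "01", "01", "0.10", "35", "12"),
    ("DEN", "2016", "01", "02", "0.05", "38", "15"),
]

# PigStorage-style rendering: key, tab, then the bag as {(f1,f2,...),(...)}.
serialized_bag = "{" + ",".join("(" + ",".join(t) + ")" for t in bag) + "}"
line = group_key + "\t" + serialized_bag
print(line)
```

Each of the three stations would produce one such line in the part file, which is why we expect exactly three output records.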
Now, let us create this pig script and run it.
We use the same vi editor to create this Pig script, and then the pig command to run it.
PASTE post15.pig contents here
save this by pressing (esc):wq(enter)
For your reference, please have a look at the screenshot below.
Now that the Pig script is ready, it is time to run it, and we are going to use the following command to do so.
pig -x tez post15.pig
The execution of this command looks as follows.
The output of this pig script looks as follows.
As you can see from the above screenshot, a total of 5 records were read from the input post15.csv file and 3 records were written to the defined HDFS output directory /hdpcd/output/post15, as expected.
We have also received the success message in the above image. Therefore, now it is time to check the output records.
The following commands are used to check the contents of the HDFS output directory.
hadoop fs -ls /hdpcd/output/post15
hadoop fs -cat /hdpcd/output/post15/part-v001-o000-r-00000
The output of the above commands is shown in the screenshot below.
As you can see in the above screenshot, there are a total of 3 records, grouped by station name: DEN, LAX, and SFO.
This confirms that the output is as expected, and we can conclude this tutorial here. I hope these tutorials are conveying the concepts I intend to get across.
In the next tutorial, we are going to see how to remove the NULL values from a pig relation.
Stay tuned for further updates. Thanks for reading.