Post 15 | HDPCD | Group Data in one or more PIG Relations

Hello everyone, and thanks for coming back for one more tutorial in this HDPCD certification series. In the last tutorial, we saw how to transform the input data to match a Hive schema. This tutorial focuses on the next piece of functionality provided by Apache Pig: the GROUP operation on one or more Pig relations.

The GROUP operation in Apache Pig is similar to GROUP BY in SQL. Grouping can be performed on one or more column values. The output relation contains one record per distinct value of the grouping key, and each record holds that key together with a bag of all the input records that share it. To explain this with an example, we are going to GROUP the weather data by station name. Let us start with this, and it will become clear what I am talking about.

The input data is uploaded to my GitHub profile under the HDPCD repository, and you can download it by clicking here. For your reference, the input file post15.csv looks as follows.


SFO,2008,1,1,90,100,65
LAX,2008,1,2,89,111,67
DEN,2008,1,3,88,123,67
LAX,2009,10,1,12,132,34
DEN,2007,12,12,90,111,11


The csv file shown above can be created with the vi editor, using the following commands.

vi post15.csv

###

PASTE post15.csv contents here

save this by pressing (esc):wq(enter)

###

cat post15.csv

The following screenshot might help you understand the above commands.

pig group operation input file
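If you would rather not paste into vi, the same file can be written straight from the shell. This is only a sketch: it assumes the fields are separated by commas, which is what PigStorage(',') in the script expects, and it works in a scratch directory (the workdir variable is my own addition) so that an existing post15.csv is left alone.

```shell
# Work in a scratch directory so an existing post15.csv is not overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Write the five weather records without opening an editor.
cat > post15.csv <<'EOF'
SFO,2008,1,1,90,100,65
LAX,2008,1,2,89,111,67
DEN,2008,1,3,88,123,67
LAX,2009,10,1,12,132,34
DEN,2007,12,12,90,111,11
EOF

# Sanity checks: number of records, and number of comma-separated fields per record.
record_count=$(wc -l < post15.csv | tr -d ' ')
field_counts=$(awk -F',' '{ print NF }' post15.csv | sort -u)
echo "records: $record_count, fields per record: $field_counts"
```

A single value for the field count means every line has the same number of columns, which is worth confirming before the file goes to HDFS.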

Since we are going to run the pig script in TEZ mode, we must put this file in HDFS. We use the following commands to do that.

hadoop fs -mkdir /hdpcd/input/post15

hadoop fs -put post15.csv /hdpcd/input/post15

hadoop fs -cat /hdpcd/input/post15/post15.csv

Please have a look at the following screenshot to see the execution of the above commands.

pushing input file in HDFS
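The upload can also be written so that it is safe to repeat. The sketch below is a variation on the commands above, not a replacement for them: the -p and -f flags and the guard around the hadoop client are my additions, and the variable names are hypothetical.

```shell
input_dir=/hdpcd/input/post15   # HDFS directory used in this post

if command -v hadoop >/dev/null 2>&1 && [ -f post15.csv ]; then
    # -p also creates missing parent directories and does not fail if the directory exists
    hadoop fs -mkdir -p "$input_dir"
    # -f overwrites a copy left behind by an earlier attempt
    hadoop fs -put -f post15.csv "$input_dir"
    hadoop fs -cat "$input_dir/post15.csv"
    upload_status=attempted
else
    upload_status=skipped
    echo "hadoop client or post15.csv not found; nothing uploaded"
fi
echo "upload: $upload_status"
```

Without -p, the mkdir command fails when /hdpcd/input does not exist yet, which is easy to hit on a fresh sandbox.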

This completes the input csv file loading operation.

As you can see this input file contains seven columns, which are explained as follows.

  • Column 1: Station Name
  • Column 2: Year
  • Column 3: Month
  • Column 4: Day
  • Column 5: Precipitation
  • Column 6: Maximum Temperature
  • Column 7: Minimum Temperature

From the above list of columns, we will perform the group operation on the 1st column: Station Name. If you look at the stations in the input file, LAX and DEN each appear twice, whereas SFO appears only once. So, to set the output expectations right, we should get three records in the output, one each for the SFO, LAX, and DEN stations.
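These expectations can be checked locally before Pig is involved at all. The following is a small sketch with standard shell tools, assuming comma-separated fields; the scratch directory and variable names are mine, not part of the original setup.

```shell
workdir=$(mktemp -d)
cd "$workdir"

cat > post15.csv <<'EOF'
SFO,2008,1,1,90,100,65
LAX,2008,1,2,89,111,67
DEN,2008,1,3,88,123,67
LAX,2009,10,1,12,132,34
DEN,2007,12,12,90,111,11
EOF

# Take the station name (field 1), count how often each one occurs,
# and print the result as STATION=COUNT.
station_counts=$(cut -d',' -f1 post15.csv | sort | uniq -c | awk '{ print $2 "=" $1 }')
echo "$station_counts"

# One line per distinct station is the number of groups to expect from Pig.
group_count=$(printf '%s\n' "$station_counts" | wc -l | tr -d ' ')
echo "distinct stations: $group_count"
```

This prints DEN=2, LAX=2 and SFO=1, followed by "distinct stations: 3", which matches the three output records we expect from the GROUP operation.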

Now that the expectations are set for the output, let us start writing the pig script for the group operation. I have uploaded this script file to my GitHub profile under the HDPCD repository with the name 13_group_in_pig.pig. You can click here to download this pig script. It looks something like this.

--GROUP OPERATION IN APACHE PIG
--loading weather data in weather relation
weather = LOAD '/hdpcd/input/post15/post15.csv' USING PigStorage(',');
--performing group operation based on station name
--station name is the first column in weather relation, therefore $0
grouped_data = GROUP weather BY $0;
--generating output data with FOREACH...GENERATE command
--output contains station name as the group and rest of the columns in weather relation
output_data = FOREACH grouped_data GENERATE group,weather;
--storing the final output in HDFS
STORE output_data INTO '/hdpcd/output/post15/';


The above snippet indicates that there are a total of 4 commands to carry out this operation. Let us go through each command one by one.

weather = LOAD '/hdpcd/input/post15/post15.csv' USING PigStorage(',');

The above command loads the input csv file from HDFS into the weather pig relation. Since it is a csv file, we use PigStorage(',') so that each line is split into fields at the commas.

grouped_data = GROUP weather BY $0;

This is the command that performs the grouping in pig. GROUP collects the records of the input relation that share the same value in the given column. In this case the key is the first column, which is why we pass $0 in the above command. The result has two fields: group, which holds the key, and weather, a bag holding the matching records.
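Grouping on more than one column works the same way, with the key columns listed in parentheses. The sketch below writes a hypothetical variant of the script to a file; the file name, the relation names, the column names and the int types in the AS clause are all my assumptions, since the original script loads the data without a schema. It also counts the distinct (station, year) pairs locally, so we know how many groups to expect.

```shell
workdir=$(mktemp -d)
cd "$workdir"

cat > post15.csv <<'EOF'
SFO,2008,1,1,90,100,65
LAX,2008,1,2,89,111,67
DEN,2008,1,3,88,123,67
LAX,2009,10,1,12,132,34
DEN,2007,12,12,90,111,11
EOF

# The quoted 'EOF' stops the shell from expanding anything inside the Pig script.
cat > post15_by_station_year.pig <<'EOF'
-- hypothetical variant of post15.pig: name the columns, then group on two of them
weather = LOAD '/hdpcd/input/post15/post15.csv' USING PigStorage(',')
          AS (station:chararray, year:int, month:int, day:int,
              precipitation:int, max_temp:int, min_temp:int);
-- a parenthesised list groups on the (station, year) pair
by_station_year = GROUP weather BY (station, year);
-- print the schema of the grouped relation
DESCRIBE by_station_year;
-- one output row per distinct (station, year) pair, with its record count
counts = FOREACH by_station_year GENERATE FLATTEN(group), COUNT(weather);
DUMP counts;
EOF

# Local expectation: the number of distinct (station, year) pairs in the sample data.
pair_count=$(cut -d',' -f1,2 post15.csv | sort -u | wc -l | tr -d ' ')
echo "distinct (station, year) pairs: $pair_count"
```

With the sample data every (station, year) pair occurs once, so this variant should produce five groups instead of three.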

output_data = FOREACH grouped_data GENERATE group,weather;

The above command creates the output pig relation. For each group, it emits the group key, i.e. the station name, followed by the bag of records that belong to that station.
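To get a feel for the shape of these records, we can imitate the grouping locally with awk. This is an emulation, not Pig itself: it assumes comma-separated input and a tab between the key and the bag, and the order of the tuples inside a bag in the real Pig output may differ.

```shell
workdir=$(mktemp -d)
cd "$workdir"

cat > post15.csv <<'EOF'
SFO,2008,1,1,90,100,65
LAX,2008,1,2,89,111,67
DEN,2008,1,3,88,123,67
LAX,2009,10,1,12,132,34
DEN,2007,12,12,90,111,11
EOF

# For every line, append the whole record as a (tuple) to the bag of its station.
# At the end, print one line per station: key, a tab, then the bag in braces.
grouped_output=$(awk -F',' '
    { bag[$1] = (bag[$1] == "" ? "" : bag[$1] ",") "(" $0 ")" }
    END { for (station in bag) print station "\t{" bag[station] "}" }
' post15.csv | sort)

printf '%s\n' "$grouped_output"
```

The result is three lines, one per station. The DEN line, for example, holds the key DEN followed by a bag with both DEN records, each of which still contains the station name as its first field.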

STORE output_data INTO '/hdpcd/output/post15/';

The STORE command is responsible for storing the contents of the output_data relation in HDFS under the /hdpcd/output/post15/ directory.

This indicates that our output files should be in /hdpcd/output/post15/.

Now, let us create this pig script and run it.

We use the same vi editor to create this pig script, and then the pig command to run it.

vi post15.pig

###

PASTE post15.pig contents here

save this by pressing (esc):wq(enter)

###

cat post15.pig

For your reference, please have a look at the below screenshot.

creating pig script for group operation
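As with the input file, the script can be written from the shell instead of being pasted into vi. The sketch below uses a scratch directory of my own choosing and shortens the comments; the four statements are the same as in the script above.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# The quoted 'EOF' matters here: without the quotes the shell would
# replace $0 with its own name before the script reaches the file.
cat > post15.pig <<'EOF'
-- load the weather data; fields are comma-separated
weather = LOAD '/hdpcd/input/post15/post15.csv' USING PigStorage(',');
-- group by station name, the first column ($0)
grouped_data = GROUP weather BY $0;
-- keep the group key and the bag of matching records
output_data = FOREACH grouped_data GENERATE group, weather;
-- write the result to HDFS
STORE output_data INTO '/hdpcd/output/post15/';
EOF

# Quick check: four statements ending in a semicolon, four comment lines.
statement_count=$(grep -c ';$' post15.pig)
comment_count=$(grep -c '^--' post15.pig)
echo "statements: $statement_count, comment lines: $comment_count"
```

Note that Pig comments start with two plain hyphens. If the script is copied from a web page, it is worth checking that they have not been turned into a single long dash.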

Now that the pig script is ready, it is time to run it with the following command.

pig -x tez post15.pig

The execution of this command looks as follows.

running pig script for group operation
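One thing to keep in mind when running the script a second time: STORE fails if the output directory already exists. A possible rerun helper is sketched below. The guard, the variable names and the cleanup step are my additions, and the cleanup deletes the previous output, so use it only when that output is no longer needed.

```shell
output_dir=/hdpcd/output/post15   # same directory as in the STORE command

if command -v hadoop >/dev/null 2>&1 && command -v pig >/dev/null 2>&1 && [ -f post15.pig ]; then
    # Remove the previous result; -f keeps this quiet when the directory does not exist yet.
    hadoop fs -rm -r -f "$output_dir"
    pig -x tez post15.pig
    run_status=attempted
else
    run_status=skipped
    echo "hadoop, pig or post15.pig not found; nothing was run"
fi
echo "run: $run_status"
```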

The output of this pig script looks as follows.

pig script output for group operation

As you can see from the above screenshot, a total of 5 records were taken as input from the input post15.csv file and 3 records were created as the output in the defined HDFS output directory /hdpcd/output/post15, as expected.

We have also received the success message in the above image. Therefore, now it is time to check the output records.

The following commands are used for checking the contents of the HDFS output directory.

hadoop fs -ls /hdpcd/output/post15

hadoop fs -cat /hdpcd/output/post15/part-v001-o000-r-00000

The output of the above commands is shown in the screenshot below.

group operation output in pig
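The same check can be done without knowing the exact name of the part file. This is a sketch with hypothetical variable names, guarded so that it does nothing when no hadoop client is installed.

```shell
output_dir=/hdpcd/output/post15
expected_groups=3   # SFO, LAX and DEN

if command -v hadoop >/dev/null 2>&1; then
    # The quotes pass the glob to HDFS instead of letting the local shell expand it.
    actual_groups=$(hadoop fs -cat "$output_dir/part-*" | wc -l | tr -d ' ')
    if [ "$actual_groups" = "$expected_groups" ]; then
        check_status=match
    else
        check_status=mismatch
    fi
else
    check_status=skipped
fi
echo "output check: $check_status"
```

Since every group is written as one line, the number of lines across all part files should equal the number of distinct stations.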

As you can see in the above screenshot, there is a total of 3 records grouped according to the station names i.e. DEN, LAX, and SFO.

This confirms that the output is coming as expected and we can conclude this tutorial here.

Hope you are getting the concepts which I want to convey through these tutorials.

In the next tutorial, we are going to see how to remove the NULL values from a pig relation.

You can click here to subscribe to my YouTube channel for the video tutorials. You can like my Facebook page here and follow me on twitter here.

Stay tuned for the further updates. Thanks for having a read.

Cheers!

Published by milindjagre

I founded my blog www.milindjagre.co four years ago and am currently working as a Data Scientist Analyst at the Ford Motor Company. I graduated from the University of Connecticut pursuing Master of Science in Business Analytics and Project Management. I am working hard and learning a lot of new things in the field of Data Science. I am a strong believer of constant and directional efforts keeping the teamwork at the highest priority. Please reach out to me at milindjagre@gmail.com for further information. Cheers!

3 thoughts on “Post 15 | HDPCD | Group Data in one or more PIG Relations”

  1. Hi Milind,
    Aren’t the commands on line 8 and line 12 doing the same thing? I got the same result from both of them. Is there any difference that I am missing to understand?

    1. Thanks for the question.

      The commands at lines 8 and 12 accomplish different things.
      The command at line 8 (the GROUP statement) groups the input file by the first column, i.e. Station Name.
      The command at line 12 (the FOREACH statement) iterates through the output generated by line 8 and, for each Station Name, emits the group key together with its records.

      Please correct me if I am wrong. Appreciate it.

      Best,
      Milind
