Post 11 | HDPCD | Load Pig Relation WITH schema

In the previous tutorial, we saw how to load the Pig Relation without a defined schema. In this tutorial, we are going to load a Pig Relation with a properly defined schema.

It is almost identical to the last tutorial, except for one step, which I will discuss in a moment. Please have a look at the infographic below, which depicts the step-by-step process of constructing a Pig Relation out of an input HDFS file with a defined schema.

Apache Pig Relation With Schema

As you can see from the above picture, the process is exactly the same.

So let us begin performing all the steps mentioned above.

Let us look at the input data that we have to load into the Pig Relation.

Input Content

For your reference, I have uploaded this file to my GitHub profile at this location and it looks as follows.


SFO,2008,1,1,90,100,65
LAX,2008,1,2,89,111,67
SFO,2008,1,1,90,100,65
LAX,2008,1,2,89,111,67
DEN,2008,1,3,88,123,67
LAX,2009,10,1,12,132,34
DEN,2007,12,12,90,111,11


After taking a look at the input data shown above, we can note the following:

  • Number of Columns: 7
  • Column Datatypes: chararray (string), int, int, int, int, int, int
  • Column Separator: comma (,)
  • Column Names: our choice; we can give these columns any names we want

Let us create this file in the local file system with the help of the vi command.
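For reference, these are the two commands, assuming you run them from whichever local directory should hold input.csv:

vi input.csv

cat input.csv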

Once I copy-paste the contents into the terminal window and save the file, the cat command gives me the following output.

cat command output

Now that I have the file in my local file system, it is time to put it in HDFS, as we are going to use the MapReduce mode to run the Pig script. The following commands load this input.csv file from the local file system into HDFS.

hadoop fs -mkdir -p /hdpcd/input/post11

hadoop fs -put input.csv /hdpcd/input/post11

hadoop fs -cat /hdpcd/input/post11/input.csv
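As an optional sanity check (not part of the original steps), you can also list the target directory to confirm the file landed:

hadoop fs -ls /hdpcd/input/post11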

The following screenshot confirms that the above commands ran successfully and that we get the desired input.csv file in HDFS under the /hdpcd/input/post11 directory.

Pushing Input File to HDFS

According to the infographic shared above, this completes step 3.

Now, let us build our Pig Script. I have uploaded this Pig Script to my GitHub profile and you can download it here. It looks as follows.

--We use the LOAD command to load data into a Pig Relation
--The keyword "AS" indicates that the Pig Relation is created with a defined schema
data_with_schema = LOAD '/hdpcd/input/post11/input.csv' USING PigStorage(',') AS (station_name:chararray, year:int, month:int, dayofmonth:int, precipitation:int, maxtemp:int, mintemp:int);
--Dumping the structure of the Pig Relation "data_with_schema" created in the above line
DESCRIBE data_with_schema;
--Dumping the actual data stored in the Pig Relation "data_with_schema"
DUMP data_with_schema;

If you remember, at the beginning of this tutorial, I mentioned that there is only one difference between this and the previous tutorial, and that difference is the AS clause in the LOAD statement of the above post11.pig file.

In the above file, we specify the schema, i.e. the name and datatype of each column, to be applied to the newly created Pig Relation “data_with_schema”.
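To make the contrast with the previous tutorial explicit, here is a minimal sketch of the two LOAD variants; the relation name data_without_schema is just an illustrative placeholder:

data_without_schema = LOAD '/hdpcd/input/post11/input.csv' USING PigStorage(',');

data_with_schema = LOAD '/hdpcd/input/post11/input.csv' USING PigStorage(',') AS (station_name:chararray, year:int, month:int, dayofmonth:int, precipitation:int, maxtemp:int, mintemp:int);

Without the AS clause, Pig still loads the data, but the columns have no names or types and can only be referenced positionally as $0, $1, and so on.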

The DESCRIBE and DUMP commands are executed to confirm that the schema was created and the data was pushed to the Pig Relation successfully.

I used the following command to run this Pig Script.

pig -f post11.pig
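As a side note, pig -f runs the script in the default MapReduce mode, which is what we want here because the input sits in HDFS. If you ever need to test a script against the local file system instead, Pig's -x switch selects the execution mode:

pig -x local -f post11.pig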

The following screenshot depicts the execution that starts once we run the above command.

Running Pig Script

As you can see in the above screenshot, the data was loaded with the schema.

station_name: chararray
year: int
month: int
dayofmonth: int
precipitation: int
maxtemp: int
mintemp: int
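In the grunt console, DESCRIBE actually prints this schema on a single line, in the form alias: {field: type, ...}, so with our relation it should look roughly like this:

data_with_schema: {station_name: chararray,year: int,month: int,dayofmonth: int,precipitation: int,maxtemp: int,mintemp: int}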

This confirms that the DESCRIBE command ran successfully. Now, let us see the output of the DUMP command.

Pig Script Output

The above screenshot confirms that the DUMP command gives us the expected output.
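For reference, DUMP prints each record as a parenthesized, comma-separated tuple, so the output should contain the following lines (the row order is not guaranteed in MapReduce mode):

(SFO,2008,1,1,90,100,65)
(LAX,2008,1,2,89,111,67)
(SFO,2008,1,1,90,100,65)
(LAX,2008,1,2,89,111,67)
(DEN,2008,1,3,88,123,67)
(LAX,2009,10,1,12,132,34)
(DEN,2007,12,12,90,111,11)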

This confirms that our script ran as expected and we got the intended result. This concludes the objective of this tutorial.

I hope the tutorials are making sense and helping you in terms of the concepts and the contents.

In the next tutorial, we are going to see how to load the data from a Hive table into a Pig Relation.

You can subscribe to my YouTube channel by clicking here for the video tutorials of HDPCD certification.

Cheers!

 

