In the previous tutorial, we saw how to load a Pig Relation without a defined schema. In this tutorial, we are going to load a Pig Relation with a properly defined schema.
It is almost identical to the last tutorial, except for one step, which I will discuss in a moment. Please have a look at the infographic below, which depicts the step-by-step process to construct a Pig Relation with a defined schema out of an input HDFS file.
As you can see from the above picture, the overall process is the same as before.
So let us begin performing all the steps mentioned above.
Let us look at the input data that we have to load into the Pig Relation.
After taking a look at the input data shown above, we can say the following:
- Number of Columns: 7
- Column Datatype: String, int, int, int, int, int, int
- Column Separator: comma (,)
- Column Names: Our choice, we can give these columns any names that we want
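Before loading, it can help to confirm that the file really matches these observations. The following is a minimal shell sketch and not part of the original steps: `validate_csv` is a hypothetical helper name, and it assumes plain comma separators with no quoted fields.

```shell
#!/bin/sh
# Hypothetical helper: check that every line of a CSV file has exactly
# 7 comma-separated fields and that fields 2 to 7 are integers.
# Assumption: no quoted fields and no embedded commas.
validate_csv() {
    awk -F',' '
        BEGIN { bad = 0 }
        NF != 7 {
            printf "line %d: expected 7 fields, found %d\n", NR, NF
            bad = 1
            next
        }
        {
            for (i = 2; i <= 7; i++) {
                if ($i !~ /^-?[0-9]+$/) {
                    printf "line %d: field %d is not an integer: %s\n", NR, i, $i
                    bad = 1
                }
            }
        }
        END { exit bad }
    ' "$1"
}
```

Once input.csv exists, running `validate_csv input.csv` prints nothing and returns 0 when every line fits the layout described above. Otherwise it lists the offending lines and returns a non-zero status.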
Let us create this file in the local file system with the help of the vi command.
Once I paste the contents into the terminal window and save the file, the cat command gives me the following output.
Now that the file is in my local file system, it is time to put it in HDFS, since we are going to run the Pig script in MapReduce mode. The following commands copy the input.csv file from the local file system to HDFS.
hadoop fs -mkdir -p /hdpcd/input/post11
hadoop fs -put input.csv /hdpcd/input/post11
hadoop fs -cat /hdpcd/input/post11/input.csv
The following screenshot confirms that the above commands ran successfully and that the desired input.csv file is now in HDFS under the /hdpcd/input/post11 directory.
This completes step 3 of the infographic shared above.
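If you would rather check the upload from a script than from a screenshot, something like the following could work. It is only a sketch: `check_hdfs_file` is a hypothetical helper name, and it relies on `hadoop fs -test -e`, which exits with status 0 when the given path exists.

```shell
#!/bin/sh
# Hypothetical helper: report whether a path exists in HDFS.
# `hadoop fs -test -e` returns exit status 0 when the path exists.
check_hdfs_file() {
    if hadoop fs -test -e "$1"; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}
```

For this tutorial, the call would be `check_hdfs_file /hdpcd/input/post11/input.csv`.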
--We use LOAD command to load data into a PIG Relation
--The keyword "AS" indicates the Pig Relation creation with a defined schema
data_with_schema = LOAD '/hdpcd/input/post11/input.csv' USING PigStorage(',') AS (station_name:chararray, year:int, month:int, dayofmonth:int, precipitation:int, maxtemp:int, mintemp:int);
--Dumping the structure of the Pig Relation "data_with_schema" created in the above line
DESCRIBE data_with_schema;
--Dumping the actual data stored in Pig Relation "data_with_schema"
DUMP data_with_schema;
If you remember, at the beginning of this tutorial, I mentioned that there is only one difference between this and the previous tutorial. That difference is the AS clause of the LOAD statement in the above post11.pig file.
In that clause, we specify the schema, i.e. the name and datatype of each column, which is applied to the newly created Pig Relation “data_with_schema”.
The DESCRIBE and DUMP commands are executed to confirm that the schema was applied and that the data was loaded into the Pig Relation successfully.
I used the following command to run this Pig Script.
pig -f post11.pig
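As an optional variation, the script can be syntax-checked first and the execution mode stated explicitly. This is a sketch under assumptions: `run_pig_script` is a hypothetical wrapper name, and it relies on Pig's `-check` (syntax check only) and `-x mapreduce` (execution mode) options, whose behaviour may vary slightly between Pig versions.

```shell
#!/bin/sh
# Hypothetical wrapper: syntax-check a Pig script, then run it in MapReduce mode.
run_pig_script() {
    # -check only parses the script; it does not execute it.
    pig -check -f "$1" || return 1
    # -x mapreduce selects the execution mode explicitly.
    pig -x mapreduce -f "$1"
}
```

Calling `run_pig_script post11.pig` stops early if the syntax check fails, which avoids launching a MapReduce job for a script with a typo.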
The following screenshot shows the execution that starts once we run the pig -f post11.pig command.
As you can see in the above screenshot, the data was loaded with the schema.
This confirms that the DESCRIBE command ran successfully. Now, let us see the output of the DUMP command.
The above screenshot confirms that the DUMP command gives us the expected output.
This confirms that our script ran as expected and we got the intended result. This concludes the objective of this tutorial.
I hope the tutorials are making sense and helping you in terms of the concepts and the contents.
In the next tutorial, we are going to see how to load the data from a Hive table into a Pig Relation.