Post 16 | HDPCD | Removing records with NULL values from a Pig Relation

Hey, everyone, it’s been so long. Been busy with the final exams and presentations for the last couple of weeks. Now that the semester is over, you can expect frequent and more detailed updates from this blog. So, let us get started then.

In the last tutorial, we saw the GROUP operation in Apache Pig. This tutorial focuses on removing records that contain NULL values, one of the data preprocessing steps commonly used in text mining.

Let us take a look at the steps we are going to follow to achieve it.

operation flowchart

The above flowchart shows us the step-by-step process of meeting this objective in the HDPCD certification. We will perform the tasks in sequence, as follows.


The input file is created in the local file system with the help of the vi editor. For demonstration purposes, I have deliberately put NULL values in the file. I have uploaded this file to the HDPCD repository on my GitHub profile, and you can download it from here.

Once you download the file, you can use the following commands to create it in your local file system.

vi post16.csv





cat post16.csv

The following screenshot might be helpful for you.

input file creation

This input file looks something like this.

Hi This is Milind.
Hi Big Data
This file contains 2 empty lines
and 3 non empty lines



Now, the file is in the local file system. Let us push it to HDFS.

We are going to use the following set of commands to do this.

hadoop fs -mkdir /hdpcd/input/post16

hadoop fs -put post16.csv /hdpcd/input/post16

hadoop fs -cat /hdpcd/input/post16/post16.csv

The following screenshot will give you an idea about the execution of the above commands.

input file to HDFS

Now that we have the file in HDFS, it is time to create the Pig script. I have uploaded this script to the HDPCD repository on my GitHub profile, and you can download it from here.

It looks something like this.

--Removing records with NULL values in a pig relation
--loading the data into the input_data relation
input_data = LOAD '/hdpcd/input/post16/post16.csv' USING PigStorage() AS (line:chararray);
--performing the filter operation to remove records with NULL values
filtered_data = FILTER input_data BY line IS NOT NULL;
--storing the final output in HDFS
STORE filtered_data INTO '/hdpcd/output/post16';


Let us look at each command.

input_data = LOAD '/hdpcd/input/post16/post16.csv' USING PigStorage() AS (line:chararray);

The above command loads the data present in the post16.csv file into the Pig relation input_data. Each line of the file becomes one record, held in a field named line with the chararray datatype.

filtered_data = FILTER input_data BY line IS NOT NULL;

The above command removes the records that contain NULL values. If you remember, our input file contains a total of 6 lines, of which 2 are empty and 4 are non-empty. Therefore, once we execute this command, the 4 non-empty lines will remain and the 2 empty lines will be removed.
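To build intuition for what this filter does per record, here is a small Python sketch. This is a hypothetical stand-in for illustration only, not what Pig executes internally; it assumes each empty line of the file loads as a NULL record.

```python
# Hypothetical Python sketch of: FILTER input_data BY line IS NOT NULL;
# Each element models one record of the input_data relation;
# None stands in for a NULL value (an assumption for illustration).
lines = [
    "Hi This is Milind.",
    None,                               # empty line -> NULL
    "Hi Big Data",
    None,                               # empty line -> NULL
    "This file contains 2 empty lines",
    "and 3 non empty lines",
]

# Keep only the records that are not NULL.
filtered_data = [line for line in lines if line is not None]

print(len(lines))          # 6 records loaded
print(len(filtered_data))  # 4 records survive the filter
```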

STORE filtered_data INTO '/hdpcd/output/post16';

This command stores the filtered data into the HDFS output directory /hdpcd/output/post16.

I hope the above explanation makes sense.
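For readers more comfortable with Python, the whole three-line script can be mimicked with ordinary file I/O. This is a rough local sketch under stated assumptions, using regular files instead of HDFS and treating an empty line as the NULL case; only the LOAD, FILTER, and STORE logic is mirrored.

```python
import os
import tempfile

# Rough local sketch of the Pig script: LOAD -> FILTER -> STORE.
# Uses ordinary files in a temp directory instead of HDFS paths.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "post16.csv")
out_path = os.path.join(workdir, "part-00000")

# Create a small input file with 2 empty lines and 4 non-empty lines.
with open(in_path, "w") as f:
    f.write("Hi This is Milind.\n\nHi Big Data\n\n"
            "This file contains 2 empty lines\nand 3 non empty lines\n")

# LOAD: each line becomes one record (line:chararray).
with open(in_path) as f:
    input_data = [line.rstrip("\n") for line in f]

# FILTER: drop the records that are empty, playing the role of NULL here.
filtered_data = [line for line in input_data if line]

# STORE: write the surviving records to an output "part" file.
with open(out_path, "w") as f:
    f.write("\n".join(filtered_data) + "\n")

print(len(filtered_data))  # 4 records stored
```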

Once you download the post16.pig file, you can use the following commands to create it with the vi editor.

vi post16.pig





cat post16.pig

The following screenshot might come in handy.

pig script for NULL removal

So, we are ready to run this pig script.


Now, it is time to run the Pig script we just created. We will use the following command to run it.

pig -x tez post16.pig

The below screenshot shows us the output of the above command.

running pig script

Once the script finishes, we get some output, which looks like this.

pig script output

From the above screenshot, we can clearly see that our Pig script ran successfully and a total of 4 records (4 lines) got stored in the output file under the /hdpcd/output/post16 HDFS directory.

After this, let us look at the output.


We will go through the HDFS directory /hdpcd/output/post16 and print the contents of the output file to see the results.

For doing this, we will run the following commands.

hadoop fs -ls /hdpcd/output/post16

hadoop fs -cat /hdpcd/output/post16/part-v000-o000-r-00000

Let us observe the output of the above two commands.

HDFS output file

The above screenshot confirms that the output is as per our expectations and we can conclude this tutorial here.

I hope you liked the content. In the next tutorial, we are going to see how to store the output data in HDFS, as we do in every post.

Please follow my blog for further updates. You can click here to subscribe to my YouTube channel, like my Facebook page here, and follow me on Twitter here. Please check out my LinkedIn profile here.

Have fun people. Cheers!

Published by milindjagre

I founded this blog four years ago and am currently working as a Data Scientist Analyst at the Ford Motor Company. I graduated from the University of Connecticut with a Master of Science in Business Analytics and Project Management. I am working hard and learning a lot of new things in the field of Data Science. I am a strong believer in constant, directed effort, with teamwork as the highest priority. Please reach out to me at for further information. Cheers!
