Spark + Python : Union Operation

In this tutorial, we are going to see how the Union operation works.

In English, union means combining two things, and that is exactly what we do here: we attach two RDDs together using the Union operation.

We are using the same input.txt file we used in the last tutorial. To demonstrate the Union operation, we first filter this file on two keywords, “Milind” and “fun”. Once we have the RDDs corresponding to these two keywords, we union them and store the result in a third RDD, which is our final output RDD, and then print its contents.

We are using the following Python code to execute this task.

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("Filter")
sc = SparkContext(conf=conf)

# Read the input file from HDFS
lines = sc.textFile("hdfs://localhost:54310/input.txt")

# Filter the file on the two keywords
milind_lines = lines.filter(lambda x: "Milind" in x)
fun_lines = lines.filter(lambda x: "fun" in x)

# Union the two filtered RDDs into the final output RDD
final_output_rdd = milind_lines.union(fun_lines)

# Bring the results back to the driver and print each line
# with its position in the final RDD
for i, line in enumerate(final_output_rdd.collect(), start=1):
    print("-------")
    print("LINE", i, line)
print("-------")
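Note that Spark's `union()` simply concatenates the two RDDs without removing duplicates, so a line containing both keywords would be printed twice. The pipeline above can be sketched in plain Python, without a Spark cluster; the sample lines below are made up for illustration and are not the real contents of input.txt:

```python
# Hypothetical sample lines standing in for input.txt
sample_lines = [
    "Milind is writing this tutorial",   # matches "Milind"
    "Spark processes data in parallel",  # matches neither keyword
    "Learning Spark is fun",             # matches "fun"
    "Milind thinks Spark is fun",        # matches both keywords
]

# The two filters, mirroring lines.filter(lambda x: ... in x)
milind_lines = [x for x in sample_lines if "Milind" in x]
fun_lines = [x for x in sample_lines if "fun" in x]

# union() behaves like concatenation: the line matching both
# keywords appears once per filtered RDD, i.e. twice overall
final_output = milind_lines + fun_lines
print(final_output)
```

This is the same keep-duplicates behavior you would see from `final_output_rdd.collect()` in the Spark version.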


The following screenshots show the code execution step by step.

The code looks like the following when written in Notepad++.

Union Operation in Spark RDD

After writing the above code, we execute it with the following command.

$ spark-submit

This is shown in the screenshot below.

Union Operation Execution command

The above command gives us the following output.

Union Operation Output

As you can see in the original file, lines 1 and 4 contain the “Milind” and “fun” keywords respectively, and those lines are printed on the output terminal. LINE 1 and LINE 2 on the terminal window refer to line numbers in the final RDD, not in the original file. Hopefully this clears up any confusion about the output.
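If you do not want repeated lines in the final RDD (for example, a line matching both keywords), PySpark provides the `distinct()` transformation, as in `final_output_rdd.distinct()`. Its effect can be sketched in plain Python; the union result below is hypothetical:

```python
# Hypothetical union result in which one line matched both keywords
union_result = [
    "Milind thinks Spark is fun",  # came from the "Milind" filter
    "Learning Spark is fun",       # came from the "fun" filter
    "Milind thinks Spark is fun",  # came from the "fun" filter (duplicate)
]

# Plain-Python analogue of final_output_rdd.distinct().collect();
# dict.fromkeys() de-duplicates while preserving first-seen order
deduped = list(dict.fromkeys(union_result))
print(deduped)
```

Note that `distinct()` triggers a shuffle in Spark and does not guarantee the original ordering, unlike this list-based sketch.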

In this way, we implement the Union operation in Apache Spark with the Python API.

Hope you had a great read.



Published by milindjagre

I founded my blog four years ago and am currently working as a Data Scientist Analyst at the Ford Motor Company. I graduated from the University of Connecticut with a Master of Science in Business Analytics and Project Management. I am working hard and learning a lot of new things in the field of Data Science. I am a strong believer in constant, directed effort, with teamwork as the highest priority. Please reach out to me for further information. Cheers!
