In this tutorial, we are going to see how the union operation works in Apache Spark.
In plain English, union means combining two things. We are going to do the same thing here; the difference is that we are going to combine two RDDs using the union operation.
We are using the same input.txt file we used in the last tutorial. First, we filter this file on two keywords, “Milind” and “fun”. Once we have the RDDs corresponding to these two keywords, we union them and store the result in a third RDD, which is our final output RDD, and then print its contents.
We use the following Python code to execute this task.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("Union")
sc = SparkContext(conf=conf)

lines = sc.textFile("hdfs://localhost:54310/input.txt")

# Filter the input on each keyword, producing two RDDs.
milind_lines = lines.filter(lambda x: "Milind" in x)
fun_lines = lines.filter(lambda x: "fun" in x)

# Combine the two RDDs into the final output RDD.
final_output_rdd = milind_lines.union(fun_lines)

# Print each line with its position in the final RDD.
for i, line in enumerate(final_output_rdd.collect(), start=1):
    print("LINE", i, line)
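If you do not have a Spark cluster at hand, the same filter-and-union logic can be sketched in plain Python with list comprehensions standing in for filter() and list concatenation standing in for union(). The sample lines below are made up for illustration; in the tutorial they come from input.txt on HDFS.

```python
# Made-up sample input standing in for input.txt.
lines = [
    "Milind is learning Spark",
    "Spark has many operations",
    "RDDs make Spark powerful",
    "Learning Spark is fun",
]

# Equivalent of lines.filter(lambda x: "Milind" in x)
milind_lines = [x for x in lines if "Milind" in x]
# Equivalent of lines.filter(lambda x: "fun" in x)
fun_lines = [x for x in lines if "fun" in x]

# union() simply puts the elements of both RDDs together.
final_output = milind_lines + fun_lines

for i, line in enumerate(final_output, start=1):
    print("LINE", i, line)
```

This prints the "Milind" matches first and the "fun" matches after them, mirroring the order you see from the Spark job.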
The following screenshots show the code execution step by step.
The code looks like the following when written in Notepad++.
After writing the above code, we execute it with the following command.
$ spark-submit union.py
This is evident from the screenshot below.
The above command gives us the following output.
As you can see in the original file, line 1 contains the keyword “Milind” and line 4 contains “fun”, and both lines are printed on the output terminal. Note that LINE 1 and LINE 2 in the terminal window refer to line numbers in the final RDD, not in the original file. Hope this clears up any possible confusion about the output.
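One detail worth noting, which this particular input happens to hide: unlike SQL's UNION, Spark's union() does not remove duplicates. A line containing both keywords would pass both filters and appear twice in the final RDD; rdd.distinct() is the operation that removes such duplicates. A small plain-Python sketch of that behaviour (the sample line is made up for illustration):

```python
# A made-up line that matches both filters.
lines = ["Milind thinks Spark is fun"]

milind_lines = [x for x in lines if "Milind" in x]
fun_lines = [x for x in lines if "fun" in x]

# union() keeps both copies, unlike SQL's UNION.
final_output = milind_lines + fun_lines

# Deduplication, analogous to rdd.distinct().
deduplicated = list(set(final_output))

print(len(final_output), len(deduplicated))
```

So if you only want each matching line once, call distinct() on the unioned RDD before printing.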
In this way, we implement the union operation in Apache Spark with the Python API.
Hope you had a great read.