In this tutorial, we are going to various ways in which we pass functions in Spark using Python API.
I have shown two ways in which functions can be called/created (for user-defined function).
We are going to do the comparison based on filtering capabilities of Spark. For doing this I have created a user-defined function called containsMilind() which you can see in line no 4 of below python code.
Apart from this containsMilind() function, I am using one more function called filter() which you can see in line 7 of below python code.
I would request you guys to go through the code and understand what I am doing with each function.
|from pyspark import SparkConf, SparkContext|
|conf = SparkConf().setMaster("local").setAppName("Filter")|
|sc = SparkContext(conf = conf)|
|return "Milind" in s|
|lines = sc.textFile("hdfs://localhost:54310/input.txt")|
|lambda_function = lines.filter(lambda x: "Milind" in x)|
|linecount = lambda_function.count()|
|i = 1|
|for line in lambda_function.take(linecount):|
|print "LAMBDA FUNCTION LINE " , i , " " + line|
|i = i+1|
|user_function = lines.filter(containsMilind)|
|linecount = user_function.count()|
|i = 1|
|for line in user_function.take(linecount):|
|print "USER FUNCTION LINE " , i , " " + line|
|i = i+1|
Let me explain the code a little bit.
In above code, I am trying to filter out all the lines that contain the keyword “Milind“. I am doing this by two ways. The first way is to create a user-defined function called containsMilind().
- The first way is to create a user-defined function called containsMilind(). In containsMilind(), I am using Python’s functionality to return either true or false. If the input string (each line in input file) contains the keyword “Milind“, then that will appear in the output, otherwise that line will be omitted.
- The second way to get the output is using the default filter() function. It is simpler to write and use. It is a Python function which expects a lambda function as a formal argument. This lambda function is responsible for filter operation on each line in a spark RDD. Each RDD is evaluated for “Milind” keyword. Lines containing “Milind” keyword gets printed while the rest of the lines are omitted.
Following screenshot shows the python code written in Notepad++.
Following is the command which we use to run this Python code.
Below screenshot shows the output of LAMBDA function i.e. filter() function.
Below screenshot shows the output of USER-DEFINED function i.e. containsMilind() function.
I hope you find this article useful and interesting like I did.
Have a good one. Cheers!