This is the first program in the Spark + Python series. In this tutorial, we are going to look at the filter operation.
The objective of this tutorial is to print only those lines of a file that contain a specified keyword.
Before diving into the actual operations, please review the tools we are going to use, which are described in the Spark + Python : Tools Setup tutorial.
We will follow the steps below to achieve this goal.
- Create Input File in Local System.
- Put file in HDFS.
- Write Python Code.
- Run Python Code written.
- Verify the output.
STEP 1 : CREATE INPUT FILE IN LOCAL FILE SYSTEM
As discussed in the Spark + Python : Tools Setup tutorial, we are going to use Notepad++ to create an input file in the Local File System.
Please follow the screenshots below to complete this step.
- Right-click the directory in the NppFTP window in which you want to create the new file. In my case, I want to create the file under the spark_with_python directory, so I right-clicked that directory and then clicked Create new file.
- It will open a new prompt asking for the file name. I am giving input.txt. It looks something like this.
- Then you should write the content in that file by opening it in Notepad++. You can write the content in this file just like a normal file. Once you are done typing the contents, save the file by pressing Ctrl+S.
- Once you save this file, you can see it being reflected in the same directory. This confirms the creation of input.txt.
STEP 2 : PUT FILE IN HDFS
Once the file is created in the Local File System, our next task is to put this file in HDFS.
For this, we will use the hadoop fs -put command, which copies a local file to the given HDFS path.
The syntax looks like this:
$ hadoop fs -put input.txt /
The screenshot above shows the successful transfer of the file from the local file system to HDFS.
After the transfer, our HDFS looks like this.
STEP 3 : WRITE PYTHON CODE
To write the Python code, we are going to use Notepad++. I have uploaded this code to my GitHub profile, and you can see it below.
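from pyspark import SparkConf, SparkContext

# Run Spark locally and name the application "Filter".
conf = SparkConf().setMaster("local").setAppName("Filter")
sc = SparkContext(conf=conf)

# Read the input file from HDFS; each element of the RDD is one line.
lines = sc.textFile("hdfs://localhost:54310/input.txt")

# Keep only the lines that contain the keyword (case-sensitive match).
filter_lines = lines.filter(lambda x: "Milind" in x)

# Bring the matching lines to the driver and print them with a counter.
for i, line in enumerate(filter_lines.collect(), start=1):
    print("LINE {} {}".format(i, line))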
It looks as follows in the Notepad++ window.
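To see what the filter step does without starting Spark, the same test can be tried on an ordinary Python list. This is only a minimal sketch: the sample lines are made up (they are not the real contents of input.txt), and the names contains_keyword and sample_lines are hypothetical helpers introduced for illustration.

```python
def contains_keyword(line, keyword="Milind"):
    """Same test as the lambda given to lines.filter(): a case-sensitive substring match."""
    return keyword in line


# Hypothetical sample data standing in for the contents of input.txt.
sample_lines = [
    "Hello from Spark",
    "This line mentions Milind",
    "milind in lowercase does not match",
]

# The built-in filter() plays the role of RDD.filter() on a plain list.
matching = list(filter(contains_keyword, sample_lines))

for i, line in enumerate(matching, start=1):
    print("LINE {} {}".format(i, line))
```

Note that the match is case-sensitive, so a line containing only "milind" would be dropped by the Spark job as well.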
STEP 4 : RUN WRITTEN PYTHON CODE
We run the Python code above with the spark-submit command, passing the path of the script file as its argument.
The screenshot below shows the start of execution.
STEP 5 : VERIFY THE OUTPUT
To verify the output, look carefully at the output trace.
Since our input file contains only one line with the keyword Milind, the output should also show only one line.
The screenshot below shows the output window.
As you can see, it printed only one line, so the code is working as expected.
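If you would rather not rely on reading the trace by eye, one option is to count the matching lines in the local copy of the file and compare that number with what the Spark job printed. The helper below is a rough sketch under that assumption; count_matching_lines is a hypothetical name, and it reads the local file, not the copy in HDFS.

```python
import io


def count_matching_lines(path, keyword):
    """Count the lines of a local text file that contain keyword (case-sensitive)."""
    with io.open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if keyword in line)
```

For example, calling count_matching_lines("input.txt", "Milind") from the spark_with_python directory should return the same number of lines that the job printed.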
Hope this helps.
Have a good one.