WordCount in Spark

Hello friends,

Today we are going to implement the famous WordCount program in Spark using spark-shell.

For folks who are not familiar with WordCount: in this implementation, we count the occurrences of each word in the input and present the result as pairs of words and their respective counts.
For example, if my input is as follows:

Hi this is Milind
Hi Big Data

Then the WordCount output will look something like this:
Hi => 2
this => 1
is => 1
Milind => 1
Big => 1
Data => 1

You can clearly see that the left-hand side of each arrow shows a word from the input, whereas the right-hand side shows that word's count.

Now that we know what WordCount is, we will proceed with the implementation in Spark Shell.
To do this, we are going to follow the steps below.

  • Step 1 : Create an input file with the nano command.
  • Step 2 : Start spark-shell.
  • Step 3 : Load input.txt and verify its contents.
  • Step 4 : Apply the WordCount logic (split, map, reduce).
  • Step 5 : Save the output to a file and verify it.

Now we will look into these steps one by one.


We are going to need an input file for implementing the WordCount logic.
We will use the nano command to create this file.
The following screenshots walk you through creating a text file with nano.

(screenshot: nano command syntax to create input.txt)

(screenshot: writing contents in the input.txt file)

(screenshot: verifying the contents were written successfully to input.txt)
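If you prefer not to use an interactive editor, the same file can be created from the command line. This is just a non-interactive alternative to the nano steps above, using the sample input from this post.

```shell
# Non-interactive alternative to nano: write the sample input, then verify it.
printf 'Hi this is Milind\nHi Big Data\n' > input.txt
cat input.txt
```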


(screenshot: starting spark-shell to execute the Spark WordCount commands one by one)


(screenshot: input data loaded from input.txt)

As you can see in the above screenshot, we are taking input.txt as the input file.
Once input.txt is loaded into a variable called input_file, we print the contents of input_file to verify that the file was loaded successfully.
The next step, as discussed, is to split the input data into words and then count the occurrences of each word.

(screenshot: wordcount logic)

As you can see from the above picture, the WordCount logic is divided into three parts:

  • Step 1 : Split each line with SPACE (" ") as the delimiter.
  • Step 2 : Map each word to a default count of 1.
  • Step 3 : Apply the reduceByKey operation, which treats each word as a key and sums the counts for that key.
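The three steps above can be sketched in plain Scala on a local collection, which lets you check what each step produces without a running Spark cluster. The sample lines and variable names mirror the ones used in this post, and groupBy plus sum stands in for Spark's reduceByKey.

```scala
// Plain-Scala sketch of the three WordCount steps (no Spark needed);
// runnable as a Scala script or pasted into the Scala REPL.
val lines = Seq("Hi this is Milind", "Hi Big Data")

// Step 1: split each line with SPACE as the delimiter.
val step1 = lines.flatMap(line => line.split(" "))

// Step 2: map each word to a default count of 1.
val step2 = step1.map(word => (word, 1))

// Step 3: group by word and sum the counts (local stand-in for reduceByKey).
val step3 = step2.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2).sum) }

step3.foreach { case (word, count) => println(s"$word => $count") }
```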

Finally, we print the value of the output variable called step3.
Seeing the printed output confirms that WordCount executed successfully.

Now it is time to store this output in a file.
This is done with the help of the saveAsTextFile() method, as shown in the screenshot.


(screenshot: final wordcount output)

Once you execute the save command, an output directory is created in the current working directory.
You can run the ls command to list the output directory's contents; you will see a _SUCCESS file and a part file.
Print the contents of the part file to view the final output.
All these steps are shown in the screenshot above.
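saveAsTextFile() itself needs a running SparkContext, but the directory layout it produces can be sketched locally. The directory name "output" and the counts below are illustrative, matching the example input used in this post.

```scala
// Local sketch (no Spark needed) of the layout saveAsTextFile() produces:
// an output directory holding a part file plus an empty _SUCCESS marker.
import java.io.{File, PrintWriter}

val counts = Map("Hi" -> 2, "this" -> 1, "is" -> 1,
                 "Milind" -> 1, "Big" -> 1, "Data" -> 1)

val outDir = new File("output")
outDir.mkdirs()

// Spark writes one part file per partition; a single one is mimicked here.
val part = new PrintWriter(new File(outDir, "part-00000"))
counts.foreach { case (word, count) => part.println(s"($word,$count)") }
part.close()

// The empty _SUCCESS file marks a completed job.
new PrintWriter(new File(outDir, "_SUCCESS")).close()
```

Running `ls output` afterwards shows the same `_SUCCESS` and part files you would inspect after a real Spark job.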

The entire file is uploaded to my GitHub profile and looks something like this.

val input_file = sc.textFile("input.txt")
val step1 = input_file.flatMap(line => line.split(" "))
val step2 = step1.map(word => (word, 1))
val step3 = step2.reduceByKey(_ + _)


I believe the explanation and screenshots help.
Hope you have a great read.

Kindly let me know your thoughts.

Published by milindjagre

I founded my blog www.milindjagre.co four years ago and am currently working as a Data Scientist Analyst at the Ford Motor Company. I graduated from the University of Connecticut with a Master of Science in Business Analytics and Project Management. I am working hard and learning a lot of new things in the field of Data Science. I am a strong believer in constant, directional effort, keeping teamwork at the highest priority. Please reach out to me at milindjagre@gmail.com for further information. Cheers!
