Hope you are finding the tutorials helpful.
In this tutorial, we are going to look at two transformations that we will use a lot while learning Spark: map() and flatMap(). We will discuss them one by one, then look at the similarities between the two, followed by the differences.
- map() transformation
We can use the map() function to do a number of things. It can transform any kind of data, for example numbers and strings.
Both the input and the output of this transformation are RDDs.
The input and output element types need not be the same for map(). If the input is RDD[String], the output does not have to be RDD[String]; it can just as well be RDD[Int]. map() is called a transformation because it produces a new RDD from an existing one.
each element in RDD -> map() -> output of each element in new RDD
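To make the type change concrete, here is a minimal plain-Python sketch of what map() does to each element. It runs on an ordinary list instead of an RDD, so no Spark installation is needed. The helper name `rdd_map` and the sample words are made up for illustration and are not part of PySpark.

```python
# A plain-Python model of RDD.map(): one output element per input element.
# `rdd_map` is a hypothetical helper, not a PySpark function.
def rdd_map(elements, func):
    return [func(element) for element in elements]

words = ["spark", "map", "flatMap"]   # stands in for an RDD of strings
lengths = rdd_map(words, len)         # strings go in, integers come out

print(lengths)  # [5, 3, 7]

# PySpark equivalent, assuming an existing SparkContext named `sc`:
#   sc.parallelize(words).map(len).collect()
```

The output has the same number of elements as the input; only the type of each element has changed.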
- flatMap() transformation
The flatMap() function is similar to the map() function in some ways and different in others.
flatMap() is called on each element in an RDD, and it can produce zero, one, or many output elements for each element in the input RDD.
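The behaviour described above can be sketched in plain Python as follows. This is only a local model of the semantics, not Spark itself. The helper name `rdd_flat_map` and the sample lines are assumptions made for this example, and it uses `split()` without an argument so that the empty line contributes no words at all.

```python
from itertools import chain

# A plain-Python model of RDD.flatMap(): apply func to every element,
# then concatenate the iterables it returns into one flat list.
# `rdd_flat_map` is a hypothetical helper, not a PySpark function.
def rdd_flat_map(elements, func):
    return list(chain.from_iterable(func(element) for element in elements))

lines = ["Hello World", "Hi", ""]     # stands in for an RDD of strings
words = rdd_flat_map(lines, lambda line: line.split())

print(words)  # ['Hello', 'World', 'Hi']

# PySpark equivalent, assuming an existing SparkContext named `sc`:
#   sc.parallelize(lines).flatMap(lambda line: line.split()).collect()
```

Three input elements produced three output elements here only by coincidence: the first line contributed two words, the second one word, and the empty line none.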
- SIMILARITIES between map() and flatMap()
Both map() and flatMap() are transformations.
Both map() and flatMap() are called on an RDD and return a new RDD.
- DIFFERENCES between map() and flatMap()
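map() applies a function to each element and puts the function's return value into the output RDD as a single element. With flatMap(), the function returns an iterable (for example a list or an iterator) for each element, and Spark adds the items of that iterable to the output RDD instead of the iterable itself.
As a result, map() output has exactly one element per input element, whereas flatMap() output contains the elements of all the returned iterables, flattened into one RDD.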
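The point about iterators can be illustrated with numbers as well. The following is a rough plain-Python sketch, where the generator `with_double` and the sample numbers are invented for this example. It models the flattening step locally with `itertools.chain`, so it does not need Spark to run.

```python
from itertools import chain

def with_double(n):
    # A generator: yields two values for every input number.
    yield n
    yield n * 2

numbers = [1, 2, 3]                   # stands in for an RDD of integers

# Model of map(): one output value per input value.
mapped = [n * 2 for n in numbers]

# Model of flatMap(): the values yielded for every input are merged
# into one flat sequence.
flat_mapped = list(chain.from_iterable(with_double(n) for n in numbers))

print(mapped)       # [2, 4, 6]
print(flat_mapped)  # [1, 2, 2, 4, 3, 6]

# PySpark equivalent of the flatMap() line, assuming a SparkContext `sc`:
#   sc.parallelize(numbers).flatMap(with_double).collect()
```

The map() model keeps three elements for three inputs, while the flatMap() model ends up with six because each input yielded two values.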
The picture below shows the basic difference between map() and flatMap() when applied to the same input RDD.
As you can see, mapRDD contains one list per input element, each holding the words obtained by splitting on a space. flatMapRDD contains all the words as individual elements of a single RDD; there are no nested lists in the result.
We are going to demonstrate the above-mentioned difference programmatically with the help of the following Python file.
|from pyspark import SparkConf, SparkContext|
|conf = SparkConf().setMaster("local").setAppName("Filter")|
|sc = SparkContext(conf = conf)|
|lines = sc.textFile("hdfs://localhost:54310/numeric_input.txt")|
|input_strings = sc.parallelize(["Hello World", "Hi"])|
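|# map(): each line becomes one list of words|
|splitted_strings = input_strings.map(lambda line: line.split(" ")).collect()|
|for words in splitted_strings:|
|    print(words)|
|# flatMap(): the words of all lines end up in one flat list|
|flattened_strings = input_strings.flatMap(lambda line: line.split(" ")).collect()|
|for word in flattened_strings:|
|    print(word)|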
You can see the same file in Notepad++ as follows.
We are going to run the above file with the help of the following command
and it gives us the following output
As you can see from the above two screenshots, map() returns each input line as a list of words, whereas flatMap() returns the words as individual elements, without the enclosing lists.
If we look into details, the output of map() and flatMap() looks something like this
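One way to inspect the two results side by side without a cluster is the plain-Python sketch below. It reuses the tutorial's sample strings and models the two transformations with a list comprehension and `itertools.chain`. The variable names `map_output` and `flat_map_output` are made up for this example; the actual Spark results come from calling collect() on the RDDs in the file above.

```python
from itertools import chain

input_strings = ["Hello World", "Hi"]  # same sample data as the tutorial

# Model of map(): each split result stays together as one element.
map_output = [line.split(" ") for line in input_strings]

# Model of flatMap(): the split results are merged into one flat sequence.
flat_map_output = list(
    chain.from_iterable(line.split(" ") for line in input_strings)
)

print(map_output)       # [['Hello', 'World'], ['Hi']]
print(flat_map_output)  # ['Hello', 'World', 'Hi']
```

The map() model has two elements, both of them lists, while the flatMap() model has three elements, all of them strings.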
This clearly shows us the difference between map() and flatMap() transformations.
Hope this makes sense to you guys.
Thanks for having a read.
Suggestions are welcome. Please do share this in your network if you like it.
Till then, Cheers!