Spark RDD Transformations, Part 1
textFile
Using textFile() we can read a text file as an RDD, with one element per line.
data=sc.textFile("/FileStore/tables/test_first.txt") # RDD
#data.collect() #Action
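Since textFile() splits the file on newlines, each line becomes one element of the resulting RDD. A plain-Python sketch of that idea (the file contents here are made up for illustration):

```python
# Hypothetical file contents; textFile() would yield one RDD element per line.
raw = "spark is fast\nspark is easy"

# Equivalent of sc.textFile(...).collect(): a list of lines.
lines = raw.split("\n")
print(lines)  # ['spark is fast', 'spark is easy']
```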
map
Spark map() is a transformation that applies a function to every element of the RDD and returns a new RDD of the results.
mapFile = data.map(lambda line: line.split(" "))
mapFile.collect()
flatMap
Spark flatMap() applies a function to every element of the RDD and flattens the results, returning a new RDD of the individual items.
mapFile = data.flatMap(lambda line: line.split(" "))
mapFile.collect()
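The difference between map() and flatMap() is easiest to see side by side: map() keeps one output element per input line (here, a list of words), while flatMap() flattens those lists into a single stream of words. A plain-Python analogy, using a made-up pair of lines:

```python
lines = ["spark is fast", "spark is easy"]

# map(): one result per line -> a list of lists
mapped = [line.split(" ") for line in lines]
print(mapped)  # [['spark', 'is', 'fast'], ['spark', 'is', 'easy']]

# flatMap(): the same results, flattened into one list of words
flat = [word for line in lines for word in line.split(" ")]
print(flat)    # ['spark', 'is', 'fast', 'spark', 'is', 'easy']
```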
filter
RDD.filter() returns a new RDD containing only the elements that satisfy the predicate function passed as its argument.
mapFile = data.flatMap(lambda line: line.split(" ")).filter(lambda value : value=="spark")
mapFile.collect()
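filter() keeps only the elements for which the predicate returns True. The pipeline above, sketched in plain Python with sample words:

```python
words = ["spark", "is", "fast", "spark"]

# Keep only the elements equal to "spark", like the lambda passed to filter()
result = [w for w in words if w == "spark"]
print(result)  # ['spark', 'spark']
```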
union
union() combines two or more RDDs, but the result may contain duplicates.
It is the simplest set operation.
rdd1.union(rdd2) outputs an RDD containing the data from both sources.
If duplicates are present in the input RDDs, the output of union() will contain them as well; this can be fixed with distinct().
rdd1 = sc.parallelize(((1,"jan",2016),(3,"nov",2014),(16,"feb",2014)))
rdd2 = sc.parallelize(((5,"dec",2014),(17,"sep",2015)))
rdd3 = sc.parallelize(((6,"dec",2011),(1,"jan",2016)))
rddUnion = rdd1.union(rdd2).union(rdd3)
rddUnion.collect()
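Because union() is a simple concatenation of its inputs, the duplicate (1, "jan", 2016) from rdd1 and rdd3 appears twice in the result; chaining .distinct() would remove it. A plain-Python analogy:

```python
rdd1 = [(1, "jan", 2016), (3, "nov", 2014), (16, "feb", 2014)]
rdd2 = [(5, "dec", 2014), (17, "sep", 2015)]
rdd3 = [(6, "dec", 2011), (1, "jan", 2016)]

# union() concatenates, keeping duplicates
union_all = rdd1 + rdd2 + rdd3
print(union_all.count((1, "jan", 2016)))  # 2 -> the duplicate survives

# distinct() removes duplicates (Spark does not guarantee the order)
deduped = set(union_all)
print(len(deduped))  # 6
```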
intersection
intersection(other) returns the elements that are present in both RDDs.
rdd1 = sc.parallelize(((1,"jan",2016),(3,"nov",2014),(16,"feb",2014)))
rdd2 = sc.parallelize(((5,"dec",2014),(17,"sep",2015),(1,"jan",2016)))
rdd3 = sc.parallelize(((6,"dec",2011),(1,"jan",2016)))
common = rdd1.intersection(rdd2)
common.collect()
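For the RDDs above, only (1, "jan", 2016) occurs in both rdd1 and rdd2, so that is the only element intersection() returns (and, unlike union(), intersection() deduplicates its output). As plain Python sets:

```python
rdd1 = [(1, "jan", 2016), (3, "nov", 2014), (16, "feb", 2014)]
rdd2 = [(5, "dec", 2014), (17, "sep", 2015), (1, "jan", 2016)]

# intersection(): elements present in both inputs
common = set(rdd1) & set(rdd2)
print(common)  # {(1, 'jan', 2016)}
```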
distinct
distinct() removes duplicate elements from the RDD.
rdd1 = sc.parallelize(((1,"jan",2016),(3,"nov",2014),(16,"feb",2014),(3,"nov",2014)))
result = rdd1.distinct()
result.collect()
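distinct() compares whole elements, so the repeated tuple (3, "nov", 2014) collapses to a single occurrence; the order of the surviving elements is not guaranteed. A plain-Python equivalent:

```python
rdd1 = [(1, "jan", 2016), (3, "nov", 2014), (16, "feb", 2014), (3, "nov", 2014)]

# distinct() keeps one copy of each element (Spark does not guarantee order)
result = set(rdd1)
print(len(result))  # 3
```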