
Spark RDD Transformation part 1


textFile

Using textFile() we can read a text file into an RDD, with one element per line of the file.

				
data = sc.textFile("/FileStore/tables/test_first.txt")  # RDD: one element per line
# data.collect()  # action: returns the lines to the driver
				
			

map

Spark map() is a transformation that applies a function to every element of an RDD and returns a new RDD of the results.

				
mapFile = data.map(lambda line: line.split(" "))  # one list of words per line
mapFile.collect()
				
			

flatMap

Spark flatMap() applies a function to every element of an RDD and flattens the resulting sequences into a single new RDD.

				
flatMapFile = data.flatMap(lambda line: line.split(" "))  # one flat RDD of words
flatMapFile.collect()
				
			
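The difference between map() and flatMap() can be sketched without a cluster. The following is a plain-Python illustration of the semantics, not PySpark itself, and the sample lines are made up for the example:

```python
# Plain-Python sketch of map() vs flatMap() semantics (hypothetical sample data).
lines = ["spark is fast", "spark scales"]

# map-like: one output element per input element -> a nested list of lists
mapped = [line.split(" ") for line in lines]
# -> [['spark', 'is', 'fast'], ['spark', 'scales']]

# flatMap-like: each element may produce many outputs, flattened into one sequence
flat_mapped = [word for line in lines for word in line.split(" ")]
# -> ['spark', 'is', 'fast', 'spark', 'scales']
```

So map() over two lines gives two elements (each a list of words), while flatMap() gives one element per word.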

filter

RDD.filter() returns a new RDD containing only the elements that satisfy the predicate function passed as its argument.

				
sparkWords = data.flatMap(lambda line: line.split(" ")).filter(lambda value: value == "spark")
sparkWords.collect()
				
			
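The chained flatMap-then-filter pipeline above can be sketched in plain Python (the sample line is hypothetical; this mimics the semantics rather than using PySpark):

```python
# Plain-Python sketch of the flatMap + filter chain (hypothetical sample line).
lines = ["spark is fast spark"]
words = [word for line in lines for word in line.split(" ")]  # flatMap step
spark_only = [w for w in words if w == "spark"]               # filter step
# -> ['spark', 'spark']
```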

union

union() combines two or more RDDs; it is the simplest set operation.
rdd1.union(rdd2) returns an RDD containing the data from both sources.
If duplicates are present in the input RDDs, the output of union() will contain them as well; they can be removed with distinct().

				
rdd1 = sc.parallelize(((1, "jan", 2016), (3, "nov", 2014), (16, "feb", 2014)))
rdd2 = sc.parallelize(((5, "dec", 2014), (17, "sep", 2015)))
rdd3 = sc.parallelize(((6, "dec", 2011), (1, "jan", 2016)))
rddUnion = rdd1.union(rdd2).union(rdd3)
rddUnion.collect()
				
			
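Because union() keeps duplicates, the tuple (1, "jan", 2016), which appears in both rdd1 and rdd3 above, shows up twice in the result. This can be sketched in plain Python (list concatenation stands in for union(), and set() stands in for distinct(); this mimics the semantics, it is not PySpark):

```python
# Plain-Python sketch: union() keeps duplicates; distinct() removes them.
rdd1 = [(1, "jan", 2016), (3, "nov", 2014), (16, "feb", 2014)]
rdd3 = [(6, "dec", 2011), (1, "jan", 2016)]

union_result = rdd1 + rdd3        # union keeps the duplicate (1, "jan", 2016)
deduped = list(set(union_result))  # distinct-like behaviour; order not preserved
```

After deduplication, only four unique tuples remain of the five combined elements.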

intersection

intersection(anotherRDD) returns a new RDD containing only the elements present in both RDDs.

				
rdd1 = sc.parallelize(((1, "jan", 2016), (3, "nov", 2014), (16, "feb", 2014)))
rdd2 = sc.parallelize(((5, "dec", 2014), (17, "sep", 2015), (1, "jan", 2016)))
common = rdd1.intersection(rdd2)
common.collect()
				
			
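The intersection semantics can be sketched with plain-Python sets of tuples (again a semantic illustration, not PySpark; the tuples echo the example above):

```python
# Plain-Python sketch of intersection(): only elements in both inputs survive.
rdd1 = [(1, "jan", 2016), (3, "nov", 2014), (16, "feb", 2014)]
rdd2 = [(5, "dec", 2014), (17, "sep", 2015), (1, "jan", 2016)]
common = set(rdd1) & set(rdd2)
# -> {(1, 'jan', 2016)}
```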

distinct

distinct() removes the duplicate elements from the RDD.

				
rdd1 = sc.parallelize(((1, "jan", 2016), (3, "nov", 2014), (16, "feb", 2014), (3, "nov", 2014)))
result = rdd1.distinct()
result.collect()
				
			
