
Spark RDD Transformations, Part 2


sortByKey

sortByKey() sorts the records of a pair RDD by key, in ascending order by default.

    # A list of (key, value) pairs
    myRDD = [(1, 2582), (10, 3222), (4, 4190), (8, 2502), (2, 2537)]
    data = sc.parallelize(myRDD)
    # Sort by key in ascending order
    data.sortByKey().collect()
    # [(1, 2582), (2, 2537), (4, 4190), (8, 2502), (10, 3222)]
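sortByKey() also takes an ascending flag, so the same RDD can be sorted in reverse:

    # Sort by key in descending order
    data.sortByKey(ascending=False).collect()
    # [(10, 3222), (8, 2502), (4, 4190), (2, 2537), (1, 2582)]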

sortBy

sortBy() sorts an RDD by the result of a key function, so records can be ordered by any field rather than only by the first element of a pair. It is a method available on any RDD.

    b = sc.parallelize([('t', 3, 3), ('b', 4, 1), ('c', 1, 2)])
    # Sort the tuples by their third element
    bSorted = b.sortBy(lambda a: a[2])
    bSorted.collect()
    # [('b', 4, 1), ('c', 1, 2), ('t', 3, 3)]
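sortBy() accepts the same ascending flag; reusing the RDD above, a descending sort looks like this:

    # Descending sort on the third element
    b.sortBy(lambda a: a[2], ascending=False).collect()
    # [('t', 3, 3), ('c', 1, 2), ('b', 4, 1)]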

groupBy

groupBy() groups the elements of an RDD by the result of a function and returns an RDD of (key, grouped items) pairs.

    rdd = sc.parallelize([1, 1, 2, 3, 5, 8])
    # Group the numbers by parity: key 0 for even, 1 for odd
    result = rdd.groupBy(lambda x: x % 2).collect()
    sorted([(x, sorted(y)) for (x, y) in result])
    # [(0, [2, 8]), (1, [1, 1, 3, 5])]
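Note that the grouped values come back as an iterable rather than a plain list, so a common follow-up is mapValues(list) to materialize each group:

    # Convert each group's iterable into a list
    rdd.groupBy(lambda x: x % 2).mapValues(list).collect()
    # e.g. [(0, [2, 8]), (1, [1, 1, 3, 5])] (group order may vary)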

reduceByKey

PySpark's reduceByKey() transformation merges the values for each key using an associative and commutative reduce function. It is a wide transformation, since it shuffles data across multiple partitions, and it operates on pair RDDs (key/value pairs).

    words = ["one", "two", "two", "four", "five", "six", "six", "eight", "nine", "ten", "three"]
    data = sc.parallelize(words)
    # Map each word to (word, 1), then sum the counts per key
    data = data.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)
    data.collect()
    # e.g. [('two', 2), ('six', 2), ('one', 1), ...] (order may vary)
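The same count could be written with groupByKey(), but reduceByKey() is usually preferred because it pre-aggregates values on each partition before the shuffle. A rough sketch of the equivalent, less efficient version:

    # Word count via groupByKey: ships all (word, 1) pairs before summing
    sc.parallelize(words).map(lambda w: (w, 1)) \
        .groupByKey().mapValues(sum).collect()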

join

join() returns a pair RDD containing, for each key that appears in both RDDs, every combination of its values from the two sides. In the following example, the two RDDs share the keys 'A', 'b', and 'c'; joining them yields one element per matching value pair.

    data  = sc.parallelize((('A', 1), ('b', 2), ('c', 3)))
    data2 = sc.parallelize((('A', 4), ('A', 6), ('b', 7), ('c', 3), ('c', 8)))
    # Inner join on key: values from data2 come first, then values from data
    res = data2.join(data)
    res.collect()
    # [('A', (4, 1)), ('A', (6, 1)), ('b', (7, 2)), ('c', (3, 3)), ('c', (8, 3))] (order may vary)
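Keys present in only one RDD are dropped by join(); the outer variants keep them and fill the missing side with None. A minimal sketch, using a hypothetical ('d', 9) pair that has no match in data:

    # leftOuterJoin keeps every key from the left RDD
    left = sc.parallelize([('A', 1), ('d', 9)])  # ('d', 9) is illustrative
    left.leftOuterJoin(data).collect()
    # [('A', (1, 1)), ('d', (9, None))] (order may vary)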
