Spark RDD Action


count()

rdd1 = sc.parallelize([1,2,3,4,5,3,2])

The count() action returns the number of elements in the RDD.

				
					rdd1.count()
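
To make the semantics concrete without a running Spark cluster, here is a minimal plain-Python sketch. It assumes the RDD can be modeled as a list of partitions; the partition layout and the helper name `count` are illustrative assumptions, not Spark's internals.

```python
# Hypothetical model: the RDD's elements split across three partitions.
partitions = [[1, 2, 3], [4, 5], [3, 2]]

def count(parts):
    """Count the elements of each partition, then add the partial counts."""
    return sum(len(part) for part in parts)

total = count(partitions)
print(total)  # 7
```

The result matches what rdd1.count() reports for the seven-element list above.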
				
			

collect()

rdd1 = sc.parallelize([1,2,3,4,5,3,2])

The collect() action is the simplest and most common operation: it returns the entire contents of the RDD to the driver program. A typical use is unit testing, where the whole RDD is expected to fit in memory, which makes it easy to compare the contents of the RDD with the expected result.
Because collect() copies all of the data to the driver, the entire dataset must fit in the memory of a single machine.

				
					rdd1.collect()
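
The unit-testing use can be sketched in plain Python. The sketch assumes the RDD is modeled as a list of partitions (a hypothetical layout) and that collecting amounts to concatenating them in order on the driver; `collect` here is an illustrative helper, not the Spark method itself.

```python
from itertools import chain

# Hypothetical model: the RDD's elements split across three partitions.
partitions = [[1, 2, 3], [4, 5], [3, 2]]

def collect(parts):
    """Concatenate every partition into one in-memory list."""
    return list(chain.from_iterable(parts))

result = collect(partitions)
expected = [1, 2, 3, 4, 5, 3, 2]

# In a unit test, the collected list is compared with the expected value.
assert result == expected
print(result)  # [1, 2, 3, 4, 5, 3, 2]
```

Note that the whole dataset ends up in a single list, which is why this pattern only suits data small enough for the driver's memory.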
				
			

take(n)

rdd1 = sc.parallelize([1,2,3,4,5,3,2])

The take(n) action returns n elements from the RDD. It tries to minimize the number of partitions it accesses, so the result can be a biased sample of the data. Elements are taken in partition order, which may not be the order you expect after shuffles or other transformations.

For example, for an RDD created with sc.parallelize([1, 2, 2, 3, 4, 5, 5, 6]), take(4) returns [1, 2, 2, 3], the first four elements in partition order.

				
					rdd1.take(2)
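
The idea of touching as few partitions as possible can be sketched in plain Python. This is a simplification under stated assumptions: the RDD is modeled as a list of partitions (a hypothetical layout), and partitions are scanned one at a time, whereas Spark's actual scheduling of partition scans is more elaborate.

```python
# Hypothetical model: the RDD's elements split across three partitions.
partitions = [[1, 2, 3], [4, 5], [3, 2]]

def take(parts, n):
    """Scan partitions in order and stop as soon as n elements are gathered."""
    taken = []
    for part in parts:
        if len(taken) >= n:
            break  # Enough elements: later partitions are never read.
        taken.extend(part[: n - len(taken)])
    return taken

first_two = take(partitions, 2)
first_four = take(partitions, 4)
print(first_two)   # [1, 2]
print(first_four)  # [1, 2, 3, 4]
```

With n = 2 only the first partition is read, which is the source of the bias described above.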
				
			

top()

rdd1 = sc.parallelize([1,2,3,4,5,3,2])

If the elements of the RDD have an ordering, top(n) returns the n largest elements in descending order. By default, top() uses the natural ordering of the data. In PySpark the number of elements is a required argument.

				
					rdd1.top(2)
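
One way to picture how the largest elements can be found across partitions is the plain-Python sketch below. It assumes the RDD is modeled as a list of partitions (a hypothetical layout) and uses `heapq.nlargest` to pick candidates per partition before merging; the helper name `top` is illustrative.

```python
import heapq

# Hypothetical model: the RDD's elements split across three partitions.
partitions = [[1, 2, 3], [4, 5], [3, 2]]

def top(parts, n):
    """Keep the n largest of each partition, then the n largest of the merge."""
    candidates = []
    for part in parts:
        candidates.extend(heapq.nlargest(n, part))
    return heapq.nlargest(n, candidates)

top_two = top(partitions, 2)
print(top_two)  # [5, 4]
```

Each partition contributes at most n candidates, so the final merge handles far fewer elements than the full dataset.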
				
			

countByValue()

rdd1 = sc.parallelize([1,2,3,4,5,3,2])

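The countByValue() action returns a map in which each key is a distinct value in the dataset and each value is the number of times that key occurs. In Scala the result is a Map[T, Long]; in PySpark it is a dictionary (a defaultdict of int).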

				
					rdd1.countByValue()
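
The counting can be sketched in plain Python with `collections.Counter`. The sketch assumes the RDD is modeled as a list of partitions (a hypothetical layout): each partition is counted on its own and the partial counts are merged. The helper name `count_by_value` is illustrative.

```python
from collections import Counter

# Hypothetical model: the RDD's elements split across three partitions.
partitions = [[1, 2, 3], [4, 5], [3, 2]]

def count_by_value(parts):
    """Count values per partition, then merge the partial counts."""
    merged = Counter()
    for part in parts:
        merged.update(Counter(part))
    return dict(merged)

counts = count_by_value(partitions)
print(counts)  # {1: 1, 2: 2, 3: 2, 4: 1, 5: 1}
```

The values 2 and 3 each occur twice in the example list, so their counts are 2.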
				
			

min()

rdd1 = sc.parallelize([1,2,3,4,5,3,2])

Return the minimum value from the dataset.

				
					rdd1.min()
				
			

max()

rdd1 = sc.parallelize([1,2,3,4,5,3,2])

Return the maximum value from the dataset.

				
					rdd1.max()
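
Both the minimum and the maximum can be viewed as a reduction over the data. The plain-Python sketch below assumes the RDD is modeled as a list of partitions (a hypothetical layout); the helper names `rdd_min` and `rdd_max` are illustrative and not part of Spark.

```python
from functools import reduce

# Hypothetical model: the RDD's elements split across three partitions.
partitions = [[1, 2, 3], [4, 5], [3, 2]]

def rdd_min(parts):
    """Take the minimum of each non-empty partition, then reduce those."""
    return reduce(min, (min(part) for part in parts if part))

def rdd_max(parts):
    """Take the maximum of each non-empty partition, then reduce those."""
    return reduce(max, (max(part) for part in parts if part))

smallest = rdd_min(partitions)
largest = rdd_max(partitions)
print(smallest, largest)  # 1 5
```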
				
			
