
Spark repartition and coalesce


Spark Repartition() vs Coalesce() overview

Spark repartition() vs coalesce() – repartition() is used to increase or decrease the number of partitions of an RDD, DataFrame, or Dataset, whereas coalesce() is used only to decrease the number of partitions, and does so more efficiently because it avoids a full shuffle.

RDD repartition

The Spark RDD repartition() method is used to increase or decrease the number of partitions. The below example decreases the number of partitions from 10 to 4 by moving data from all partitions.

    val rdd2 = rdd1.repartition(4)
    println("Repartition size : " + rdd2.partitions.size)
    rdd2.saveAsTextFile("/tmp/re-partition")

This yields the output Repartition size : 4. repartition() re-distributes the data from all partitions, which is a full shuffle and becomes a very expensive operation when dealing with billions of records.
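To make the "full shuffle" concrete, here is a minimal pure-Python sketch. It is not Spark's actual implementation; the round-robin placement is an assumption for illustration, but it captures the key cost: every element of every input partition is read and reassigned.

```python
# Illustrative sketch only -- not Spark's internal implementation.
# repartition(n) performs a full shuffle: every element in every input
# partition is redistributed (round-robin here) across the n outputs.

def repartition(partitions, n):
    """Redistribute all elements across n new partitions (full shuffle)."""
    out = [[] for _ in range(n)]
    i = 0
    for part in partitions:      # every partition is read ...
        for element in part:     # ... and every element is moved
            out[i % n].append(element)
            i += 1
    return out

# 10 input partitions with 2 elements each, reduced to 4
rdd1 = [[p * 2, p * 2 + 1] for p in range(10)]
rdd2 = repartition(rdd1, 4)
print("Repartition size :", len(rdd2))  # Repartition size : 4
```

Note that even when the partition count only shrinks, every element passes through the shuffle, which is why the operation is expensive at scale.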


RDD coalesce()

Spark RDD coalesce() is used only to reduce the number of partitions. It is an optimized version of repartition(), where the movement of data across partitions is lower.

    val rdd3 = rdd1.coalesce(4)
    println("Coalesce size : " + rdd3.partitions.size)
    rdd3.saveAsTextFile("/tmp/coalesce")

If you compare the resulting output with that of section 1, you will notice that partition 3 has moved into partition 2 and partition 6 has moved into partition 5, so data moves out of just 2 partitions.
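The difference from repartition() can be sketched in pure Python as well. This is an illustration under assumed placement, not Spark's algorithm: coalesce(n) keeps n existing partitions in place and merges only the surplus partitions into them.

```python
# Illustrative sketch only. coalesce(n) avoids a full shuffle: it keeps
# n of the existing partitions where they are and merges the surplus
# partitions into them, so data moves out of only a few partitions.

def coalesce(partitions, n):
    """Merge surplus partitions into the first n; only the surplus moves."""
    kept = [list(p) for p in partitions[:n]]      # these stay in place
    for i, surplus in enumerate(partitions[n:]):  # only these move
        kept[i % n].extend(surplus)
    return kept

rdd1 = [[0], [1], [2], [3], [4], [5]]   # 6 single-element partitions
rdd3 = coalesce(rdd1, 4)
print(rdd3)  # [[0, 4], [1, 5], [2], [3]] -- only the last 2 partitions moved
```

Partitions 1–4 are untouched; only the data of the two surplus partitions is relocated, which is why coalesce() is the cheaper way to shrink the partition count.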

DataFrame repartition()

Similar to RDD, the Spark DataFrame repartition() method is used to increase or decrease the number of partitions. The below example increases the partitions from 5 to 6 by moving data from all partitions.

    val df2 = df.repartition(6)
    println(df2.rdd.partitions.length)

Even decreasing the number of partitions with repartition() moves data from all partitions, hence the recommendation is to use coalesce() when you want to decrease the partition count.
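The recommendation can be quantified with a small pure-Python sketch (again an illustration under assumed round-robin and merge placement, not Spark's implementation): shrinking 5 partitions to 2 via a full shuffle relocates far more elements than merging partitions locally.

```python
# Illustrative sketch only. Even when *decreasing* the partition count,
# repartition() reshuffles every element, while coalesce() relocates only
# the elements of the partitions that are merged away.

parts = [[p * 4 + i for i in range(4)] for p in range(5)]  # 5 partitions x 4 rows

def moved(before, after):
    """Count elements whose partition index changed."""
    where = {e: i for i, p in enumerate(before) for e in p}
    return sum(1 for i, p in enumerate(after) for e in p if where[e] != i)

flat = [e for p in parts for e in p]
repartitioned = [flat[0::2], flat[1::2]]                           # full shuffle to 2
coalesced = [parts[0] + parts[2], parts[1] + parts[3] + parts[4]]  # local merge to 2

print(moved(parts, repartitioned), moved(parts, coalesced))  # 16 12
```

Both end with 2 partitions and the same 20 rows, but the full shuffle touches every row while the merge leaves the two surviving partitions' rows in place.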

DataFrame coalesce()

Spark DataFrame coalesce() is used only to decrease the number of partitions. It is an optimized version of repartition(), where the movement of data across partitions is lower.

    val df3 = df.coalesce(2)
    println(df3.rdd.partitions.length)

This yields the output 2, and the resulting partitions look like:


Partition 1 : 0 1 2 3 8 9 10 11
Partition 2 : 4 5 6 7 12 13 14 15 16 17 18 19
Since we are reducing from 5 partitions to 2, data moves out of only 3 partitions into the 2 that remain.
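The layout above can be reproduced with a short sketch, assuming 20 rows (0–19) spread evenly over 5 partitions and a merge that folds partition 3 into partition 1 and partitions 4 and 5 into partition 2 (the placement is inferred from the listing, not taken from Spark's internals):

```python
# Illustrative sketch reproducing the coalesce(2) layout shown above:
# 20 rows in 5 partitions of 4, merged down to 2 partitions.
parts = [[p * 4 + i for i in range(4)] for p in range(5)]

partition1 = parts[0] + parts[2]               # rows of partition 3 move in
partition2 = parts[1] + parts[3] + parts[4]    # rows of partitions 4 and 5 move in

print("Partition 1 :", *partition1)  # Partition 1 : 0 1 2 3 8 9 10 11
print("Partition 2 :", *partition2)  # Partition 2 : 4 5 6 7 12 13 14 15 16 17 18 19
```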
