In PySpark, RDDs provide two methods for changing the number of partitions: repartition() and coalesce(). These methods allow you to control the partitioning of your RDDs, which can be useful for optimizing data distribution and parallelism in your Spark jobs.


repartition() shuffles the data across the cluster and creates a new RDD with the specified number of partitions; it can either increase or decrease the partition count.
It is a wide transformation that incurs a full shuffle of the data, which can be expensive on large datasets, but it produces roughly evenly sized partitions.

rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # default partition count
repartitioned_rdd = rdd.repartition(3)  # full shuffle into exactly 3 partitions




coalesce() is an optimized way to reduce the number of partitions of an RDD.
By default it avoids a shuffle entirely: instead of redistributing individual records, it merges whole existing partitions into fewer output partitions.
This makes it a narrow transformation that is typically much cheaper than repartition(). Two caveats: without a shuffle, coalesce() can only decrease the partition count (pass shuffle=True if you need to increase it), and the merged partitions may end up unevenly sized.

rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 5)  # 5 initial partitions
coalesced_rdd = rdd.coalesce(2)  # merges existing partitions; no shuffle
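To see why no records cross the network, it helps to picture what a no-shuffle coalesce does: each existing partition is assigned wholesale to one output partition and the groups are concatenated. The following is a simplified pure-Python model of that bucketing (the function name and the round-robin assignment are illustrative; Spark's actual algorithm also accounts for data locality):

```python
def coalesce_no_shuffle(partitions, num_output):
    """Merge existing partitions into num_output buckets without
    redistributing individual records (simplified model of a
    narrow-dependency coalesce; not Spark's real implementation)."""
    buckets = [[] for _ in range(num_output)]
    for i, part in enumerate(partitions):
        # Each input partition lands wholesale in one output bucket.
        buckets[i % num_output].extend(part)
    return buckets

# Five input partitions, as sc.parallelize(..., 5) might produce them
parts = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
print(coalesce_no_shuffle(parts, 2))
# [[1, 2, 5, 6, 9, 10], [3, 4, 7, 8]]
```

Note how the output partitions are just concatenations of intact input partitions, which is also why the result can be unevenly sized.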



When to use repartition() and coalesce():

Use repartition() when you need to increase or decrease the number of partitions and a full shuffle is acceptable or necessary: for example, to spread skewed data evenly across executors, or to increase parallelism before an expensive stage.

Use coalesce() when you want to decrease the number of partitions and avoid a full shuffle. It is more efficient than repartition() in scenarios where the data distribution is already relatively balanced, and you want to reduce the number of partitions.

Keep in mind that repartition() and coalesce() are lazy transformations: each returns a new RDD and leaves the original unchanged, since RDDs are immutable. Assign the result to a variable if you want to work with the modified partitioning.