In PySpark, pair RDD functions are operations designed for RDDs (Resilient Distributed Datasets) whose elements are key-value pairs. They provide powerful primitives for transforming and aggregating data by key.
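
All of the examples below assume an active SparkContext bound to the name sc (for example, the one provided automatically by the PySpark shell). Outside the shell, a minimal setup looks like the following sketch; the application name is only illustrative:

from pyspark import SparkConf, SparkContext

# Run Spark locally on all available cores; the app name is arbitrary.
conf = SparkConf().setAppName("pair-rdd-examples").setMaster("local[*]")
sc = SparkContext(conf=conf)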

Here are some commonly used RDD pair functions in PySpark:

reduceByKey(func):

Merges the values for each key using an associative and commutative reduce function, resulting in a new RDD with one reduced value per unique key.

rdd = sc.parallelize([(1, 2), (2, 4), (1, 6), (2, 8)])
result = rdd.reduceByKey(lambda a, b: a + b)
print(result.collect())

Output:

[(1, 8), (2, 12)]
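
The reduce function is not limited to addition; any associative and commutative function works. For example, reusing the rdd above to take the maximum value per key:

result = rdd.reduceByKey(lambda a, b: max(a, b))
print(result.collect())

Output:

[(1, 6), (2, 8)]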

groupByKey():

Groups the values of each key together, resulting in a new RDD with keys and an iterable collection of values.

rdd = sc.parallelize([(1, "apple"), (2, "banana"), (1, "orange")])
result = rdd.groupByKey().mapValues(list)  # convert each grouped ResultIterable to a list so it prints readably
print(result.collect())

Output:

[(1, ['apple', 'orange']), (2, ['banana'])]

mapValues(func):

Applies a function to each value of the RDD, preserving the keys.

rdd = sc.parallelize([(1, 2), (2, 4), (3, 6)])
result = rdd.mapValues(lambda x: x * 2)
print(result.collect())

Output:

[(1, 4), (2, 8), (3, 12)]

flatMapValues(func):

Applies a function that returns an iterable to each value of the RDD, pairing the original key with each element of that iterable (i.e., flattening the results).

rdd = sc.parallelize([(1, "apple"), (2, "banana,grape"), (3, "orange")])
result = rdd.flatMapValues(lambda x: x.split(","))
print(result.collect())

Output:

[(1, 'apple'), (2, 'banana'), (2, 'grape'), (3, 'orange')]

keys():

Returns a new RDD with only the keys from the original RDD.

rdd = sc.parallelize([(1, 2), (2, 4), (3, 6)])
result = rdd.keys()
print(result.collect())

Output:

[1, 2, 3]

values():

Returns a new RDD with only the values from the original RDD.

rdd = sc.parallelize([(1, 2), (2, 4), (3, 6)])
result = rdd.values()
print(result.collect())

Output:

[2, 4, 6]

sortByKey(ascending=True):

Sorts the RDD by keys in ascending or descending order.

rdd = sc.parallelize([(3, 'apple'), (1, 'banana'), (2, 'orange')])
result = rdd.sortByKey()
print(result.collect())

Output:

[(1, 'banana'), (2, 'orange'), (3, 'apple')]
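
Passing ascending=False sorts the same rdd in descending key order:

result = rdd.sortByKey(ascending=False)
print(result.collect())

Output:

[(3, 'apple'), (2, 'orange'), (1, 'banana')]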

join(other, numPartitions=None):

Performs an inner join with another RDD based on their common keys.

rdd1 = sc.parallelize([(1, 'apple'), (2, 'banana')])
rdd2 = sc.parallelize([(1, 'red'), (2, 'yellow')])
result = rdd1.join(rdd2)
print(result.collect())

Output:

[(1, ('apple', 'red')), (2, ('banana', 'yellow'))]

leftOuterJoin(other, numPartitions=None):

Performs a left outer join with another RDD based on their common keys, keeping every key from the left RDD; keys with no match in the right RDD get None as the right-side value.

rdd1 = sc.parallelize([(1, 'apple'), (2, 'banana'), (3, 'orange')])
rdd2 = sc.parallelize([(1, 'red'), (4, 'yellow')])
result = rdd1.leftOuterJoin(rdd2)
print(result.collect())

Output:

[(1, ('apple', 'red')), (2, ('banana', None)), (3, ('orange', None))]

rightOuterJoin(other, numPartitions=None):

Performs a right outer join with another RDD based on their common keys, keeping every key from the right RDD; keys with no match in the left RDD get None as the left-side value.

rdd1 = sc.parallelize([(1, 'apple'), (2, 'banana')])
rdd2 = sc.parallelize([(1, 'red'), (3, 'orange')])
result = rdd1.rightOuterJoin(rdd2)
print(result.collect())

Output:

[(1, ('apple', 'red')), (3, (None, 'orange'))]

cogroup(other, numPartitions=None):

Groups values of both RDDs with the same key, resulting in a new RDD with keys and a tuple of value iterables.

rdd1 = sc.parallelize([(1, 'apple'), (2, 'banana'), (1, 'orange')])
rdd2 = sc.parallelize([(1, 'red'), (2, 'yellow'), (2, 'green')])
result = rdd1.cogroup(rdd2).mapValues(lambda v: (list(v[0]), list(v[1])))  # convert the ResultIterables to lists for printing
print(result.collect())

Output:

[(1, (['apple', 'orange'], ['red'])), (2, (['banana'], ['yellow', 'green']))]

These pair functions allow you to perform various operations on RDDs that contain key-value pairs, enabling advanced data transformation and aggregation tasks. They provide a powerful way to work with data in a distributed manner and leverage the parallel processing capabilities of Spark.
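
To see how these functions compose, here is a small sketch (the input sentences are made up) that counts words by building key-value pairs with map, aggregating them with reduceByKey, and ordering the result with sortByKey:

lines = sc.parallelize(["spark makes rdds", "rdds make spark fast"])
# Split each line into words, pair each word with a count of 1,
# sum the counts per word, and sort the result alphabetically by key.
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b)
               .sortByKey())
print(counts.collect())

Output:

[('fast', 1), ('make', 1), ('makes', 1), ('rdds', 2), ('spark', 2)]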