In PySpark, RDD (Resilient Distributed Dataset) pair functions are operations that work on RDDs of key-value pairs (pair RDDs). These functions provide powerful key-based operations for data transformation and aggregation.
Here are some commonly used RDD pair functions in PySpark:
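All of the snippets below assume an existing SparkContext named sc. As a minimal sketch (the master URL and application name here are placeholders), a local context could be created like this:

from pyspark import SparkContext

# Create a local SparkContext; the variable name sc matches the examples below.
sc = SparkContext("local[*]", "pair-rdd-examples")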
reduceByKey(func):
Merges the values for each key using an associative and commutative reduce function, producing a new RDD with one entry per unique key.
rdd = sc.parallelize([(1, 2), (2, 4), (1, 6), (2, 8)])
result = rdd.reduceByKey(lambda a, b: a + b)
print(result.collect())
Output:
[(1, 8), (2, 12)]
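As an additional sketch (the data and variable names here are hypothetical, not part of the example above), reduceByKey is the usual way to express a word count once each word is paired with a count of 1:

words = sc.parallelize(["spark", "rdd", "spark", "pair", "rdd", "spark"])
# Pair each word with 1, then sum the counts per word.
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())

Output (order may vary):
[('spark', 3), ('rdd', 2), ('pair', 1)]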
groupByKey():
Groups the values of each key together, resulting in a new RDD with keys and an iterable collection of values (a ResultIterable, converted to a list below so the output is readable).
rdd = sc.parallelize([(1, "apple"), (2, "banana"), (1, "orange")]) result = rdd.groupByKey() print(result.collect())
Output:
[(1, ['apple', 'orange']), (2, ['banana'])]
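The grouped iterable can then be consumed with mapValues. As a small sketch (hypothetical data), here is a per-key average; note that for simple aggregations such as sums, reduceByKey is generally preferred because it combines values on each partition before the shuffle:

rdd = sc.parallelize([(1, 10), (1, 20), (2, 30)])
# Materialize each group as a list, then reduce it to an average per key.
averages = rdd.groupByKey().mapValues(list).mapValues(lambda vals: sum(vals) / len(vals))
print(averages.collect())

Output (order may vary):
[(1, 15.0), (2, 30.0)]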
mapValues(func):
Applies a function to each value of the RDD, preserving the keys.
rdd = sc.parallelize([(1, 2), (2, 4), (3, 6)])
result = rdd.mapValues(lambda x: x * 2)
print(result.collect())
Output:
[(1, 4), (2, 8), (3, 12)]
flatMapValues(func):
Applies a function to each value that returns an iterable, then flattens the result, pairing each produced element with the original key.
rdd = sc.parallelize([(1, "apple"), (2, "banana,grape"), (3, "orange")]) result = rdd.flatMapValues(lambda x: x.split(",")) print(result.collect())
Output:
[(1, 'apple'), (2, 'banana'), (2, 'grape'), (3, 'orange')]
keys():
Returns a new RDD with only the keys from the original RDD.
rdd = sc.parallelize([(1, 2), (2, 4), (3, 6)])
result = rdd.keys()
print(result.collect())
Output:
[1, 2, 3]
values():
Returns a new RDD with only the values from the original RDD.
rdd = sc.parallelize([(1, 2), (2, 4), (3, 6)])
result = rdd.values()
print(result.collect())
Output:
[2, 4, 6]
sortByKey(ascending=True):
Sorts the RDD by key, in ascending order by default; pass ascending=False to sort in descending order.
rdd = sc.parallelize([(3, 'apple'), (1, 'banana'), (2, 'orange')])
result = rdd.sortByKey()
print(result.collect())
Output:
[(1, 'banana'), (2, 'orange'), (3, 'apple')]
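For descending order, pass ascending=False (a small variation on the example above, reusing the same rdd):

result = rdd.sortByKey(ascending=False)
print(result.collect())

Output:
[(3, 'apple'), (2, 'orange'), (1, 'banana')]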
join(other, numPartitions=None):
Performs an inner join with another RDD based on their common keys.
rdd1 = sc.parallelize([(1, 'apple'), (2, 'banana')])
rdd2 = sc.parallelize([(1, 'red'), (2, 'yellow')])
result = rdd1.join(rdd2)
print(result.collect())
Output:
[(1, ('apple', 'red')), (2, ('banana', 'yellow'))]
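Because join produces values of the form (left_value, right_value), a follow-up map or mapValues is often used to reshape the result. A small sketch continuing the join above:

# Turn each (key, (fruit, color)) pair into a readable string.
labels = result.map(lambda kv: "%d: %s is %s" % (kv[0], kv[1][0], kv[1][1]))
print(labels.collect())

Output (order may vary):
['1: apple is red', '2: banana is yellow']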
leftOuterJoin(other, numPartitions=None):
Performs a left outer join with another RDD on the keys, keeping all the keys from the left RDD; keys with no match in the right RDD get None as the right value.
rdd1 = sc.parallelize([(1, 'apple'), (2, 'banana'), (3, 'orange')])
rdd2 = sc.parallelize([(1, 'red'), (4, 'yellow')])
result = rdd1.leftOuterJoin(rdd2)
print(result.collect())
Output:
[(1, ('apple', 'red')), (2, ('banana', None)), (3, ('orange', None))]
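Because unmatched keys carry None on the right side, a mapValues step is often added to substitute a default. A small sketch continuing the example above (the default value "unknown" is arbitrary):

# Replace missing right-hand values with a default.
with_default = result.mapValues(lambda v: (v[0], v[1] if v[1] is not None else "unknown"))
print(with_default.collect())

Output (order may vary):
[(1, ('apple', 'red')), (2, ('banana', 'unknown')), (3, ('orange', 'unknown'))]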
rightOuterJoin(other, numPartitions=None):
Performs a right outer join with another RDD on the keys, keeping all the keys from the right RDD; keys with no match in the left RDD get None as the left value.
rdd1 = sc.parallelize([(1, 'apple'), (2, 'banana')])
rdd2 = sc.parallelize([(1, 'red'), (3, 'orange')])
result = rdd1.rightOuterJoin(rdd2)
print(result.collect())
Output:
[(1, ('apple', 'red')), (3, (None, 'orange'))]
cogroup(other, numPartitions=None):
Groups values of both RDDs with the same key, resulting in a new RDD with keys and a tuple of value iterables.
rdd1 = sc.parallelize([(1, 'apple'), (2, 'banana'), (1, 'orange')])
rdd2 = sc.parallelize([(1, 'red'), (2, 'yellow'), (2, 'green')])
# Convert each pair of ResultIterables to lists so the collected output is readable.
result = rdd1.cogroup(rdd2).mapValues(lambda groups: (list(groups[0]), list(groups[1])))
print(result.collect())
Output:
[(1, (['apple', 'orange'], ['red'])), (2, (['banana'], ['yellow', 'green']))]
These pair functions allow you to perform various operations on RDDs that contain key-value pairs, enabling advanced data transformation and aggregation tasks. They provide a powerful way to work with data in a distributed manner and leverage the parallel processing capabilities of Spark.
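As a closing sketch (the data and variable names are hypothetical), these functions are typically chained into a single pipeline, for example doubling quantities, totalling them per product, and sorting the report by key:

sales = sc.parallelize([("apples", 3), ("bananas", 5), ("apples", 2), ("oranges", 4)])
# Double each quantity, sum the totals per product, and sort by product name.
report = (sales.mapValues(lambda qty: qty * 2)
               .reduceByKey(lambda a, b: a + b)
               .sortByKey())
print(report.collect())

Output:
[('apples', 10), ('bananas', 10), ('oranges', 8)]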