In PySpark, a broadcast variable is a read-only variable that can be efficiently shared across all nodes of a cluster.
It allows data to be sent to the worker nodes only once and then reused across multiple tasks, thereby minimizing network overhead and improving performance.
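The basic API for creating one is SparkContext.broadcast(), and tasks read the shared copy through its .value attribute. The snippet below is a minimal sketch of that API; the lookup dictionary, column names, and UDF are made up purely for illustration. The DataFrame-level broadcast() function used in the example that follows builds on the same mechanism.

# Minimal sketch of a broadcast variable (illustrative names, not part of the example below)
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Small lookup table that many tasks need
professions = {"John": "Engineer", "Jane": "Doctor", "Dave": "Teacher"}

# Ship the dictionary to every worker node once as a read-only broadcast variable
bc_professions = spark.sparkContext.broadcast(professions)

# Tasks read the shared copy through .value instead of shipping the dictionary with each task
lookup_profession = udf(lambda name: bc_professions.value.get(name, "Unknown"), StringType())

people = spark.createDataFrame([("John", 25), ("Jane", 30)], ["Name", "Age"])
people.withColumn("Profession", lookup_profession("Name")).show()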
Here’s an example to illustrate the usage of broadcast variables in PySpark:
Define a broadcast variable:
# Importing the necessary PySpark modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# Creating a SparkSession
spark = SparkSession.builder.getOrCreate()

# Creating a sample DataFrame
data = [("John", 25), ("Jane", 30), ("Dave", 35)]
df1 = spark.createDataFrame(data, ["Name", "Age"])

# Creating a lookup DataFrame
lookup_data = [("John", "Engineer"), ("Jane", "Doctor"), ("Dave", "Teacher")]
df2 = spark.createDataFrame(lookup_data, ["Name", "Profession"])

# Broadcasting the lookup DataFrame
broadcast_df2 = broadcast(df2)

# Performing a join operation using the broadcast variable
result_df = df1.join(broadcast_df2, "Name", "inner")

# Displaying the result
result_df.show()
+----+---+----------+
|Name|Age|Profession|
+----+---+----------+
|John| 25|  Engineer|
|Jane| 30|    Doctor|
|Dave| 35|   Teacher|
+----+---+----------+
In the example above, we have two DataFrames: df1 and df2. We want to join them on the “Name” column. By calling the broadcast() function on df2, we mark it for broadcasting and store the result as broadcast_df2.
The broadcast variable is then used in the join operation (df1.join(broadcast_df2, "Name", "inner")). Because df2 is broadcast, a full copy of it is shipped to every worker node once, so each node can perform the join locally without shuffling df1 across the network.
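To verify that the broadcast actually took effect in the example above, you can inspect the physical plan of result_df; a BroadcastHashJoin (or BroadcastExchange) operator indicates that df2 was sent to the executors rather than shuffled:

# Print the physical plan; look for BroadcastHashJoin / BroadcastExchange operators
result_df.explain()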
Broadcasting the smaller DataFrame significantly reduces the amount of data transferred over the network, which improves performance and shortens execution time, especially when a large DataFrame is joined against a much smaller lookup table.
It’s important to note that broadcast variables should be used with caution and only for small or medium-sized DataFrames: the broadcast copy must fit in the memory of every executor, so broadcasting a large DataFrame can lead to out-of-memory errors.
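Related to this size limit, Spark can also broadcast small tables automatically during joins when their estimated size falls below the spark.sql.autoBroadcastJoinThreshold setting (10 MB by default). The snippet below is a sketch of how that threshold can be adjusted or disabled on the SparkSession from the example:

# Raise the automatic broadcast threshold to about 50 MB (the value is in bytes)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

# Or disable automatic broadcast joins entirely; explicit broadcast() hints still apply
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)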