In PySpark, the union() function combines two DataFrames vertically, appending the rows of one DataFrame to the other. It returns a new DataFrame containing all the rows from both inputs.

The syntax of the union() function:

df1.union(df2)

Usage of union() in PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create the first DataFrame
data1 = [("Alice", 25), ("Bob", 30)]
df1 = spark.createDataFrame(data1, ["Name", "Age"])

# Create the second DataFrame
data2 = [("Charlie", 35), ("Dave", 40)]
df2 = spark.createDataFrame(data2, ["Name", "Age"])

# Perform the union
df_union = df1.union(df2)
df_union.show()

Output:

+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
|   Dave| 40|
+-------+---+

In the example above, we have two DataFrames, df1 and df2, with the same schema. Calling union() on them concatenates the rows from both to create a new DataFrame, df_union, which contains all the rows from df1 and df2.

It’s important to note that union() resolves columns by position, not by name, so the two DataFrames must have the same number of columns with compatible data types, in the same order.

If the schemas do not line up — for example, the DataFrames have different numbers of columns — PySpark raises an AnalysisException.

The union() function is useful when you want to vertically concatenate two DataFrames with the same schema. It is commonly used to combine data from different sources.

Note that, unlike SQL's UNION, PySpark's union() does not remove duplicates — it behaves like SQL's UNION ALL. If the same row appears in both DataFrames, both copies appear in the result. To deduplicate, chain .distinct() after the union.

unionAll() Function:

The unionAll() function preserves all rows from both DataFrames, including duplicates — exactly like union(). If duplicate rows exist across the two DataFrames, every copy is included in the combined DataFrame.

# Using unionAll()
combined_df_all = df1.unionAll(df2)
combined_df_all.show()

Output:

+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
|   Dave| 40|
+-------+---+

Note: unionAll() has been deprecated since Spark 2.0; it is simply an alias for union(), with the same keep-duplicates behavior. There is no distinct parameter on union(); if you want SQL UNION semantics (duplicates removed), chain .distinct() onto the result:

combined_df_distinct = df1.union(df2).distinct()

That’s how you can use the union() and unionAll() functions in PySpark to combine the rows of two DataFrames.