In addition to the basic join operations (inner join, left join, right join, and full outer join), PySpark provides advanced join operations that offer more flexibility and control over the join process. Here are some advanced join operations in PySpark:

Cross Join:

A cross join, also known as a Cartesian join, combines every row from the first DataFrame with every row from the second DataFrame. It does not require a common column for the join.

cross_df = df1.crossJoin(df2)

In this example, df1 and df2 are cross-joined, resulting in the DataFrame cross_df containing all possible combinations of rows from both DataFrames.
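As a minimal runnable sketch (the DataFrames and column names here are made up for illustration), note that a cross join multiplies row counts, so the result can grow large quickly:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data with no shared column
df1 = spark.createDataFrame([("red",), ("blue",)], ["color"])
df2 = spark.createDataFrame([("S",), ("M",), ("L",)], ["size"])

# Every color paired with every size: 2 x 3 = 6 rows
cross_df = df1.crossJoin(df2)
cross_df.show()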

Self-Join:

A self-join is a join operation where a DataFrame is joined with itself. It is useful when you want to compare or analyze data within the same DataFrame using different aliases.

from pyspark.sql.functions import col

self_join_df = df.alias("df1").join(df.alias("df2"), col("df1.common_column") == col("df2.common_column"), "inner")

Here, the DataFrame df is self-joined under the aliases “df1” and “df2”, with col() used to reference the “common_column” through each alias so the two sides of the join can be distinguished. The resulting DataFrame self_join_df will contain the matching rows from the self-join operation.
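As a concrete sketch, consider a hypothetical employees DataFrame in which each row carries a manager_id; self-joining it pairs each employee with their manager (the data and column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: (id, name, manager_id)
df = spark.createDataFrame(
    [(1, "Ana", None), (2, "Ben", 1), (3, "Cara", 1)],
    ["id", "name", "manager_id"],
)

# Match each employee's manager_id against a manager's id
pairs = (
    df.alias("emp")
    .join(df.alias("mgr"), col("emp.manager_id") == col("mgr.id"), "inner")
    .select(col("emp.name").alias("employee"), col("mgr.name").alias("manager"))
)
pairs.show()  # Ben and Cara are each paired with Ana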

Join on Multiple Columns:

You can perform joins based on multiple columns by specifying a list of column names.

joined_df = df1.join(df2, ["column1", "column2"], "inner")

In this example, df1 and df2 are joined based on the columns “column1” and “column2”. The resulting DataFrame joined_df will include the rows where both columns have matching values.
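A minimal sketch with hypothetical data; note that passing the join columns as a list keeps a single copy of each join column in the output, which avoids duplicate column names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data keyed by (column1, column2)
df1 = spark.createDataFrame([("a", 1, "x"), ("b", 2, "y")], ["column1", "column2", "left_val"])
df2 = spark.createDataFrame([("a", 1, "p"), ("c", 3, "q")], ["column1", "column2", "right_val"])

joined_df = df1.join(df2, ["column1", "column2"], "inner")
joined_df.show()  # only ("a", 1) appears in both, so one row survives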

Non-Equi Join:

A non-equi join allows you to join DataFrames using non-equality conditions, such as range or inequality comparisons.

non_equi_df = df1.join(df2, (df1.id == df2.id) & (df1.date <= df2.date), "inner")

Here, df1 and df2 are joined on the condition that the “id” columns are equal and the “date” column in df1 is less than or equal to the “date” column in df2. The resulting DataFrame non_equi_df will contain the rows that satisfy this combined condition.
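A runnable sketch with the same shape of condition, using made-up ids and ISO-format date strings (which compare correctly as strings):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: events in df1, cutoff dates in df2, keyed by id
df1 = spark.createDataFrame([(1, "2024-01-10"), (2, "2024-03-05")], ["id", "date"])
df2 = spark.createDataFrame([(1, "2024-02-01"), (2, "2024-01-01")], ["id", "date"])

# Equality on id combined with a range condition on date
non_equi_df = df1.join(df2, (df1.id == df2.id) & (df1.date <= df2.date), "inner")
non_equi_df.show()  # keeps id 1 (2024-01-10 <= 2024-02-01); drops id 2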

Anti Join:

An anti join is a type of join operation that returns only the rows from the left DataFrame that do not have a match in the right DataFrame based on a specified condition. In other words, it filters out the common rows and keeps only the non-matching rows.

Here’s an example of performing an anti join in PySpark:

anti_join_df = df1.join(df2, df1.common_column == df2.common_column, "left_anti")

In this example, df1 and df2 are anti-joined based on the “common_column” using the “left_anti” join type. The resulting DataFrame anti_join_df will contain only the rows from df1 that do not have a match in df2.

This type of join can be useful in scenarios where you want to find the non-matching records between two datasets or perform data filtering based on the absence of certain values in the right DataFrame.

Note that the “left_anti” join type retains the non-matching rows from the left DataFrame (df1 in this case). PySpark does not provide a “right_anti” join type; to keep the non-matching rows from the right DataFrame instead, swap the order of the DataFrames and use “left_anti” again.
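A minimal sketch with hypothetical data, including the swapped form for keeping non-matches from the right-hand side:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data sharing a common_column
df1 = spark.createDataFrame([(1,), (2,), (3,)], ["common_column"])
df2 = spark.createDataFrame([(2,), (4,)], ["common_column"])

# Rows of df1 with no match in df2 -> values 1 and 3
anti_join_df = df1.join(df2, df1.common_column == df2.common_column, "left_anti")
anti_join_df.show()

# To keep the non-matching rows from df2 instead, swap the DataFrames
right_side_anti = df2.join(df1, df2.common_column == df1.common_column, "left_anti")
right_side_anti.show()  # -> value 4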