In PySpark, the distinct() function is used to retrieve the unique rows of a DataFrame. It returns a new DataFrame containing only the distinct rows, where uniqueness is determined across all columns of the original DataFrame.

The syntax of the distinct() function:

df.distinct()

Usage of distinct() in PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create a DataFrame
data = [("Alice", 25), ("Bob", 30), ("Alice", 25)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Retrieve distinct rows
df_distinct = df.distinct()
df_distinct.show()

Output:

+-----+---+
| Name|Age|
+-----+---+
|  Bob| 30|
|Alice| 25|
+-----+---+

In the example, we have a DataFrame with two columns, “Name” and “Age”. Applying distinct() removes the duplicate row (“Alice”, 25), and the resulting DataFrame, df_distinct, contains only the unique rows.

It’s important to note that distinct() considers all columns of the DataFrame when determining uniqueness.

To deduplicate based on specific columns only, you can use the dropDuplicates() function and pass it a list of column names. (Called with no arguments, dropDuplicates() behaves the same as distinct().)

# Drop duplicates based on specific columns
df_distinct = df.dropDuplicates(["Name"])

The distinct() function is useful when you want to eliminate duplicate rows from your DataFrame and work with only unique records. It can help in data cleaning, deduplication, and exploratory data analysis tasks.