Sorting data in a PySpark DataFrame can be done using the sort() or orderBy() methods. Both sort the DataFrame based on one or more columns.
Here’s an example of sorting data in ascending order using the sort() method:
sorted_df = df.sort("column1")
In this example, the DataFrame df is sorted in ascending order based on “column1”. The resulting DataFrame sorted_df will have the rows sorted in ascending order of “column1”.
To sort the data in descending order, you can use the desc() function:
from pyspark.sql.functions import desc
sorted_df = df.sort(desc("column1"))
In this case, the DataFrame df is sorted in descending order based on “column1”.
You can also sort the data based on multiple columns by passing them as separate arguments to the sort() or orderBy() methods:
sorted_df = df.sort("column1", "column2")
This example sorts the DataFrame df first by “column1” and then by “column2”, both in ascending order.
You can also specify the sort order for each column individually:
sorted_df = df.sort(df.column1.asc(), df.column2.desc())
In this example, “column1” is sorted in ascending order and “column2” in descending order.
Both sort() and orderBy() provide flexibility in sorting data by one or more columns, in ascending or descending order. The two methods are interchangeable — orderBy() is an alias for sort() — so you can choose whichever reads more naturally in your code.