Here are some advanced aggregate functions in PySpark with examples:

groupBy() and agg():

The groupBy() function is used to group data based on one or more columns, and the agg() function is used to perform aggregations on those groups.

# Note: importing sum and avg by name shadows Python's built-in sum();
# many codebases instead use `from pyspark.sql import functions as F`.
from pyspark.sql.functions import sum, avg

group_df = df.groupBy("category").agg(sum("quantity").alias("total_quantity"), avg("price").alias("avg_price"))

In this example, the DataFrame df is grouped by the “category” column, and the sum of “quantity” and average of “price” are calculated for each group. The results are stored in the DataFrame group_df with the aliases “total_quantity” and “avg_price”.

pivot():

The pivot() function is used to create a pivot table, where values from one column are transformed into multiple columns.

pivot_df = df.groupBy("category").pivot("year").sum("sales")

Here, the DataFrame df is first grouped by the “category” column, and then a pivot table is created with the “year” column values as separate columns. The sum of “sales” is calculated for each category and year combination, resulting in the DataFrame pivot_df.

Window Functions:

Window functions allow you to perform calculations on a specific window of rows within a DataFrame.

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window_spec = Window.partitionBy("category").orderBy("price")
ranked_df = df.withColumn("rank", row_number().over(window_spec))

In this example, a window specification is defined that partitions the data by the “category” column and orders each partition by the “price” column. The row_number() function assigns a sequential number (1, 2, 3, …) to each row within its partition; unlike rank(), tied values do not share a number. The result is the DataFrame ranked_df with an additional “rank” column.

collect_list() and collect_set():

The collect_list() and collect_set() functions are used to aggregate a column’s values into an array per group: collect_list() preserves duplicates, while collect_set() returns only distinct values.

from pyspark.sql.functions import collect_list, collect_set

list_df = df.groupBy("category").agg(collect_list("product").alias("products_list"))
set_df = df.groupBy("category").agg(collect_set("product").alias("products_set"))

In these examples, the DataFrame df is grouped by the “category” column, and the “product” values for each category are collected into a list and a set, respectively. collect_list() keeps every occurrence, including duplicates, whereas collect_set() drops duplicates; neither guarantees element order. The results are stored in the DataFrames list_df and set_df.

These are some advanced aggregate functions in PySpark that provide powerful capabilities for data summarization and analysis. They allow you to perform complex aggregations, create pivot tables, calculate rankings within windows, and aggregate values into lists or sets.