In brief Spark SQL window functions are special functions that allow you to perform calculations and aggregations on a specific group of rows within a table. Rather than applying aggregate function to the entire dataset. Furthermore, Window functions divide the data into partitions based on certain criteria, such as grouping by a column’s values.
Syntax: The syntax for using window functions in Spark SQL is as follows:
window_function() OVER (PARTITION BY partition_expression ORDER BY sort_expression rows_range_specification)
window_function(): The specific window function to apply (e.g., SUM, AVG, RANK, etc.).
PARTITION BY: Defines the partitioning scheme for the window. Rows with the same values in the partitioning expression will be grouped together.
ORDER BY: Specifies the sorting order within each partition.
rows_range_specification: Defines the window frame boundaries (e.g., ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).
Ranking Functions:
ROW_NUMBER(): Assigns a unique number to each row within the window.
RANK(): Computes the rank of each row within the window, with ties receiving the same rank.
DENSE_RANK(): Similar to RANK(), but ranks consecutive values without gaps.
SELECT employee_id, salary, RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank FROM employees
Aggregate Functions:
SUM(), AVG(), COUNT(), MIN(), MAX(): Computes the aggregate value over the window.
Example:
SELECT department_id, salary, AVG(salary) OVER (PARTITION BY department_id) AS avg_salary FROM employees
Lead and Lag Functions:
LEAD(column, offset): Accesses the value of a column in the next row within the window.
LAG(column, offset): Accesses the value of a column in the previous row within the window.
Example:
SELECT date, revenue, LAG(revenue, 1) OVER (ORDER BY date) AS previous_revenue FROM sales_data
Window Frame Specification:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: Includes all rows from the start of the partition to the current row.
ROWS BETWEEN x PRECEDING AND y FOLLOWING: Specifies a range of rows relative to the current row.
Example:
SELECT date, revenue, SUM(revenue) OVER (ORDER BY date ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS rolling_sum FROM sales_data
Statistical Functions:
STDDEV_POP(), VARIANCE(), CORR(), COVAR_POP(): Computes statistical metrics over the window.
Example:
SELECT department_id, salary, STDDEV_POP(salary) OVER (PARTITION BY department_id) AS salary_stddev FROM employees