In brief Spark SQL window functions are special functions that allow you to perform calculations and aggregations on a specific group of rows within a table. Rather than applying aggregate function to the entire dataset. Furthermore, Window functions divide the data into partitions based on certain criteria, such as grouping by a column’s values.

Syntax: The syntax for using window functions in Spark SQL is as follows:

window_function() OVER (PARTITION BY partition_expression ORDER BY sort_expression rows_range_specification)

window_function(): The specific window function to apply (e.g., SUM, AVG, RANK, etc.).
PARTITION BY: Defines the partitioning scheme for the window. Rows with the same values in the partitioning expression will be grouped together.
ORDER BY: Specifies the sorting order within each partition.
rows_range_specification: Defines the window frame boundaries (e.g., ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).

Ranking Functions:

ROW_NUMBER(): Assigns a unique number to each row within the window.
RANK(): Computes the rank of each row within the window, with ties receiving the same rank.
DENSE_RANK(): Similar to RANK(), but ranks consecutive values without gaps.

SELECT employee_id, salary, RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank
FROM employees

Aggregate Functions:

SUM(), AVG(), COUNT(), MIN(), MAX(): Computes the aggregate value over the window.

SELECT department_id, salary, AVG(salary) OVER (PARTITION BY department_id) AS avg_salary
FROM employees

Lead and Lag Functions:

LEAD(column, offset): Accesses the value of a column in the next row within the window.
LAG(column, offset): Accesses the value of a column in the previous row within the window.

SELECT date, revenue, LAG(revenue, 1) OVER (ORDER BY date) AS previous_revenue
FROM sales_data

Window Frame Specification:

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: Includes all rows from the start of the partition to the current row.
ROWS BETWEEN x PRECEDING AND y FOLLOWING: Specifies a range of rows relative to the current row.

SELECT date, revenue, SUM(revenue) OVER (ORDER BY date ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS rolling_sum
FROM sales_data

Statistical Functions:

STDDEV_POP(), VARIANCE(), CORR(), COVAR_POP(): Computes statistical metrics over the window.

SELECT department_id, salary, STDDEV_POP(salary) OVER (PARTITION BY department_id) AS salary_stddev
FROM employees