In PySpark, aggregate functions compute summary statistics over the rows of a DataFrame, such as counts, sums, averages, minimums, and maximums. They live in the pyspark.sql.functions module. Here are some commonly used aggregate functions, along with examples:
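The snippets below assume an active SparkSession and a DataFrame named df. As a minimal, hypothetical setup (the column names and values are placeholders chosen to match the examples):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aggregation-examples").getOrCreate()

# Placeholder data; note the null in "column1" and the duplicate in "column5".
df = spark.createDataFrame(
    [(1, 10, 2.5, 5, 7), (None, 20, 3.5, 3, 7), (3, 30, 4.5, 9, 8)],
    ["column1", "column2", "column3", "column4", "column5"],
)

Keep in mind that select() is a lazy transformation: each snippet below builds a one-row result DataFrame, and the actual value appears only when you call an action such as show() or collect().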

count():

The count() function returns the number of non-null values in a column; pass "*" to count all rows.

from pyspark.sql.functions import count

count_df = df.select(count("column1"))

This example calculates the count of non-null values in “column1” and stores the result in the DataFrame count_df.
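To contrast counting a column with counting rows, a quick sketch:

from pyspark.sql.functions import count

# count("column1") skips nulls; count("*") counts every row.
df.select(count("column1"), count("*")).show()

# DataFrame.count() is an action that returns the row count directly as an int.
total_rows = df.count()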

sum():

The sum() function calculates the sum of a numerical column in a DataFrame.

# Alias the import so it does not shadow Python's built-in sum()
from pyspark.sql.functions import sum as spark_sum

sum_df = df.select(spark_sum("column2"))

Here, the sum of values in “column2” is computed and stored in the DataFrame sum_df.
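By default the result column is named sum(column2); alias() gives it a clearer name. A brief sketch:

from pyspark.sql.functions import sum as spark_sum

sum_df = df.select(spark_sum("column2").alias("total_column2"))
sum_df.show()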

avg():

The avg() function computes the average value of a numerical column in a DataFrame.

from pyspark.sql.functions import avg

avg_df = df.select(avg("column3"))

This example calculates the average value of “column3” and stores it in the DataFrame avg_df.
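avg() is also available under the alias mean(), and like the other aggregates it can be computed per group with groupBy(). A sketch, using "column1" as a placeholder grouping key:

from pyspark.sql.functions import avg

# Average of "column3" within each group of "column1" values.
df.groupBy("column1").agg(avg("column3").alias("avg_column3")).show()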

min() and max():

The min() and max() functions return the minimum and maximum values of a column, respectively.

# Alias the imports so they do not shadow Python's built-in min() and max()
from pyspark.sql.functions import min as spark_min, max as spark_max

min_df = df.select(spark_min("column4"))
max_df = df.select(spark_max("column4"))

These examples calculate the minimum and maximum values of “column4” and store them in the DataFrames min_df and max_df, respectively.
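Both extremes can also be computed in a single pass by selecting them together; a brief sketch:

from pyspark.sql.functions import min as spark_min, max as spark_max

min_max_df = df.select(
    spark_min("column4").alias("min_column4"),
    spark_max("column4").alias("max_column4"),
)
min_max_df.show()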

sum_distinct():

The sum_distinct() function calculates the sum of the distinct values in a column. In Spark versions before 3.2 it was named sumDistinct(), which still works but is deprecated.

from pyspark.sql.functions import sum_distinct

distinct_sum_df = df.select(sum_distinct("column5"))

Here, the sum of distinct values in “column5” is computed and stored in the DataFrame distinct_sum_df.
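A closely related aggregate counts distinct values instead of summing them. A brief sketch, using the Spark 3.2+ name:

from pyspark.sql.functions import count_distinct  # countDistinct() in Spark < 3.2

distinct_count_df = df.select(count_distinct("column5"))
distinct_count_df.show()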

These are just a few of the aggregate functions PySpark provides; pyspark.sql.functions also includes stddev(), variance(), first(), collect_list(), and many more. Aggregate functions help you summarize and derive insights from your data, and they are most often combined with groupBy() and agg() to compute several metrics per group in one pass, as in the sketch below.
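A closing sketch, assuming the same placeholder df as above:

from pyspark.sql import functions as F

# Several aggregates computed in one pass, grouped by a placeholder key column.
summary_df = df.groupBy("column1").agg(
    F.count("*").alias("rows"),
    F.sum("column2").alias("total_column2"),
    F.avg("column3").alias("avg_column3"),
    F.min("column4").alias("min_column4"),
    F.max("column4").alias("max_column4"),
)
summary_df.show()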