Working with date data in PySpark relies on functions from the pyspark.sql.functions module. These functions let you generate current dates and timestamps, extract individual date components, format dates as strings, and perform date arithmetic.
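The snippets that follow assume a DataFrame df with a DateType column named date_column; both names are placeholders. A minimal way to build such a DataFrame for experimentation:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

spark = SparkSession.builder.getOrCreate()

# Sample data with dates stored as strings (values are illustrative).
df = spark.createDataFrame(
    [("2024-01-15",), ("2024-03-02",)],
    ["date_str"],
)

# Convert the string column into a proper DateType column.
df = df.withColumn("date_column", to_date("date_str", "yyyy-MM-dd"))

With that in place, here are some commonly used date-related functions in PySpark: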

current_date():

Returns the current date as a date column.

from pyspark.sql.functions import current_date

df = df.withColumn("current_date", current_date())

This adds a new column named “current_date” to the DataFrame df containing the current date for each row. The value is resolved once per query, so every row receives the same date.

current_timestamp():

Returns the current timestamp as a timestamp column.

from pyspark.sql.functions import current_timestamp

df = df.withColumn("current_timestamp", current_timestamp())

This adds a new column named “current_timestamp” to the DataFrame df containing the current timestamp for each row.
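Both functions return proper date and timestamp types rather than strings, which is easy to confirm from the schema:

df.select(current_date().alias("d"), current_timestamp().alias("ts")).printSchema()
# d prints as a date column and ts as a timestamp column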

date_format():

Formats a date column as a string using a specified date format.

from pyspark.sql.functions import date_format

df = df.withColumn("formatted_date", date_format(df.date_column, "yyyy-MM-dd"))

This adds a new column named “formatted_date” to the DataFrame df with the date rendered as a string in the given pattern; the pattern letters follow Spark's datetime pattern syntax (similar to Java's).
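A couple of further patterns, sketched against the sample df built above:

df.select(
    date_format("date_column", "dd/MM/yyyy").alias("day_first"),
    date_format("date_column", "MMMM yyyy").alias("month_name"),
).show()

# For 2024-01-15 this produces day_first = 15/01/2024 and month_name = January 2024.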

year(), month(), dayofmonth(), hour(), minute(), second():

Extracts the corresponding component as an integer. year, month, and dayofmonth accept a date or timestamp column; hour, minute, and second require a timestamp column.

from pyspark.sql.functions import year, month, dayofmonth

df = df.withColumn("year", year(df.date_column))
df = df.withColumn("month", month(df.date_column))
df = df.withColumn("day", dayofmonth(df.date_column))

These functions extract the year, month, and day-of-month components from the date column, respectively. (A day() alias also exists in recent Spark releases, but dayofmonth works across versions.)
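The time-of-day extractors work the same way but need a timestamp column; a short sketch with a hypothetical event_ts column:

from pyspark.sql.functions import to_timestamp, hour, minute, second

ts_df = spark.createDataFrame([("2024-01-15 13:45:30",)], ["ts_str"])
ts_df = ts_df.withColumn("event_ts", to_timestamp("ts_str", "yyyy-MM-dd HH:mm:ss"))

ts_df = ts_df.withColumn("hour", hour("event_ts"))      # 13
ts_df = ts_df.withColumn("minute", minute("event_ts"))  # 45
ts_df = ts_df.withColumn("second", second("event_ts"))  # 30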

date_add(), date_sub():

date_add adds a specified number of days to a date column; date_sub subtracts them.

from pyspark.sql.functions import date_add, date_sub

df = df.withColumn("next_week", date_add(df.date_column, 7))
df = df.withColumn("last_week", date_sub(df.date_column, 7))

These functions add or subtract the specified number of days from the date column, creating new columns with the modified dates.
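As a concrete check against the sample df built earlier (output abbreviated):

df.select("date_column", date_add("date_column", 7).alias("next_week")).show()

# +-----------+----------+
# |date_column| next_week|
# +-----------+----------+
# | 2024-01-15|2024-01-22|
# | 2024-03-02|2024-03-09|
# +-----------+----------+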

datediff():

Calculates the difference in days between two date columns. The first argument is treated as the end date and the second as the start date.

from pyspark.sql.functions import datediff

df = df.withColumn("days_diff", datediff(df.date_column1, df.date_column2))

This computes date_column1 minus date_column2 in days and stores the result in a new column; the value is negative when date_column1 is the earlier date.
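A quick sketch with literal dates to confirm the argument order:

from pyspark.sql.functions import datediff, to_date, lit

df.select(
    datediff(to_date(lit("2024-01-22")), to_date(lit("2024-01-15"))).alias("diff")
).show()

# diff = 7; swapping the two arguments would yield -7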

These are just a few examples of date-related functions available in PySpark.

You can explore more functions in the official PySpark documentation (https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions) to manipulate and analyze date data efficiently.