In PySpark, the udf() function creates a user-defined function (UDF): a custom function that can be applied to DataFrame columns. It lets you plug in your own logic wherever the built-in Spark SQL functions fall short.

The udf() function takes two arguments: the Python function to wrap and its return type (a DataType, which defaults to StringType). It returns a wrapped, callable UDF; calling that UDF on one or more columns then produces a Column object representing the result.
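As a minimal sketch of the same API, udf() can also be applied as a decorator, with the return type passed as the returnType keyword argument (shout is a hypothetical example function):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def shout(name):
    # The decorated function becomes a UDF directly
    return name.upper()

# Calling it on a column yields a Column, e.g. df.withColumn("Upper", shout(df["Name"]))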

How to use udf() in PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Create a DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Define a custom function using udf()
def add_greeting(name):
    return "Hello, " + name

# Register the custom function as a UDF
greeting_udf = udf(add_greeting, StringType())

# Apply the UDF to a DataFrame column
df_with_greeting = df.withColumn("Greeting", greeting_udf(df["Name"]))

# Show the DataFrame with the new column
df_with_greeting.show()

Output:

+-------+---+--------------+
|   Name|Age|      Greeting|
+-------+---+--------------+
|  Alice| 25|  Hello, Alice|
|    Bob| 30|    Hello, Bob|
|Charlie| 35|Hello, Charlie|
+-------+---+--------------+

In the example above, we have a DataFrame, df, with two columns, “Name” and “Age”. We define a custom function, add_greeting(), which takes a name as input and returns a greeting string, then register it as a UDF with udf(), declaring a return type of StringType().
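The declared return type matters: Spark uses it to interpret the values the Python function produces, and a mismatch typically yields null values rather than an error. As a minimal sketch (is_adult is a hypothetical example), a UDF returning a boolean would be declared with BooleanType():

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def is_adult(age):
    return age >= 18

# Declare the return type to match what the function actually returns
is_adult_udf = udf(is_adult, BooleanType())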

Then, we apply the UDF to the “Name” column of the DataFrame using the withColumn() method, creating a new column called “Greeting”. Finally, we display the result with show().
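A UDF is not limited to withColumn(). As a sketch building on the example above, you can also use it inside select(), or register it by name with spark.udf.register() so it can be called from Spark SQL (the names "greet" and "people" are hypothetical):

# Use the UDF inside select()
df.select(df["Name"], greeting_udf(df["Name"]).alias("Greeting")).show()

# Register under a name so it is callable from SQL
spark.udf.register("greet", add_greeting, StringType())
df.createOrReplaceTempView("people")
spark.sql("SELECT Name, greet(Name) AS Greeting FROM people").show()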

By using udf(), you can extend PySpark with your own functions and apply them to DataFrame columns, enabling custom transformations and calculations that the built-in Spark SQL functions do not cover. Keep in mind, though, that Python UDFs are generally slower than built-in functions, because each row must be serialized between the JVM and the Python worker processes; prefer a built-in function when one exists.
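One practical caveat worth illustrating: null column values arrive in the Python function as None, so a function like add_greeting() would raise a TypeError on a null “Name”. A minimal null-safe sketch, reusing the imports from the example above:

def add_greeting_safe(name):
    # Null column values arrive in Python as None; guard against them
    if name is None:
        return None
    return "Hello, " + name

safe_greeting_udf = udf(add_greeting_safe, StringType())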