How to Use withColumn() Function in PySpark

In PySpark, the withColumn() function adds a new column to a DataFrame or replaces an existing one. It lets you transform data by applying expressions or functions to the existing columns.

The syntax of the withColumn() function, where colName is the column name as a string and col is a Column expression:

df.withColumn(colName, col)

Adding a new column with a constant value:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# Create a DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Add a new column with a constant value
df_new = df.withColumn("Gender", lit("Female"))
df_new.show()

Output:

+-------+---+------+
|   Name|Age|Gender|
+-------+---+------+
|  Alice| 25|Female|
|    Bob| 30|Female|
|Charlie| 35|Female|
+-------+---+------+

Transforming existing columns:

from pyspark.sql.functions import col

# Add a new column by applying a transformation to an existing column
df_new = df.withColumn("Age_in_5_years", col("Age") + 5)
df_new.show()

Output:

+-------+---+--------------+
|   Name|Age|Age_in_5_years|
+-------+---+--------------+
|  Alice| 25|            30|
|    Bob| 30|            35|
|Charlie| 35|            40|
+-------+---+--------------+

Replacing an existing column:

from pyspark.sql.functions import col, when

# Replace the Age column based on a condition
df_new = df.withColumn("Age", when(col("Age") < 30, "Young").otherwise("Old"))
df_new.show()

Output:

+-------+-----+
|   Name|  Age|
+-------+-----+
|  Alice|Young|
|    Bob|  Old|
|Charlie|  Old|
+-------+-----+

In the examples above, we create a DataFrame and then use withColumn() to add a new column, derive a column from an existing one, and replace an existing column.

The first argument is the column name as a string, and the second is the Column expression that produces its values. withColumn() returns the modified DataFrame, so we can chain further operations on the result.

To rename an existing column, use the withColumnRenamed() function instead.

Note that the withColumn() function does not modify the original DataFrame; it creates a new DataFrame with the desired changes.
