In PySpark, there are multiple ways to filter data in a DataFrame. Here are some common approaches:

Using the filter() or where() methods:

The filter() and where() methods are interchangeable (where() is an alias for filter()) and keep only the rows that satisfy a given condition.

filtered_df = df.filter(df.column1 > 10)

filtered_df = df.where(df.column1 > 10)

In the above examples, the DataFrame df is filtered to include only the rows where the value of “column1” is greater than 10. The resulting DataFrame filtered_df will contain those rows.

Using SQL-like syntax with selectExpr():

The selectExpr() method selects columns using SQL expressions; combined with where(), which also accepts a SQL expression string, you can filter with SQL-like syntax.

filtered_df = df.selectExpr("*").where("column1 > 10")

In this example, selectExpr("*") simply selects every column; the actual filtering is done by where() with the SQL expression string “column1 > 10”.

Using the when() and otherwise() functions:

You can use the when() and otherwise() functions to build a conditional expression and then filter on its value.

from pyspark.sql.functions import when

filtered_df = df.filter(when(df.column1 > 10, "high").otherwise("low") == "high")

Here, when() returns “high” for rows where “column1” is greater than 10, and otherwise() returns “low” for the remaining rows; comparing the expression to “high” keeps only the rows where the condition is true. The resulting DataFrame filtered_df will contain only the rows where “column1” is greater than 10.

Using column expressions with functions:

PySpark provides a wide range of built-in functions that you can use to filter data based on specific conditions.

from pyspark.sql.functions import col

filtered_df = df.filter(col("column1").contains("abc"))

In this example, the DataFrame df is filtered to include only the rows where “column1” contains the substring “abc”. The col() function is used to reference the column within the filtering condition.

These are some of the ways to filter data in PySpark. You can choose the approach that best suits your requirements, whether it’s using the filter() or where() methods, SQL-like expressions, conditional functions, or column expressions with built-in functions. Filtering allows you to extract specific subsets of data from a DataFrame based on your analysis or processing needs.