Filtering data in PySpark allows you to extract specific rows from a DataFrame based on certain conditions. You can use the filter() or where() methods to apply filtering operations. Here’s an explanation of how to filter data in PySpark:
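The examples in this section assume a SparkSession and a small sample DataFrame named df; a minimal setup sketch might look like the following (the column names, types, and values are placeholders, and individual examples treat the columns as numeric or string as needed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filtering-example").getOrCreate()

# Illustrative sample data; None represents a null value in "column2"
data = [(5, "abc"), (12, None), (20, "abcdef"), (8, None)]
df = spark.createDataFrame(data, ["column1", "column2"])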

Using the filter() method:

The filter() method lets you specify the filtering condition as a Boolean column expression. It returns a new DataFrame containing only the rows that satisfy the specified condition.

filtered_df = df.filter(df.column1 > 10)

In the above example, the DataFrame df is filtered to include only the rows where the value of “column1” is greater than 10. The resulting DataFrame filtered_df will contain those rows.
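filter() also accepts the condition as a SQL expression string, which can be more concise for simple comparisons; for example:

# Same filter expressed as a SQL string
filtered_df = df.filter("column1 > 10")
filtered_df.show()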

Using the where() method:

The where() method is an alias for filter() and provides exactly the same functionality. It likewise accepts the filtering condition as a Boolean expression.

filtered_df = df.where(df.column2.isNull())

Here, the DataFrame df is filtered to include only the rows where the value of “column2” is null. The resulting DataFrame filtered_df will contain those rows.
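Because where() is simply an alias for filter(), the two methods can be used interchangeably; isNotNull() is the complementary check for non-null values. A short sketch using the sample DataFrame:

# where() and filter() are interchangeable
nulls_df = df.where(df.column2.isNull())
non_nulls_df = df.filter(df.column2.isNotNull())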

Combining multiple conditions:

You can combine multiple filtering conditions with the column operators & (and), | (or), and ~ (not). Note that Python's plain and, or, and not keywords do not work on Column expressions, and each condition should be wrapped in parentheses because these operators bind more tightly than comparisons.

filtered_df = df.filter((df.column1 > 10) & (df.column2.isNull()))

In this example, the DataFrame df is filtered to include only the rows where “column1” is greater than 10 and “column2” is null.
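The same pattern applies to | (or) and ~ (not); a brief sketch using the same sample columns:

# Rows where column1 > 10 OR column2 is null
either_df = df.filter((df.column1 > 10) | (df.column2.isNull()))

# Rows where column2 is NOT null
not_null_df = df.filter(~df.column2.isNull())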

Using functions for filtering:

PySpark provides a wide range of built-in functions in the pyspark.sql.functions module, and Column objects expose methods such as contains(); both can be used within the filtering condition.

from pyspark.sql.functions import col

filtered_df = df.filter(col("column1").contains("abc"))

Here, the DataFrame df is filtered to include only the rows where “column1” contains the substring “abc”. The col() function is used to reference the column within the filtering condition.
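Other Column methods follow the same pattern, for example isin(), startswith(), and like() (the values shown here are illustrative):

from pyspark.sql.functions import col

# Keep rows whose column1 value is in a fixed set
filtered_df = df.filter(col("column1").isin(10, 20, 30))

# Keep rows where column2 starts with "ab"
filtered_df = df.filter(col("column2").startswith("ab"))

# SQL LIKE pattern matching on column2
filtered_df = df.filter(col("column2").like("%abc%"))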