Resilient Distributed Datasets (RDDs):

RDDs are the core data structure in PySpark. They represent an immutable, distributed collection of objects that can be processed in parallel across a cluster. RDDs are fault-tolerant: each RDD records the lineage of transformations used to build it, so lost partitions can be recomputed automatically if a node fails.

Example:

#Creating a SparkSession (the entry point to PySpark) and an RDD from a list of numbers
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PySpark data structures").getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

#Applying transformations on RDD
squared_rdd = rdd.map(lambda x: x ** 2)

#Performing an action on RDD
sum_of_squares = squared_rdd.reduce(lambda x, y: x + y)
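
A note on evaluation: map is a transformation, which is lazy, while reduce is an action, which triggers the actual computation. A minimal check of the results, continuing from the RDD above:

#Actions force evaluation of the lazily built chain of transformations
print(squared_rdd.collect())  #[1, 4, 9, 16, 25]
print(sum_of_squares)  #55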

DataFrames:

DataFrames are distributed collections of structured data organized into named columns. They provide a higher-level abstraction than RDDs, and because they carry a schema, Spark's Catalyst optimizer can plan and optimize queries on them, which generally yields better performance. DataFrames in PySpark are similar to tables in a relational database or data frames in pandas.

Example:

#Creating a DataFrame from a list of tuples
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["Name", "Age"])

#Performing operations on DataFrame
filtered_df = df.filter(df.Age > 30)
average_age = df.agg({"Age": "avg"}).collect()[0][0]
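
Because DataFrames carry a schema, they can also be queried with SQL by registering them as a temporary view. A short sketch, continuing from the df above (the view name "people" is just an illustrative choice):

#Registering the DataFrame as a temporary view and querying it with SQL
df.createOrReplaceTempView("people")
older_than_30 = spark.sql("SELECT Name, Age FROM people WHERE Age > 30")
older_than_30.show()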

Datasets:

Datasets are a strongly-typed API introduced in Spark 1.6 that combines the compile-time type safety of RDDs with the Catalyst optimizations of DataFrames. The typed Dataset API is available only in Scala and Java; because Python is dynamically typed, PySpark does not expose it. In PySpark, a DataFrame plays the role of an untyped Dataset of Row objects, so the equivalent operations are written against the DataFrame API.

Example:

#PySpark has no typed Dataset API, so the equivalent code uses a DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["Name", "Age"])

#Filtering and aggregating, as a typed Dataset would allow in Scala or Java
filtered_df = df.filter(df.Age > 30)
average_age = df.selectExpr("avg(Age)").collect()[0][0]
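
Since a PySpark DataFrame is ultimately backed by an RDD of Row objects, you can move between the two representations when needed. A minimal sketch, continuing from the df above:

#Dropping down to the underlying RDD of Row objects
from pyspark.sql import Row

names = df.rdd.map(lambda row: row.Name).collect()

#Building a DataFrame back up from an RDD of Rows
people_rdd = spark.sparkContext.parallelize([Row(Name="Dave", Age=40)])
people_df = spark.createDataFrame(people_rdd)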

These data structures in PySpark enable distributed and parallel processing of data, allowing you to perform various transformations and actions on large-scale datasets. They provide the foundation for data manipulation, analysis, and machine learning tasks in PySpark.