Download Apache Spark: Go to the official Apache Spark website (https://spark.apache.org/downloads.html), select the latest release and a pre-built package type, and download the .tgz archive.
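
If you prefer the command line, the archive can also be fetched directly with wget; the version and mirror below are only an example and should be adjusted to match the release you selected on the download page:

wget https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz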

Extract Spark: Once downloaded, extract the archive to a location of your choice on your Linux machine.
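
For example, assuming the archive name from the download step and /opt/spark as the install location (both choices are just illustrative):

tar -xzf spark-3.5.1-bin-hadoop3.tgz
sudo mv spark-3.5.1-bin-hadoop3 /opt/spark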

Configure Environment Variables:

Open Terminal and run the following command: nano ~/.bashrc

Add the following lines to the end of the file, replacing /path/to/spark with the directory where you extracted Spark:

export SPARK_HOME=/path/to/spark
export PYSPARK_PYTHON=/usr/bin/python3
export PATH=$SPARK_HOME/bin:$PATH

Save the file (press Ctrl + X, then Y, and Enter).
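
Alternatively, if you prefer not to edit the file interactively, the same lines can be appended in one command (here using the /opt/spark path assumed in the extraction example; substitute your own location):

cat <<'EOF' >> ~/.bashrc
export SPARK_HOME=/opt/spark
export PYSPARK_PYTHON=/usr/bin/python3
export PATH=$SPARK_HOME/bin:$PATH
EOF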

Refresh the Environment: Run the following command in Terminal to apply the changes to your current session: source ~/.bashrc
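
After sourcing, a quick sanity check is to print the variable back and confirm that the pyspark launcher is on your PATH:

echo $SPARK_HOME
which pyspark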

Verify the Setup: Open a new Terminal window and confirm that the pyspark command launches the PySpark shell without errors; the steps below walk through this check in more detail.

Testing the PySpark Installation:

  • Open a new terminal window.
  • Run the following command to start the PySpark shell:
pyspark
  • If everything is set up correctly, the PySpark shell should start and present a Python prompt (>>>).
  • You can then test PySpark by running a few simple commands, such as creating an RDD or a DataFrame and performing basic operations on it, as in the sketch below.
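
The following is a minimal smoke test you could paste at the >>> prompt. It relies only on the spark (SparkSession) and sc (SparkContext) objects that the PySpark shell creates automatically; the sample data is made up for illustration:

# RDD check: distribute a small list and sum it (expected result: 15)
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.sum())

# DataFrame check: build a two-row DataFrame, display it, and filter it
df = spark.createDataFrame([("alice", 1), ("bob", 2)], ["name", "id"])
df.show()
df.filter(df.id > 1).show()

If both snippets run and print their results without a traceback, the installation is working.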