Download Apache Spark: Go to the official Apache Spark website (https://spark.apache.org/downloads.html) and download the latest version of Spark.
Extract Spark: Once downloaded, extract the Spark package to a desired location on your Linux machine.
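For example, assuming the downloaded archive is named spark-3.5.1-bin-hadoop3.tgz (the exact name depends on the version and package you selected) and you want Spark to live under /opt/spark, you could run:
tar -xzf spark-3.5.1-bin-hadoop3.tgz
sudo mv spark-3.5.1-bin-hadoop3 /opt/spark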
Configure Environment Variables:
Open Terminal and run the following command: nano ~/.bashrc
Add the following lines to the file, replacing /path/to/spark with the directory where you extracted Spark:
export SPARK_HOME=/path/to/spark
export PYSPARK_PYTHON=/usr/bin/python3
export PATH=$SPARK_HOME/bin:$PATH
Save the file (press Ctrl + X, then Y, and Enter).
Refresh the Environment: Run the following command in Terminal to apply the changes to your current session: source ~/.bashrc
Verify the Setup: Open a new Terminal window and run pyspark to launch the PySpark shell. If it starts without errors, the setup is successful.
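You can also check which installation is on your PATH: running spark-submit --version (it ships in the same bin directory as pyspark) prints the Spark version banner.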
Testing the PySpark Installation:
- Open a new terminal window.
- Run the following command to start the PySpark shell:
pyspark
- If everything is set up correctly, you should see the PySpark shell start and a Python prompt (>>>) appear.
- You can test PySpark by running simple commands, such as creating RDDs or DataFrames and performing basic operations on them, as shown below.
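As a minimal smoke test, here is a short sketch assuming you are inside the pyspark shell, where a SparkSession is already available as spark and a SparkContext as sc:

# Create a small DataFrame and inspect it
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.show()
df.filter(df.id > 1).show()

# Create an RDD and run a simple transformation and action
rdd = sc.parallelize(range(10))
print(rdd.map(lambda x: x * x).sum())

If the DataFrame prints as a small table and the RDD sum evaluates (285 for the squares of 0 through 9), PySpark is working end to end.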