How to Install Apache Spark on Ubuntu 23

Welcome to our tutorial on setting up Apache Spark on Ubuntu 23! Whether you’re a data enthusiast or a budding data engineer, Apache Spark is a powerful open-source framework that can supercharge your big data processing tasks. In this guide, we’ll walk you through the installation process, breaking it down into simple steps with examples for newcomers.

Prerequisites

Before we dive into the installation, make sure you have the following prerequisites:

Ubuntu 23: Ensure you have a clean installation of Ubuntu 23.
Java Installation: Apache Spark on Ubuntu 23 requires Java to run. Install OpenJDK with the following command:

sudo apt update
sudo apt install openjdk-11-jdk

Step 1: Download Apache Spark on Ubuntu 23

Head over to the official Apache Spark download page and grab the latest version. You can use wget to download it directly to your terminal:

wget https://downloads.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz

Replace 3.5.1 with the appropriate version numbers.

Step 2: Extract the Archive

Once the download is complete, extract the Spark archive using the following command:

tar -xvzf spark-3.5.1-bin-hadoop3.tgz

Again, replace 3.5.1 with your version numbers.

Step 3: Move Spark to a Permanent Location

Move the extracted Spark directory to a location of your choice. For example, let’s move it to /opt/:

sudo mv spark-3.5.1-bin-hadoop3 /opt/spark

Step 4: Set Environment Variables

Open your .bashrc file to set up environment variables:

nano ~/.bashrc

Add the following lines at the end of the file:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_PYTHON=python3

Set Apache Spark on Ubuntu 23 Environment Variables

Save the file and execute:

source ~/.bashrc

Step 5: Test Your Installation

To ensure that Apache Spark is installed correctly, open a new terminal and run:

spark-shell

This will launch the Spark shell, and you should see the Spark logo with a prompt. You’re now ready to harness the power of Apache Spark on Ubuntu 23!

Check Apache Spark on Ubuntu 23 installation

Spark Context Web UI

The Spark context Web UI is available at http://server_ip:4040. You can access this URL in your web browser to monitor Spark application details, jobs, and stages.

If you want to ensure that your Apache Spark cluster remains active as a service on Ubuntu, you can use a combination of tools like systemd and cron jobs to monitor and restart the Spark processes if needed. Here’s a simple example:

Create a systemd Service:

Create a systemd service file for Apache Spark. Create a file, for example, spark.service, in the /etc/systemd/system/ directory:

[Unit]
Description=Apache Spark

[Service]
Type=simple
ExecStart=/opt/spark/sbin/start-master.sh
Restart=always
RestartSec=3

[Install]
WantedBy=default.target

Enable and Start the Service:

Run the following commands to enable and start the service:

sudo systemctl enable spark
sudo systemctl start spark

This will start the Spark master as a systemd service.

Create a script, for example, monitor_spark.sh, that checks if Spark is running and restarts it if needed:

#!/bin/bash

SPARK_PID=$(pgrep -f "org.apache.spark.deploy.master.Master")

if [ -z "$SPARK_PID" ]; then
    echo "Spark is not running. Restarting..."
    /opt/spark/sbin/start-master.sh
else
    echo "Spark is running (PID: $SPARK_PID)."
fi

Make the script executable:

chmod +x monitor_spark.sh
sh monitor_spark.sh

sh monitor_spark.sh
Spark is not running. Restarting...
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-sysnestor.out

systemctl status spark
● spark.service - Apache Spark
     Loaded: loaded (/etc/systemd/system/spark.service; enabled; preset: enabled)
     Active: activating (auto-restart) since Sun 2024-03-03 12:28:00 UTC; 292ms ago
    Process: 1552923 ExecStart=/opt/spark/sbin/start-master.sh (code=exited, status=0/SUCCESS)
   Main PID: 1552923 (code=exited, status=0/SUCCESS)
        CPU: 6.338s

Navigate to your Apache Spark web console at http://your-ip:8080/

Congratulations! You’ve successfully installed Apache Spark on Ubuntu 23. This guide aimed to make the process accessible for beginners, providing detailed steps and explanations. Feel free to explore Spark’s vast capabilities and integrate it into your data processing projects.

Remember, the world of big data awaits, and Apache Spark is your gateway to unlocking its full potential. Happy coding!