Apache Spark -SNAPSHOT on Kubernetes

Mar 18, 2022

In my previous article I showed how to deploy Apache Spark jobs on a local Kubernetes cluster using an official Apache Spark release and an official Docker image.

To test the new integration of Apache Spark with the Volcano scheduler for Kubernetes, which will be released with Spark v3.3.0 in the next few months, we need to build a Spark distribution and a Docker image locally.

Let’s see what it takes to do it!

1. Check out the Spark source:

git clone --branch branch-3.3 https://github.com/apache/spark.git

2. Build a distribution from the root of the cloned repository:

./dev/make-distribution.sh --tgz --name with-volcano -Pkubernetes,volcano,hadoop-3

This command produces a distribution named spark-3.3.0-SNAPSHOT-bin-with-volcano.tgz. Extracting it yields a folder with a structure similar to that of an official distribution.
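As a quick sanity check, the archive name can be derived from the version and the --name value, and the archive extracted if it is present. This is only a sketch: dist_name is a hypothetical helper introduced here, and the version string is taken from the file name above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the distribution name from the Spark version
# and the value passed to --name.
dist_name() {
  local version="$1" name="$2"
  printf 'spark-%s-bin-%s' "$version" "$name"
}

DIST="$(dist_name 3.3.0-SNAPSHOT with-volcano)"
echo "Expecting ${DIST}.tgz"

# Extract only if the archive exists in the current directory
if [ -f "${DIST}.tgz" ]; then
  tar -xzf "${DIST}.tgz"
  ls "${DIST}"
fi
```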

3. Set the SPARK_HOME environment variable to point to this folder, e.g.:

export SPARK_HOME=/path/to/spark-3.3.0-SNAPSHOT-bin-with-volcano
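Before going further it may help to verify that the folder really looks like a Spark distribution built with the kubernetes profile. A minimal sketch, assuming the usual layout of such a distribution (bin/spark-submit and kubernetes/dockerfiles/spark/Dockerfile); check_spark_home is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if the folder contains the files that a
# Kubernetes-enabled Spark distribution is assumed to ship.
check_spark_home() {
  local home="$1"
  if [ ! -x "${home}/bin/spark-submit" ]; then
    echo "missing ${home}/bin/spark-submit" >&2
    return 1
  fi
  if [ ! -f "${home}/kubernetes/dockerfiles/spark/Dockerfile" ]; then
    echo "missing ${home}/kubernetes/dockerfiles/spark/Dockerfile" >&2
    return 1
  fi
  return 0
}

if check_spark_home "${SPARK_HOME:-}"; then
  echo "SPARK_HOME looks usable: ${SPARK_HOME}"
else
  echo "SPARK_HOME is not set or incomplete" >&2
fi
```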

4. Build a Docker image:

Note: Here we use openeuler/openeuler:20.03-lts-sp3 as a base Docker image. Any other ARM64 Linux image would work too!

bash ./create-docker-image.bash

Successful execution of this command produces the Docker image that we will use for the Kubernetes pods:

$ docker images
REPOSITORY   TAG                                     IMAGE ID       CREATED          SIZE
spark        3.3.0-SNAPSHOT-scala_2.12-11-jre-slim   df51f38e5c73   18 seconds ago   542MB
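The contents of create-docker-image.bash are not reproduced here. One possible way to build a similar image is the bin/docker-image-tool.sh script that ships with the distribution. The sketch below rests on assumptions: image_tag is a hypothetical helper, and the build uses the Dockerfile bundled with the distribution rather than the openEuler base image mentioned in the note.

```shell
#!/usr/bin/env bash
# Hypothetical helper: composes the image tag from the Spark version,
# the Scala version and the Java base image tag.
image_tag() {
  local spark="$1" scala="$2" java="$3"
  printf '%s-scala_%s-%s' "$spark" "$scala" "$java"
}

TAG="$(image_tag 3.3.0-SNAPSHOT 2.12 11-jre-slim)"
echo "Image will be tagged spark:${TAG}"

# Run the build only when the tool is available under SPARK_HOME
if [ -x "${SPARK_HOME:-}/bin/docker-image-tool.sh" ]; then
  (cd "${SPARK_HOME}" && ./bin/docker-image-tool.sh -t "${TAG}" build)
else
  echo "docker-image-tool.sh not found, skipping the build" >&2
fi
```

Without the -r option the tool names the image spark, which matches the REPOSITORY column above.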

5. Let’s test the new distribution and Docker image by modifying run-spark-pi-on-k8s.bash from the previous article:

There are three changes:

1. Line 8

SPARK_HOME="/path/to/spark-3.3.0-SNAPSHOT-bin-with-volcano"

2. Line 16

CONTAINER_IMAGE="spark:3.3.0-SNAPSHOT-scala_2.12-11-jre-slim"

3. Line 38

local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0-SNAPSHOT.jar
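Put together, the three values feed a spark-submit call roughly like the one below. This is a sketch, not the script from the previous article: the API server URL, the application name, the executor count and the examples_jar helper are assumptions made for illustration.

```shell
#!/usr/bin/env bash
SPARK_HOME="${SPARK_HOME:-/path/to/spark-3.3.0-SNAPSHOT-bin-with-volcano}"
CONTAINER_IMAGE="spark:3.3.0-SNAPSHOT-scala_2.12-11-jre-slim"
# Assumed API server address; `kubectl cluster-info` prints the real one
K8S_MASTER="${K8S_MASTER:-k8s://https://127.0.0.1:6443}"

# Hypothetical helper: path of the examples jar inside the container image
examples_jar() {
  local scala="$1" version="$2"
  printf 'local:///opt/spark/examples/jars/spark-examples_%s-%s.jar' "$scala" "$version"
}

if [ -x "${SPARK_HOME}/bin/spark-submit" ]; then
  "${SPARK_HOME}/bin/spark-submit" \
    --master "${K8S_MASTER}" \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=2 \
    --conf spark.kubernetes.container.image="${CONTAINER_IMAGE}" \
    "$(examples_jar 2.12 3.3.0-SNAPSHOT)"
else
  echo "spark-submit not found under ${SPARK_HOME}" >&2
fi
```

A real cluster usually also needs a service account for the driver, for example via spark.kubernetes.authenticate.driver.serviceAccountName, which is left out here.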

With these steps we are ready to make use of the new Volcano integration! Stay tuned for the next article!

Update (March 24, 2022): It seems the -Pvolcano Maven profile won’t be enabled for the Apache Spark 3.3.0 release! That means one will need to build Spark locally from source to use the Volcano integration!

Written by Martin Grigorov
