Apache Spark 3.3.0-SNAPSHOT on Kubernetes
In my previous article I showed you how to deploy Apache Spark jobs on a local Kubernetes cluster using an official release of Apache Spark and an official Docker image.
To test the new integration of Apache Spark with the Volcano Kubernetes scheduler, which will be released with Spark v3.3.0 in the next few months, we will need to build a Spark distribution and a Docker image locally.
Let’s see what it takes to do it!
1. The first step is to check out the Spark source:
git clone --branch branch-3.3 https://github.com/apache/spark.git
2. Build a distribution:
./dev/make-distribution.sh --tgz --name with-volcano -Pkubernetes,volcano,hadoop-3
This command will produce a distribution named spark-3.3.0-SNAPSHOT-bin-with-volcano.tgz. Uncompressing it will produce a folder with a structure similar to the one you get after extracting an official distribution.
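For example, a quick way to unpack the archive and glance at the layout (the archive lands in the Spark source root; the path here is just an illustration):

# Extract the freshly built distribution
tar -xzf spark-3.3.0-SNAPSHOT-bin-with-volcano.tgz
# The folder contains the familiar bin/, sbin/, jars/, examples/ and kubernetes/ directories
ls spark-3.3.0-SNAPSHOT-bin-with-volcano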
3. Set the SPARK_HOME environment variable to point to this folder, e.g.:
export SPARK_HOME=/path/to/spark-3.3.0-SNAPSHOT-bin-with-volcano
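As an optional sanity check that the variable points at the right build, you can ask the distribution for its version:

"${SPARK_HOME}/bin/spark-submit" --version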
4. Build a Docker image
Note: Here we use openeuler/openeuler:20.03-lts-sp3 as the base Docker image. Any other ARM64 Linux image would work too!
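The create-docker-image.bash helper itself isn't reproduced here; below is a minimal sketch of what such a script could look like, assuming it simply wraps the docker-image-tool.sh that ships with the distribution. The real script presumably also swaps in the openeuler base image (for example via a custom Dockerfile passed with -f), which this sketch omits:

#!/usr/bin/env bash
# create-docker-image.bash -- illustrative sketch only; the real script may differ.
set -euo pipefail

cd "${SPARK_HOME}"

# Build the Spark container image from the locally built distribution.
# The tag is chosen to match the image name used later in this article.
./bin/docker-image-tool.sh \
  -t 3.3.0-SNAPSHOT-scala_2.12-11-jre-slim \
  build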
bash ./create-docker-image.bash
The successful execution of this command will produce the Docker image that we will use for the Kubernetes pods:
$ docker images
REPOSITORY   TAG                                     IMAGE ID       CREATED          SIZE
spark        3.3.0-SNAPSHOT-scala_2.12-11-jre-slim   df51f38e5c73   18 seconds ago   542MB
5. Let’s test the new distribution and Docker image by modifying run-spark-pi-on-k8s.bash from the previous article:
There are three changes (a sketch of the resulting script follows the list):
1. Line 8:
SPARK_HOME="/path/to/spark-3.3.0-SNAPSHOT-bin-with-volcano"
2. Line 16:
CONTAINER_IMAGE="spark:3.3.0-SNAPSHOT-scala_2.12-11-jre-slim"
3. Line 38:
local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0-SNAPSHOT.jar
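If you don't have run-spark-pi-on-k8s.bash from the previous article at hand, here is a rough sketch of what the modified script could look like. Only the three values above come from this article; the API server address, service account name, executor count, and the overall layout (so the line numbers won't match) are illustrative assumptions:

#!/usr/bin/env bash
# run-spark-pi-on-k8s.bash -- illustrative sketch; the real script may differ.
set -euo pipefail

# Change 1: point to the locally built distribution
SPARK_HOME="/path/to/spark-3.3.0-SNAPSHOT-bin-with-volcano"

# Change 2: use the locally built container image
CONTAINER_IMAGE="spark:3.3.0-SNAPSHOT-scala_2.12-11-jre-slim"

# Assumed values: adjust to your own cluster
K8S_MASTER="k8s://https://127.0.0.1:6443"
SERVICE_ACCOUNT="spark"

# Change 3: the SNAPSHOT examples jar baked into the image
EXAMPLES_JAR="local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0-SNAPSHOT.jar"

"${SPARK_HOME}/bin/spark-submit" \
  --master "${K8S_MASTER}" \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image="${CONTAINER_IMAGE}" \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName="${SERVICE_ACCOUNT}" \
  "${EXAMPLES_JAR}"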
With these steps we are ready to make use of the new Volcano integration! Stay tuned for the next article!
Update (March 24, 2022): It seems the -Pvolcano Maven profile won't be enabled for the Apache Spark 3.3.0 release! That means one will need to build Spark locally from source to be able to use the Volcano integration!