Apache Spark 3.3.0-SNAPSHOT on Kubernetes
This is part of a series of related articles.
To test the new integration of Apache Spark with the Volcano Kubernetes scheduler, which will ship with Spark v3.3.0 in the next few months, we need to build a Spark distribution and a Docker image locally.
Let’s see what it takes to do it!
1. Check out the Spark source:
git clone --branch branch-3.3 https://github.com/apache/spark.git
2. Build a distribution:
./dev/make-distribution.sh --tgz --name with-volcano -Pkubernetes,volcano,hadoop-3
This command will produce a distribution named spark-3.3.0-SNAPSHOT-bin-with-volcano.tgz. Uncompressing it yields a folder whose structure is similar to the one you get after extracting an official distribution.
3. Set the SPARK_HOME environment variable to point to this folder, e.g.:
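Assuming the tarball was extracted under your home directory (the path is illustrative; adjust it to wherever you extracted the archive):

```shell
# Illustrative path: point this at the folder extracted from the tarball.
export SPARK_HOME="$HOME/spark-3.3.0-SNAPSHOT-bin-with-volcano"
```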
4. Build a Docker image
Note: Here we use openeuler/openeuler:20.03-lts-sp3 as the base Docker image. Any other ARM64 Linux image would work too!
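The distribution ships a helper script for building the image; an invocation might look like the sketch below. The tag is illustrative (chosen to match the output further down), and swapping in the openeuler base image assumes the stock kubernetes/dockerfiles/spark/Dockerfile has been edited accordingly.

```shell
# Run from the extracted distribution; requires a local Docker daemon.
cd "$SPARK_HOME"
./bin/docker-image-tool.sh -t 3.3.0-SNAPSHOT-scala_2.12-11-jre-slim build
```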
The successful execution of this command will produce the Docker image that we will use for the Kubernetes pods:
$ docker images
REPOSITORY   TAG                                     IMAGE ID       CREATED          SIZE
spark        3.3.0-SNAPSHOT-scala_2.12-11-jre-slim   df51f38e5c73   18 seconds ago   542MB
There are three changes, at lines 8, 16, and 38.
With these steps we are ready to make use of the new Volcano integration! Stay tuned for the next article!
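To give a taste of what the integration looks like, a submission might be sketched as follows. The API server URL, image name, and jar path are placeholders; the spark.kubernetes.scheduler.name setting and the VolcanoFeatureStep feature-step class belong to the new integration.

```shell
# Sketch: submit SparkPi to Kubernetes using the Volcano scheduler.
# <api-server> and the image/jar names are placeholders for your setup.
"$SPARK_HOME"/bin/spark-submit \
  --master k8s://https://<api-server>:6443 \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=spark:3.3.0-SNAPSHOT-scala_2.12-11-jre-slim \
  --conf spark.kubernetes.scheduler.name=volcano \
  --conf spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
  --conf spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0-SNAPSHOT.jar
```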
Update (March 24, 2022): It seems the -Pvolcano Maven profile won't be enabled for the Apache Spark 3.3.0 release! That means one will need to build Spark locally from sources to be able to use the Volcano integration!