Apache Spark on Kubernetes
This article is part of a series of related articles.
In my previous article we tested several Kubernetes distributions that could be used for local development on openEuler OS, and found that only Minikube and Kind worked fine.
In this article we will execute Apache Spark jobs on our local Kubernetes cluster.
What is Apache Spark?
Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters. It provides high-level APIs in Scala, Java, Python, and R, as well as an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools, including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.
What is Kubernetes?
Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. It groups containers that make up an application into logical units for easy management and discovery.
Apache Spark on Kubernetes
Apache Spark has quite detailed documentation on how to deploy on Kubernetes. Another good resource is Jacek Laskowski's book.
Let’s summarize the steps quickly!
1. Start Minikube with
minikube start
2. Tell the docker command line tool to use Minikube's Docker daemon (fish shell syntax; a Bash equivalent is shown after the step list) with
minikube -p minikube docker-env | source
3. Create Kubernetes resources
3.1. A Kubernetes namespace to isolate all other Kubernetes resources needed by Apache Spark
export K8S_NAMESPACE="spark-on-k8s"
kubectl create namespace $K8S_NAMESPACE
3.2. A Kubernetes clusterrolebinding and a serviceaccount, which the driver pod needs in order to create executor pods.
export K8S_SERVICE_ACCOUNT_NAME="spark-account-name"
export K8S_CLUSTER_ROLE="spark-on-k8s-cluster-role"
kubectl create serviceaccount $K8S_SERVICE_ACCOUNT_NAME --namespace=$K8S_NAMESPACE
kubectl create clusterrolebinding $K8S_CLUSTER_ROLE --clusterrole=edit --serviceaccount=${K8S_NAMESPACE}:${K8S_SERVICE_ACCOUNT_NAME} --namespace=$K8S_NAMESPACE
4. Download the Apache Spark distribution from https://spark.apache.org/downloads.html, for example https://www.apache.org/dyn/closer.lua/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
5. Extract the package in a folder, e.g.
mkdir ~/spark && tar zxvf spark-3.2.1-bin-hadoop3.2.tgz -C ~/spark
6. Export the $SPARK_HOME environment variable:
export SPARK_HOME="$HOME/spark/spark-3.2.1-bin-hadoop3.2"
7. Now we are ready to execute the job: ./run-spark-pi-on-k8s.fish (a sketch of what such a script might contain is shown below).
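A quick note on step 2: minikube -p minikube docker-env | source is fish shell syntax. If you use Bash or Zsh instead, the equivalent form documented by Minikube is:
eval $(minikube -p minikube docker-env)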
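The contents of run-spark-pi-on-k8s.fish are not reproduced here, so the following is only a minimal sketch of what such a script could look like. The image repository and tag (local/spark:v3.2.1), the API server address, and the executor count are assumptions; take the real API server URL from kubectl cluster-info and adjust to your setup.
# Build the Spark container image inside Minikube's Docker daemon
$SPARK_HOME/bin/docker-image-tool.sh -r local -t v3.2.1 build
# Submit the SparkPi example in cluster mode
$SPARK_HOME/bin/spark-submit \
    --master k8s://https://192.168.49.2:8443 \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=2 \
    --conf spark.kubernetes.namespace=$K8S_NAMESPACE \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=$K8S_SERVICE_ACCOUNT_NAME \
    --conf spark.kubernetes.container.image=local/spark:v3.2.1 \
    local:///opt/spark/examples/jars/spark-examples_2.12-3.2.1.jar 1000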
If everything is fine, the job should complete successfully. Checking the logs of the driver with
kubectl -n $K8S_NAMESPACE logs -l spark-role=driver --tail=-1 | grep roughly
should print something like Pi is roughly 3.143515717578588
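If the grep finds nothing, the driver and executor pods can be inspected with standard kubectl commands, for example:
kubectl -n $K8S_NAMESPACE get pods
kubectl -n $K8S_NAMESPACE describe pod -l spark-role=driver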
Congratulations!
To do the same with Kind you should switch the currently active Kubernetes context with kubectl config use-context my-kind-cluster-name
and execute steps 3–7 from above.
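Note that Kind does not share Minikube's Docker daemon, so a locally built image must be loaded into the Kind cluster explicitly. Assuming the image name from the sketch above (and noting that --name takes the Kind cluster name, not the kubectl context name), this can be done with:
kind load docker-image local/spark:v3.2.1 --name my-kind-cluster-name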
In the next article we will explore the integration of Apache Spark 3.3.0 (not yet released at the time of writing) with the Volcano Kubernetes scheduler.