Apache Spark on Kubernetes
This article is part of a series of related articles.
In my previous article we tested several Kubernetes distributions that could be used for local development on openEuler OS, and found that only Minikube and Kind worked fine.
In this article we will execute Apache Spark jobs on our local Kubernetes cluster.
What is Apache Spark?
Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters. It provides high-level APIs in Scala, Java, Python, and R, as well as an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools, including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.
What is Kubernetes?
Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. It groups containers that make up an application into logical units for easy management and discovery.
Apache Spark on Kubernetes
Apache Spark has quite detailed documentation on how to deploy on Kubernetes. Another good resource is Jacek Laskowski's book.
Let’s summarize the steps quickly!
1. Start Minikube with
minikube start
2. Tell the docker command line tool to use Minikube's Docker daemon (fish shell syntax; a Bash equivalent is shown after the step list) with
minikube -p minikube docker-env | source
3. Create Kubernetes resources
3.1. A Kubernetes namespace to isolate all other Kubernetes resources needed by Apache Spark
export K8S_NAMESPACE="spark-on-k8s"
kubectl create namespace $K8S_NAMESPACE
3.2. A Kubernetes clusterrolebinding and a serviceaccount, which the driver pod needs in order to create executor pods.
export K8S_SERVICE_ACCOUNT_NAME="spark-account-name"
export K8S_CLUSTER_ROLE="spark-on-k8s-cluster-role"
kubectl create serviceaccount $K8S_SERVICE_ACCOUNT_NAME --namespace=$K8S_NAMESPACE
kubectl create clusterrolebinding $K8S_CLUSTER_ROLE --clusterrole=edit --serviceaccount=${K8S_NAMESPACE}:${K8S_SERVICE_ACCOUNT_NAME} --namespace=$K8S_NAMESPACE
4. Download the Apache Spark distribution from https://spark.apache.org/downloads.html, for example https://www.apache.org/dyn/closer.lua/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
5. Extract the package in a folder, e.g.
mkdir ~/spark && tar zxvf spark-3.2.1-bin-hadoop3.2.tgz -C ~/spark
6. Export the $SPARK_HOME environment variable:
export SPARK_HOME="$HOME/spark/spark-3.2.1-bin-hadoop3.2"
7. Now we are ready to execute the job: ./run-spark-pi-on-k8s.fish (a sketch of what such a script might contain is shown below).
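A quick note on step 2: minikube -p minikube docker-env | source is fish shell syntax. If you use Bash or Zsh instead, the equivalent form documented by Minikube is:
eval $(minikube -p minikube docker-env)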
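The contents of run-spark-pi-on-k8s.fish are not reproduced here, so the following is only a minimal sketch of what such a script could look like. The image repository and tag (local/spark:v3.2.1), the API server address, and the executor count are assumptions; take the real API server URL from kubectl cluster-info and adjust to your setup.
# Build the Spark container image inside Minikube's Docker daemon
$SPARK_HOME/bin/docker-image-tool.sh -r local -t v3.2.1 build
# Submit the SparkPi example in cluster mode
$SPARK_HOME/bin/spark-submit \
    --master k8s://https://192.168.49.2:8443 \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=2 \
    --conf spark.kubernetes.namespace=$K8S_NAMESPACE \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=$K8S_SERVICE_ACCOUNT_NAME \
    --conf spark.kubernetes.container.image=local/spark:v3.2.1 \
    local:///opt/spark/examples/jars/spark-examples_2.12-3.2.1.jar 1000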
If everything is fine, the job should complete successfully. Checking the logs of the driver with
kubectl -n $K8S_NAMESPACE logs -l spark-role=driver --tail=-1 | grep roughly
should print something like Pi is roughly 3.143515717578588
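If the grep finds nothing, the driver and executor pods can be inspected with standard kubectl commands, for example:
kubectl -n $K8S_NAMESPACE get pods
kubectl -n $K8S_NAMESPACE describe pod -l spark-role=driver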
Congratulations!
To do the same with Kind you should switch the currently active Kubernetes context with kubectl config use-context my-kind-cluster-name
and execute steps 3–7 from above.
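Note that Kind does not share Minikube's Docker daemon, so a locally built image must be loaded into the Kind cluster explicitly. Assuming the image name from the sketch above (and noting that --name takes the Kind cluster name, not the kubectl context name), this can be done with:
kind load docker-image local/spark:v3.2.1 --name my-kind-cluster-name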
In the next article we will explore the integration of Apache Spark 3.3.0 (not yet released at the time of writing) with the Volcano Kubernetes scheduler.