Proof of Concept

At the time of writing, Spark does not have an official Docker image on Docker Hub, so we use the gradiant/spark:2.4.0 image.
This image is built from a modified version of the Dockerfile provided in the Apache Spark distribution (spark-2.4.0/kubernetes/dockerfiles/spark/). The modified version adds the Hadoop native libraries for Alpine Linux and the Kafka libraries.

The following recipe runs Spark with Kubernetes as the job scheduler. We will use a local Kubernetes deployment (minikube) for testing.

minikube start --memory=4096 --cpus=3

Set up a Kubernetes service account with permissions to create pods and services:

kubectl create serviceaccount spark
kubectl create rolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
# start a local proxy to the Kubernetes API (runs in the foreground)
kubectl proxy
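
The role binding above can be checked before submitting a job. The following is a minimal sketch, not part of the original recipe: check_spark_permissions is a hypothetical helper name, and it assumes the service account and namespace created above (spark in default). It relies on kubectl auth can-i with impersonation, so it must be run by a user who is allowed to impersonate service accounts (the minikube admin user normally is). Because kubectl proxy stays in the foreground, run it in a second terminal.

```shell
# Sketch: confirm that the "spark" service account may create pods and services.
# check_spark_permissions is a hypothetical helper, not part of kubectl or Spark.
check_spark_permissions() {
  local namespace="${1:-default}"
  local account="${2:-spark}"
  local resource
  for resource in pods services; do
    # With --quiet, `kubectl auth can-i` prints nothing and reports the
    # answer through its exit code (0 = allowed, non-zero = denied).
    if ! kubectl auth can-i create "$resource" \
        --as="system:serviceaccount:${namespace}:${account}" \
        --namespace="$namespace" --quiet; then
      echo "service account ${account} cannot create ${resource}" >&2
      return 1
    fi
  done
  echo "service account ${account} can create pods and services"
}
```

Calling check_spark_permissions with no arguments checks the defaults used in this recipe; a different namespace and account can be passed as the first and second arguments.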

The Kubernetes API is now accessible at

We run a container as a Spark client and configure Kubernetes as the master:

docker run --rm -ti --net host gradiant/spark:2.4.0 spark-submit \
--master k8s:// \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=gradiant/spark:2.4.0-k8s \
--conf spark.kubernetes.executor.request.cores=0.2 \
--executor-memory 500M \
$SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 100
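
In cluster mode the result is written to the driver pod's log rather than to the client's terminal. The sketch below shows one way to retrieve it, assuming the job ran in the default namespace; spark_pi_result is a hypothetical helper name. It relies on the spark-role=driver label that Spark sets on driver pods and on the "Pi is roughly" line that the SparkPi example prints.

```shell
# Sketch: print the SparkPi result from the most recently created driver pod.
# spark_pi_result is a hypothetical helper, not part of kubectl or Spark.
spark_pi_result() {
  local namespace="${1:-default}"
  local driver
  # Spark labels the driver pod with spark-role=driver; sort by creation
  # time and keep the last entry so that older runs are ignored.
  driver="$(kubectl get pods --namespace "$namespace" \
      --selector spark-role=driver \
      --sort-by=.metadata.creationTimestamp \
      --output name | tail -n 1)"
  if [ -z "$driver" ]; then
    echo "no Spark driver pod found in namespace ${namespace}" >&2
    return 1
  fi
  # SparkPi writes a line starting with "Pi is roughly" to the driver log.
  kubectl logs --namespace "$namespace" "$driver" | grep "Pi is roughly"
}
```

The driver pod stays in the Completed state after the application finishes, so the helper can be called once the job is done. If the log line is not there yet, grep finds nothing and the function returns a non-zero status.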

Integration with Jupyter Notebooks

To use Spark on Kubernetes inside a Jupyter notebook, we need to:

