Proof of Concept
At the time of writing, Spark does not have an official Docker image on Docker Hub, so we use the gradiant/spark:2.4.0 image.
This image is built from a modified version of the Dockerfile provided in the Apache Spark distribution (spark-2.4.0/kubernetes/dockerfiles/spark/) that includes the Hadoop native libraries for Alpine Linux and the Kafka libraries.
Here is a recipe to try Spark with Kubernetes as the job scheduler. We will use a local Kubernetes deployment (minikube) for testing.
minikube start --memory=4096 --cpus=3
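Set up a Kubernetes service account for Spark. To create pods and services, the account must also be bound to a suitable role; the Spark documentation suggests the edit cluster role, for example `kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default`.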
kubectl create serviceaccount spark
With `kubectl proxy` running, the Kubernetes API is accessible at http://127.0.0.1:8001
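Before submitting a job, it can help to confirm that the API really answers at that address. The following is a minimal sketch in Python, assuming the API is served without authentication at the local address above. The helper names `version_url` and `kubernetes_version` are hypothetical; `/version` is the standard Kubernetes API endpoint that reports the server version.

```python
import json
from urllib.request import urlopen


def version_url(api_base="http://127.0.0.1:8001"):
    """Build the URL of the Kubernetes /version endpoint for a given API base."""
    return api_base.rstrip("/") + "/version"


def kubernetes_version(api_base="http://127.0.0.1:8001", timeout=5):
    """Return the server's gitVersion string; raises if the API is unreachable."""
    with urlopen(version_url(api_base), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["gitVersion"]
```

Calling `kubernetes_version()` raises a `URLError` when nothing is listening on the port, which usually means the API is not being exposed locally.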
We run a container as the Spark client and set Kubernetes as the master:
docker run --rm -ti --net host gradiant/spark:2.4.0 spark-submit \
--master k8s://http://127.0.0.1:8001 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=gradiant/spark:2.4.0-k8s \
--conf spark.kubernetes.executor.request.cores=0.2 \
--executor-memory 500M \
Integration with Jupyter Notebooks
To use Spark on K8s inside a Jupyter notebook we need to:
- Add pyspark to sys.path. We can do this at runtime using findspark.
- Configure Spark to be deployed in client mode (supported on Kubernetes as of Spark 2.4.0).
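The two steps above can be sketched in a notebook cell as follows. This is a minimal sketch under stated assumptions, not a definitive setup. The helper names `k8s_client_conf` and `build_session` are hypothetical. The master URL, image, executor count and memory reuse the values from the spark-submit example. The sketch also assumes that `SPARK_HOME` points to a local Spark 2.4.0 installation, that the image ships a Python version compatible with the notebook's, and that the executor pods can reach the driver at `driver_host`.

```python
def k8s_client_conf(master="k8s://http://127.0.0.1:8001",
                    image="gradiant/spark:2.4.0-k8s",
                    executors=2,
                    driver_host=None):
    """Return Spark settings for client mode on Kubernetes as a plain dict."""
    conf = {
        "spark.master": master,
        "spark.submit.deployMode": "client",
        "spark.executor.instances": str(executors),
        "spark.executor.memory": "500m",
        "spark.kubernetes.container.image": image,
    }
    if driver_host is not None:
        # In client mode the executors connect back to the driver (the notebook),
        # so this address must be reachable from inside the pods (assumption).
        conf["spark.driver.host"] = driver_host
    return conf


def build_session(conf, app_name="spark-k8s-notebook"):
    """Create a SparkSession from the settings above.

    Needs findspark, a local Spark installation and a reachable cluster,
    so the imports are kept inside the function.
    """
    import findspark
    findspark.init()  # locates Spark (e.g. via SPARK_HOME) and adds pyspark to sys.path
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

In a notebook this would be used as `spark = build_session(k8s_client_conf())`. Keeping the settings in a plain dict makes them easy to inspect or adjust before any executor pod is requested.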