Proof of Concept
Spark does not have an official Docker image on Docker Hub at the time of writing, so we use the gradiant/spark:2.4.0 image.
This image is built from a modified version of the Dockerfile shipped with the Apache Spark distribution (spark-2.4.0/kubernetes/dockerfiles/spark/), extended with Hadoop native libraries for Alpine Linux and with Kafka libraries.
Recipe for trying Spark with Kubernetes as the job scheduler. We will use a local Kubernetes deployment (minikube) for testing:
```
minikube start --memory=4096 --cpus=3
```
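Before going further it is worth confirming the local cluster is actually up; a quick sanity check, assuming kubectl is already pointed at the minikube context:

```
# Verify the minikube VM and the API server are reachable
minikube status
kubectl cluster-info
```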
Setting up a Kubernetes service account with permissions to create pods and services:
```
kubectl create serviceaccount spark
```
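Creating the service account alone grants it no permissions. The usual way to give it the rights to create pods and services, following the official Spark-on-Kubernetes documentation, is to bind it to the edit cluster role; we also proxy the Kubernetes API to localhost so spark-submit can reach it without credentials:

```
# Grant the spark service account the edit role (covers pods and services);
# assumes everything runs in the default namespace
kubectl create clusterrolebinding spark-role --clusterrole=edit \
  --serviceaccount=default:spark --namespace=default

# Proxy the Kubernetes API to localhost (serves on 127.0.0.1:8001 by default)
kubectl proxy &
```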
With the proxy running, the Kubernetes API is now accessible at http://127.0.0.1:8001.
We run a container as the Spark client, with Kubernetes configured as the master:
```
docker run --rm -ti --net host gradiant/spark:2.4.0 spark-submit \
  --master k8s://http://127.0.0.1:8001 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=gradiant/spark:2.4.0-k8s \
  --conf spark.kubernetes.executor.request.cores=0.2 \
  --executor-memory 500M \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 100
```
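In cluster mode the driver itself runs as a pod, so the job can be followed from kubectl. A quick way to do this, assuming the spark-pi example above (Spark on Kubernetes labels its pods with spark-role):

```
# List the driver pod(s) created by spark-submit
kubectl get pods -l spark-role=driver

# Stream the driver log; substitute the pod name printed above
kubectl logs -f <spark-pi-driver-pod-name>
```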
Integration with Jupyter Notebooks
To use Spark on K8s inside a Jupyter notebook we need to:
- Add pyspark to sys.path. We can do this at runtime using findspark (see the sketch after this list).
- Configure Spark to run in client mode, which is supported for Kubernetes as of Spark 2.4.0.
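A minimal sketch of what this looks like in a notebook cell, assuming pyspark 2.4.0 is available under $SPARK_HOME, findspark is installed, and the kubectl proxy from the previous section is still running; the app name and driver host are placeholders to adapt:

```python
# A minimal sketch of Spark on K8s in client mode from a notebook cell.
import findspark
findspark.init()  # prepend $SPARK_HOME's pyspark to sys.path

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Building the session in-process runs the driver here: client mode.
    .master("k8s://http://127.0.0.1:8001")
    .appName("jupyter-spark")  # placeholder name
    .config("spark.executor.instances", "2")
    .config("spark.kubernetes.container.image", "gradiant/spark:2.4.0-k8s")
    # Executors run in pods and must connect back to this driver, so
    # spark.driver.host needs an address routable from inside the cluster.
    .config("spark.driver.host", "HOST_IP_REACHABLE_FROM_PODS")  # placeholder
    .getOrCreate()
)

# Smoke test: sum 0..99 on the executors
print(spark.sparkContext.parallelize(range(100)).sum())
```

Note that in client mode the executors run in pods while the driver runs in the notebook process, so spark.driver.host must be an address the pods can route to (for minikube, typically the host's IP on the minikube network).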