Take the work contributed to Red PNDA for containerisation and expand it to deliver a fully cloud-native PNDA.
PNDA Containerization Current State (red-pnda)
Red-pnda provisions a minimal set of PNDA components so that developers can write apps targeted at the full PNDA stack. One of its latest features is the ability to deploy red-pnda as a set of Docker containers, which uses fewer resources than the monolithic virtual machine deployment.
A set of Dockerfiles is included in the red-pnda GitHub repo to build the corresponding container for each PNDA deployment unit. The Dockerfiles currently accept a "version" build-arg, which selects the PNDA component release to download from its GitHub repo.
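As a rough illustration of how the "version" build-arg might be passed when building one of these images, the sketch below only assembles the docker build command line and does not run it. The component name, image tag scheme and context directory are hypothetical; only the "version" build-arg name comes from the description above.

```python
import shlex
from typing import List


def docker_build_command(component: str, version: str, context_dir: str) -> List[str]:
    """Return the argv for building one component image at a given release."""
    if not component or not version:
        raise ValueError("component and version are required")
    return [
        "docker", "build",
        # "version" is the build-arg name described in the text above.
        "--build-arg", f"version={version}",
        # Hypothetical image naming scheme, used for illustration only.
        "--tag", f"pnda/{component}:{version}",
        context_dir,
    ]


if __name__ == "__main__":
    # Hypothetical component name, release and context directory.
    cmd = docker_build_command("example-component", "1.0.0", "./example-component")
    print(shlex.join(cmd))
```

Returning the arguments as a list rather than a single string avoids shell-quoting problems if the command is later handed to subprocess.run.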
Of the identified PNDA deployment units, the dockerized ones are highlighted.
Plan
- Define a set of common rules for creating the Dockerfile in each PNDA component.
- Set uniform environment variable names for container configuration (e.g., the HDFS NameNode URI must be configured with the same environment variable name in all components).
- Integrate the corresponding Dockerfile for each deployment unit into its repo. Each Dockerfile must be modified to take the source code directly from its own repo (COPY instruction) and to derive the version, possibly with git describe.
- Integrate Apache Spark with Kubernetes (supported since the Spark 2.3 release).
- Dataset persistence in the cloud. Study cloud storage alternatives to HDFS (S3, or MinIO for on-prem). Improve Data-mgmt to work with cold/warm storage paradigms (S3 vs Glacier ...).
- Separate infrastructure orchestration from container orchestration, i.e., use a DevOps tool such as SaltStack or Ansible to deploy infrastructure resources (AWS EC2, MaaS) and configure the Kubernetes master and workers, then use Kubernetes to deploy the PNDA deployment units.
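
Two of the plan items above, uniform environment variable names and deriving the version with git describe, can be sketched in a few lines. This is a minimal sketch under stated assumptions, not an agreed convention: the variable names PNDA_HDFS_NAMENODE_URI and PNDA_KAFKA_BROKERS and the helper names are hypothetical. The plan only requires that all components share the same names.

```python
import os
import subprocess
from dataclasses import dataclass
from typing import List, Mapping, Optional

# Hypothetical shared names; every component would read the same variables.
ENV_HDFS_NAMENODE_URI = "PNDA_HDFS_NAMENODE_URI"
ENV_KAFKA_BROKERS = "PNDA_KAFKA_BROKERS"


@dataclass(frozen=True)
class CommonConfig:
    hdfs_namenode_uri: str
    kafka_brokers: List[str]


def load_common_config(env: Mapping[str, str] = os.environ) -> CommonConfig:
    """Read the shared settings, failing early if any are missing or empty."""
    required = (ENV_HDFS_NAMENODE_URI, ENV_KAFKA_BROKERS)
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    brokers = [b.strip() for b in env[ENV_KAFKA_BROKERS].split(",") if b.strip()]
    return CommonConfig(env[ENV_HDFS_NAMENODE_URI], brokers)


def source_version(repo_dir: str = ".") -> Optional[str]:
    """Return the output of `git describe --tags --always`, or None on failure."""
    try:
        result = subprocess.run(
            ["git", "describe", "--tags", "--always"],
            cwd=repo_dir, check=True, capture_output=True, text=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or repo_dir is missing or not a git checkout.
        return None
    return result.stdout.strip() or None


if __name__ == "__main__":
    print(source_version("."))
    print(load_common_config({
        ENV_HDFS_NAMENODE_URI: "hdfs://namenode:8020",
        ENV_KAFKA_BROKERS: "kafka-0:9092,kafka-1:9092",
    }))
```

Keeping the names in a single shared module (or documenting them once) is one way to ensure that the same Kubernetes ConfigMap or container environment can be injected into every component unchanged.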