Flink will be introduced to PNDA as an alternative engine for data processing.
At a high level this involves:
The following section discusses the changes required to each PNDA component.
Flink jobs will be submitted as single jobs with no need to have a Flink cluster. More details of Flink YARN integration are here
Flink resources and any other dependencies will be hosted on the PNDA mirror. The mirror build script will need to include these in the appropriate mirror section.
A Flink history server will be run on one of the PNDA manager nodes.
Flink client libraries will be installed on all nodes with a hadoop EDGE role.
Support will be added for deploying Flink components in application packages.
A Flink component plugin will be created that will run Flink applications in the same way as the Spark Streaming plugin currently does. A supervisor will be set up on the PNDA edge node that will call the Flink CLI to submit the job.
We will not include support for Flink in the Deployment manager application information service as this is currently under review, and the implementation is likely to change.
A health metric will be generated for Flink "hadoop.flink.health" which will show the health status of the Flink history server.
The presence of a running, healthy History Server will be verified by calling API “/joboverview” on the history server.
Flink application metrics will be gathered as described here and in the post http://mnxfst.tumblr.com/post/136539620407/receiving-metrics-from-apache-flink-applications and associated with the relevent application and component by prefixing with "application.kpi.<applicaton-name>.<component-name>".
Candidate metrics: https://github.com/apache/flink/tree/master/flink-metrics
The PNDA console dashboard page will be modified to include add Flink blocks under both Stream & Batch sections.
The Flink block will link to the history server UI.
The Flink block will link to a help pop-up with include help text and a link to the Flink documentation.
The Flink history server log will be aggregated by the PNDA logshipper.
Flink application logs will be aggregated by the PNDA logshipper by configuring the logshipper to collect logs from the YARN container folder on each node manager. To match spark streaming, this will gather the stdout and stderr logs and a flink specific log called flink.log that the user can configure their application to write to if they want to.
A utility library will be provided that offers scala functions for reading the PNDA master dataset into Flink objects/streams.
Any Kafka and Avro libraries required to read/write to Kafka from Flink will be supplied as part of PNDA.
Prototyping and exploration using Flink with Jupyter shows that the support which is there right now is not good enough. We will monitor this and include support when it is fit-for-purpose.
More details and open issues of Jupyter integration with Flink are here
An example application that demonstrates the benfits of Flink over Spark in that case - e.g. Stateful continuous processing (non-batch mode).
Sections of guide will need creating or updating to reference Flink
Expose Flink as a runner to Apache Beam, and move to Apache Beam as primary SDK for application development. This is something under review and could be an incremental step forward from basic Flink support (it doesn't need to be an either/or).