Kafka is the "front door" of PNDA, allowing the ingest of high-velocity data streams, distributing data to all interested consumers and decoupling data sources from data processing applications and platform clients.
Kafka Connect, an open source component of Kafka, is a framework for connecting Kafka with external systems such as databases, key-value stores, search indexes, and file systems. Using Kafka Connect you can use existing connector implementations for common data sources and sinks to move data into and out of Kafka.
We want to store the events that arrives to the kafka topics in HDFS datesets. Currently PNDA uses Apache Gobblin to ingest data streams from kafka to HDFS.
However Apache Gobblin has some disadvantages:
- Apache Incubator Project.
- Configuration complexity.
- Job distribution through MapReduce mode disadvantages (from gobblin documentation):
- The MR mode suffers from problems such as large overhead mostly due to the overhead of submitting and launching a MR job and poor cluster resource usage.
- The MR mode is also fundamentally not appropriate for real-time data ingestion given its batch nature.
NOTE: Recently Apache Gobblin added support of Apache Yarn as orchestrator for job distribution that may solves some of the previous disadvantages.
Here we explore kafka-connect as an alternative, a component of Kafka that would let us remove Yarn dependency and simplify configuration, management and scaling of the stream ingest jobs.
We will setup a local deployment of PNDA to test how kafka-connect works. Follow the Testing locally with microk8s' instructions, using the next yaml profile to enable kafka-connect related components:
Kafka-connect-ui can be accessed through the pnda-kafka-connect-ui service ip:
Accessing kafka-connect-ui service ip through your browser:
Now we will configure some HDFS Sink Connectors to ingest events in kafka topics to HDFS.
Following instructions in PNDA-guide/topic-preparation we find different kind of topics:
Matching schema topic
When data being streamed to the platform already happends to adhere to the PNDA schema.
Let's divide the test in two parts, depending in how the data is encoded in the kafka topic:
We use the kafka-avro-console-producer tool to produce some avro encoded data in a topic. The tool is available in the cp-schema-registry pod so we open a bash terminal inside the cp-schema-registry container of the pod:
Then copy the next lines in the topic. If you don't respect the pnda avro schema, the kafka-avro-console-producer will exit with error.
CTRL+C to exit pnda-avro-consol-producer and 'exit' to close container terminal.
Go back to the kafka-connect-ui and create a new HDFS Sink connector with the following config properties:
A new avro file containting $flush.size events should be writen to HDFS. dataset folder is named as the kafka topic, and can be further partitioned with partitioner classes (more info at hdfs-connector-doc).
You can check if files are correctly writen through "Utilities" > "Browse the Filesystem" in the HDFS-namenode ui:
Default port is (50070).
Apparently Kafka-connect does not allow write in HDFS in Avro format if events are writen in topic with Json format. if you set flush_size to 1, It will correctly write a diferent avro file per event (does not make to much sense).
An issue describing this problem is written here.
Avro configured topic