
References:

PNDA-guide/streamingest

PNDA-guide/streamingest/topic-preparation

Confluent Kafka-Connect Deep Dive/Converters and Serialization

Overview

Kafka is the "front door" of PNDA, allowing the ingest of high-velocity data streams, distributing data to all interested consumers and decoupling data sources from data processing applications and platform clients.

Kafka Connect, an open source component of Kafka, is a framework for connecting Kafka with external systems such as databases, key-value stores, search indexes, and file systems. Using Kafka Connect you can use existing connector implementations for common data sources and sinks to move data into and out of Kafka.

We want to store the events that arrive on the Kafka topics in HDFS datasets. Currently PNDA uses Apache Gobblin to ingest data streams from Kafka into HDFS.

However, Apache Gobblin has some disadvantages:

  • Apache Incubator Project.
  • Configuration complexity.
  • Disadvantages of job distribution in MapReduce (MR) mode (from the Gobblin documentation): 
    • The MR mode suffers from problems such as large overhead mostly due to the overhead of submitting and launching a MR job and poor cluster resource usage.
    • The MR mode is also fundamentally not appropriate for real-time data ingestion given its batch nature.

NOTE: Apache Gobblin recently added support for Apache YARN as an orchestrator for job distribution, which may solve some of the previous disadvantages.

Here we explore Kafka Connect as an alternative: a component of Kafka that would let us remove the YARN dependency and simplify configuration, management and scaling of the stream ingest jobs.

Getting Started

We will set up a local deployment of PNDA to test how Kafka Connect works. Follow the 'Testing locally with microk8s' instructions, using the following YAML profile to enable the Kafka Connect related components:

# Enabling Kafka and Kafka-connect related components:
cp-zookeeper:
  enabled: true
cp-kafka:
  enabled: true
cp-schema-registry:
  enabled: true
schema-registry-ui:
  enabled: true
cp-kafka-connect:
  enabled: true
kafka-connect-ui:
  enabled: true
kafka-manager:
  enabled: true
 
 
# Enabling HDFS related components:
hdfs:
  enabled: true
 
# Disabling the rest of the external components
grafana:
  enabled: false
jupyterhub:
  enabled: false
cp-kafka-rest:
  enabled: false
cp-ksql-server:
  enabled: false
hbase:
  enabled: false
opentsdb:
  enabled: false
spark-standalone:
  enabled: false

Kafka-connect-ui can be accessed through the IP of the pnda-kafka-connect-ui service:

kubectl get service -o wide -n pnda pnda-kafka-connect-ui

Access the kafka-connect-ui service IP through your browser.
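
If the service IP is not directly reachable from your host, a port-forward to the UI service is an alternative (a sketch; the local and remote ports are assumptions and may need adjusting to your chart values):

# Forward a local port to the kafka-connect-ui service, then browse http://localhost:8000
kubectl port-forward -n pnda svc/pnda-kafka-connect-ui 8000:8000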

Now we will configure some HDFS Sink Connectors to ingest events from Kafka topics into HDFS.

Following the instructions in PNDA-guide/streamingest/topic-preparation, we find different kinds of topics:

Un-configured topic

TBD

Matching schema topic

This is the case when the data being streamed to the platform already happens to adhere to the PNDA schema.

Let's divide the test in two parts, depending on how the data is encoded in the Kafka topic:

AvroEncoded

Reference: https://docs.confluent.io/current/connect/kafka-connect-hdfs/index.html

We use the kafka-avro-console-producer tool to produce some Avro-encoded data in a topic. The tool is available in the cp-schema-registry pod, so we open a bash terminal inside the cp-schema-registry container of the pod:

kubectl exec -ti svc/pnda-cp-schema-registry -- /bin/bash
 
 
JMX_PORT=5559 kafka-avro-console-producer --broker-list $SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS --topic pnda.avro.test --property value.schema='{"namespace": "pnda.entity","type": "record","name": "event","fields": [ {"name": "timestamp", "type": "long"}, {"name": "source", "type": "string"}, {"name": "rawdata", "type": "bytes"}]}'

Then paste the following lines into the producer. If a record does not respect the PNDA Avro schema, kafka-avro-console-producer will exit with an error.

{"timestamp":1567516179000,"source":"test","rawdata":"200"}
{"timestamp":1567516180000,"source":"test","rawdata":"300"}
{"timestamp":1567516181000,"source":"test","rawdata":"150"}
{"timestamp":1567516182000,"source":"test","rawdata":"100"}
{"timestamp":1567516183000,"source":"test","rawdata":"230"}
{"timestamp":1567516184000,"source":"test","rawdata":"250"}

Press CTRL+C to exit kafka-avro-console-producer and type 'exit' to close the container terminal.
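
Optionally, before configuring any connector, you can verify that the records landed in the topic with kafka-avro-console-consumer from the same container (a sketch; the JMX_PORT override mirrors the producer command above, and the Schema Registry URL assumes the default listener on port 8081):

# Read the topic back, decoding the Avro values via the Schema Registry
JMX_PORT=5560 kafka-avro-console-consumer --bootstrap-server $SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS --topic pnda.avro.test --from-beginning --property schema.registry.url=http://localhost:8081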

Go back to the kafka-connect-ui and create a new HDFS Sink connector with the following config properties:
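
A minimal set of properties for this test could look like the following sketch (the connector name, the HDFS namenode port and the Schema Registry URL are assumptions for the local deployment described above; the converter settings may already match your worker defaults):

# Hypothetical connector name for this test
name=hdfs-sink-pnda-avro-test
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=pnda.avro.test
# Assumed RPC port of the local HDFS namenode service
hdfs.url=hdfs://pnda-hdfs-namenode:8020
# Number of records to accumulate before writing a file to HDFS
flush.size=3
format.class=io.confluent.connect.hdfs.avro.AvroFormat
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=io.confluent.connect.avro.AvroConverter
# Assumed Schema Registry service URL
value.converter.schema.registry.url=http://pnda-cp-schema-registry:8081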


A new Avro file containing flush.size events should be written to HDFS. The dataset folder is named after the Kafka topic, and can be further partitioned with partitioner classes (more info in the hdfs-connector documentation).

You can check whether the files are correctly written through "Utilities" > "Browse the Filesystem" in the HDFS namenode UI, reachable at the pnda-hdfs-namenode service IP:

kubectl get service pnda-hdfs-namenode

The default port is 50070.
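
Alternatively, the same check can be done from the command line inside the namenode container (a sketch, assuming the hdfs client is available there and the connector uses its default topics.dir of /topics):

kubectl exec -ti svc/pnda-hdfs-namenode -- hdfs dfs -ls -R /topics/pnda.avro.test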

JsonEncoded

Apparently Kafka Connect does not allow writing to HDFS in Avro format if events are written to the topic in JSON format. If you set flush.size to 1, it will correctly write a different Avro file per event (which does not make much sense).

An issue describing this problem is written here.
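
For context (see the Converters and Serialization reference above), reading a JSON topic requires overriding the value converter on the connector, for example with the settings below. With schemas.enable=false the records carry no schema, which is likely what prevents the Avro format writer from handling them; this is a sketch of the relevant properties, not a working configuration for this case:

# Converter override for a topic containing plain JSON records
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false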

Avro configured topic

TBD
