The PNDA ingest system stores data according to two fields extracted from each event: its source and its timestamp.
At ingest time, the system therefore needs enough configuration to start a parser that can extract those fields from every event on a given topic.
For this reason, each topic needs to be associated with a converter class and the configuration for that converter.
Because topics are transient, a level of indirection is provided that ensures a unique persistent id (family_id) is always associated with the information needed to instantiate that parser.
So that data can later be consumed back from PNDA, this unique family_id and its associated parser configuration are not only persisted across topic deletion but also stored in (the avro metadata header of) each dataset produced by the PNDA ingest system.
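The indirection described above can be pictured as two separate mappings: one from family_id to parser configuration, which is persistent, and one from topic to family_id, which is transient. The following is a minimal sketch only. The class names, method names, converter class string and configuration key are all hypothetical and are not taken from PNDA.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ParserConfig:
    """Hypothetical parser description: a converter class plus its settings."""
    converter_class: str
    converter_config: Dict[str, str] = field(default_factory=dict)


class FamilyRegistry:
    """Keeps family_id -> parser config separate from topic -> family_id."""

    def __init__(self) -> None:
        self._families: Dict[str, ParserConfig] = {}
        self._topic_to_family: Dict[str, str] = {}

    def register_family(self, family_id: str, config: ParserConfig) -> None:
        self._families[family_id] = config

    def bind_topic(self, topic: str, family_id: str) -> None:
        if family_id not in self._families:
            raise KeyError(f"unknown family_id: {family_id}")
        self._topic_to_family[topic] = family_id

    def delete_topic(self, topic: str) -> None:
        # Only the transient binding goes away; the family record is kept.
        self._topic_to_family.pop(topic, None)

    def family_for_topic(self, topic: str) -> Optional[str]:
        return self._topic_to_family.get(topic)

    def parser_config(self, family_id: str) -> ParserConfig:
        return self._families[family_id]


# Illustrative values only (assumptions, not real PNDA identifiers).
registry = FamilyRegistry()
registry.register_family(
    "family-1",
    ParserConfig("example.ProtobufConverter", {"timestamp.field": "ts"}),
)
registry.bind_topic("telemetry.raw", "family-1")
registry.delete_topic("telemetry.raw")
```

After the topic is deleted, the family_id still resolves to its parser configuration, which is the property a consumer relies on when reading a dataset back.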
Output schema selection
The first observation is that we need to be able to identify the data family associated with a dataset. This translates into a requirement to extract the data family uniformly from any given dataset. For this reason we standardize on avro serialization and store the data family (irrespective of the input serialization type) in the avro metadata header under the key pnda.family_id, whose value is the family id matching the configuration of the topic that produced the input data.
This proposal consists of ensuring a consistent avro envelope in the datasets, one that:
- contains the input data bytes (protobuf, avro, ...) in the raw data field (unless the input data already satisfies the output schema, in which case the data is not wrapped).
- contains a field with the timestamp (either extracted from the protobuf or generated from the ingest time).
  - uses an avro metadata header field to indicate the field's origin: pnda.field.timestamp.extracted, with value true | false
- contains a field with the source (either extracted from the protobuf or generated from the topic name).
  - uses an avro metadata header field to indicate the field's origin: pnda.field.source.extracted, with value true | false
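As a rough illustration of the envelope described above, the sketch below builds one record and the matching header metadata using plain Python dictionaries. The header keys come from the text. The record field names (timestamp, source, rawdata), the millisecond unit, and the choice of using the topic name verbatim as the generated source are assumptions. The case where the input already satisfies the output schema is not covered.

```python
import time
from typing import Dict, Optional


def build_envelope(
    raw: bytes,
    topic: str,
    timestamp_ms: Optional[int] = None,
    source: Optional[str] = None,
) -> dict:
    """Wrap raw input bytes; fall back to ingest time and topic name."""
    return {
        # Assumed field names and units, for illustration only.
        "timestamp": timestamp_ms if timestamp_ms is not None else int(time.time() * 1000),
        "source": source if source is not None else topic,
        "rawdata": raw,
    }


def build_header_metadata(
    family_id: str, timestamp_extracted: bool, source_extracted: bool
) -> Dict[str, str]:
    """Header entries recording the family and the origin of each field."""
    return {
        "pnda.family_id": family_id,
        "pnda.field.timestamp.extracted": "true" if timestamp_extracted else "false",
        "pnda.field.source.extracted": "true" if source_extracted else "false",
    }


# Timestamp was extracted by the parser, source was not.
record = build_envelope(b"\x08\x01", topic="telemetry.raw", timestamp_ms=1500000000000)
header = build_header_metadata("family-1", timestamp_extracted=True, source_extracted=False)
```

A consumer can then read the header first and decide whether the timestamp and source fields reflect the event itself or were filled in at ingest.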
Output path selection
The event source allows both for multiplexing events from the same source across different topics while still storing the datasets under the same sub directory, and for having events from multiple sources flow through a common topic while still demultiplexing them during ingest.
In some scenarios this may not be desired, and the PNDA ingest system allows this source mapping to be left undefined. The events of a given topic are then not demultiplexed but stored under the same sub directory.
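The path selection rule can be sketched as a single function that prefers the extracted source and falls back to the topic name when no source mapping is defined. The function name and the `source=<value>` directory layout are assumptions made for illustration, not a statement of the actual PNDA layout.

```python
from typing import Optional


def dataset_subdir(topic: str, source: Optional[str] = None) -> str:
    """Pick the sub directory from the event source, else from the topic."""
    key = source if source is not None else topic
    # Assumed naming convention for the sub directory.
    return f"source={key}"


# Same source arriving on two topics lands in one sub directory.
same_source = {dataset_subdir("topic-a", "router-1"), dataset_subdir("topic-b", "router-1")}

# Two sources sharing one topic are demultiplexed.
shared_topic = {dataset_subdir("topic-a", "router-1"), dataset_subdir("topic-a", "router-2")}

# No source mapping: every event of the topic shares a sub directory.
unmapped = dataset_subdir("topic-a")
```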
As discussed above, the system is able to ingest data even when the event source and timestamp cannot be extracted. This situation typically occurs when data flows through a topic that has not been configured with parser information.
If a topic has not been configured with parser information prior to data ingest, the PNDA system will automatically generate a family_id and associate it with the topic that produced the event. This ensures that generated datasets always have a family_id defined in the avro metadata header, even though the topic was not configured with parser information at ingest time, and that they can still be differentiated from datasets originating from other un-configured topics.
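The fallback for un-configured topics might look like the sketch below: look up the topic, and if nothing is bound to it, generate a fresh id and remember the association. The function name, the `auto-` prefix and the use of a random UUID are assumptions. The text does not say how PNDA generates the id.

```python
import uuid
from typing import Dict


def resolve_family_id(topic: str, topic_to_family: Dict[str, str]) -> str:
    """Return the topic's family_id, generating and storing one if missing."""
    family_id = topic_to_family.get(topic)
    if family_id is None:
        # Assumed id format; any scheme that is unique per topic would do.
        family_id = f"auto-{uuid.uuid4()}"
        topic_to_family[topic] = family_id
    return family_id


bindings = {"configured.topic": "family-1"}
configured = resolve_family_id("configured.topic", bindings)
first = resolve_family_id("unconfigured.a", bindings)
second = resolve_family_id("unconfigured.b", bindings)
```

Because the generated id is stored against the topic, repeated events from the same un-configured topic share one family_id, while different un-configured topics remain distinguishable.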