The first observation is that we must be able to identify the data family associated with a dataset. This translates into a requirement to uniformly extract the data family from any given dataset. For this reason we standardize on Avro serialization and store the data family (irrespective of the input serialization type) in the Avro metadata header pnda.family_id, whose value is the family id that matched the configuration of the topic producing the input data.
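As a minimal sketch of this mechanism, the snippet below models an Avro container file's header metadata as a plain dict; a real implementation would set the same key through an Avro library's writer metadata, but the file model here is an illustrative assumption, not the actual ingest code.

```python
# Sketch: tagging a dataset with its data family via file-level
# header metadata, using a simplified in-memory stand-in for an
# Avro object container file.

FAMILY_HEADER_KEY = "pnda.family_id"  # metadata key from the design above


def write_dataset(records, family_id):
    """Bundle records together with the family id in file metadata."""
    return {
        "metadata": {FAMILY_HEADER_KEY: family_id},
        "records": list(records),
    }


def extract_family_id(dataset):
    """Uniformly recover the data family from any dataset."""
    return dataset["metadata"][FAMILY_HEADER_KEY]


ds = write_dataset([b"\x01\x02", b"\x03"], family_id="netflow")
print(extract_family_id(ds))  # -> netflow
```

Because the family id lives in the header rather than in each record, consumers can identify the family without parsing any record payloads.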
Option 1 (consistency):
This option consists in ensuring a consistent Avro envelope (with the same schema required for PNDA ingest in release 4.0 and prior releases) across the datasets. The envelope:
- contains the input data bytes (protobuf, avro, ...) in the raw data field (unless the input data already satisfies the output schema, in which case the data is not wrapped).
- uses an Avro metadata header field to indicate the record's origin: pnda.record.content, with value: compliant | wrapped
- contains a field with the timestamp (either extracted from the protobuf or generated from the ingest time).
- uses an Avro metadata header field to indicate the field's origin: pnda.field.timestamp.origin, with value: extracted | ingest
- contains a field with the source (either extracted from the protobuf or generated from the topic name).
- uses an Avro metadata header field to indicate the field's origin: pnda.field.source.origin, with value: extracted | topic
Pros:
- simplicity: allows reusing the existing time-based partitioning code from Gobblin (which is based on an Avro record containing a timestamp field)
- consistency in the schema across all generated datasets
Cons:
- duplicates data already contained in the protobuf
- requires parsing the metadata header to know the origin of the timestamp/source
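The wrapping logic of this option can be sketched as follows. The envelope field names (rawdata, timestamp, src) and the record/metadata model are illustrative assumptions standing in for the real PNDA Avro schema and writer, but the metadata keys and origin values are those defined above.

```python
import time

# Sketch of the Option 1 (consistency) wrapping logic: every record
# is placed in the envelope, and the Avro header metadata records
# where each envelope field came from.

def wrap_record(raw_bytes, extracted_ts=None, extracted_src=None,
                topic="unknown-topic", already_compliant=False):
    """Build the envelope record plus the header metadata describing
    each field's origin."""
    metadata = {
        "pnda.record.content":
            "compliant" if already_compliant else "wrapped",
        "pnda.field.timestamp.origin":
            "extracted" if extracted_ts is not None else "ingest",
        "pnda.field.source.origin":
            "extracted" if extracted_src is not None else "topic",
    }
    record = {
        "rawdata": raw_bytes,
        # Fall back to ingest time (ms) when no timestamp was extracted.
        "timestamp": extracted_ts if extracted_ts is not None
                     else int(time.time() * 1000),
        # Fall back to the topic name when no source was extracted.
        "src": extracted_src if extracted_src is not None else topic,
    }
    return record, metadata


rec, meta = wrap_record(b"\x0a\x02hi", topic="netflow.raw")
print(meta["pnda.field.source.origin"])  # -> topic
```

Note that the origin information lives only in the header metadata, which is why consumers must read the header to know whether the timestamp came from the data or from ingest time.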
Option 2 (optimized):
The optimized option consists in avoiding the field duplication introduced by option 1.
1) If the input schema has no timestamp mapping, then a timestamp will be generated by the system (based on ingest time) and stored in an envelope. Hence one requirement for the Avro schema is the ability to store an optional timestamp (i.e. present when generated by the system).
2) If the input serialization is not Avro, then we need to store the data in its raw format; hence we have a requirement for an envelope that contains raw data.
These three requirements (Avro serialization, a mandatory raw data field, and an optional timestamp field) can be satisfied with the schema on the bottom right of the decision tree below.
Now, when the input schema already satisfies these three requirements (as is the case for legacy systems), there is no point in wrapping everything in an extra envelope. In that case the pnda.entity schema is not used as an envelope; instead, the input schema is reused as the output schema.
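The decision tree for this option can be sketched as below. The schema-inspection helper and the field names it checks are hypothetical placeholders for real Avro schema checks, not the actual PNDA implementation.

```python
# Sketch of the Option 2 (optimized) decision: wrap only when the
# input does not already satisfy the three requirements.

def satisfies_requirements(input_schema):
    """True when the input is already Avro-serialized and carries
    a raw-data field and a timestamp field (the pnda.entity shape).
    The dict shape used here is an illustrative assumption."""
    fields = input_schema.get("fields", [])
    return (
        input_schema.get("serialization") == "avro"
        and "rawdata" in fields
        and "timestamp" in fields
    )


def choose_output_schema(input_schema):
    """Reuse the input schema when compliant, otherwise envelope it."""
    if satisfies_requirements(input_schema):
        return "pass-through"           # no extra envelope, no duplication
    return "wrap-in-pnda.entity"        # raw bytes + optional ingest timestamp


legacy = {"serialization": "avro",
          "fields": ["timestamp", "src", "rawdata"]}
print(choose_output_schema(legacy))                     # -> pass-through
print(choose_output_schema({"serialization": "protobuf"}))  # -> wrap-in-pnda.entity
```

This pass-through branch is exactly what avoids the duplication of option 1, at the cost of consumers having to handle two possible output shapes.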
Pros:
- no field duplication caused by the system
Cons:
- requires more code to implement and maintain
- more complexity in dataset usage
The current choice is to go with the consistency option, to keep the ingest mechanism simple for applications consuming the datasets.