Open Lineage
Open Lineage service allows to:
- search and retrieve data assets
- visualize data assets details and metadata
- update data assets metadata
Under the hood, data assets are stored adopting an extended version of OpenLineage Dataset format,
from which we borrowed the name for this component.
It has been decided to build Data Catalog upon OpenLineage project, which is an open platform that provides
an "open standard for lineage data collection", to enable the possibility for future integration with lineage systems and other tools supporting it.
In fact, we strongly believe that adopting a common standard for sharing data products description will be fundamental
to foster data products exchange among parties and ease discovering how these products are employed across the different systems.
In the following documentation, when we reference OpenLineage, we mainly intend the service we provide as a component of Data Catalog application. Whenever we refer to OpenLineage project or standard, we will provide a link to its documentation.
In the following paragraphs are described Open Lineage service configuration and communication interfaces.
Configuration
Configuration of Open Lineage is a straightforward process that involves setting up a ConfigMap and specifying essential environment variables.
Environment Variables
Control Plane service can be customized using the following environment variables:
Name | Required | Description | Default Value |
---|---|---|---|
HTTP_PORT | - | This variable determines the TCP port where the HTTP controller binds its listener | 3000 |
GRPC_PORT | - | This variable determines the TCP port where the gRPC controller binds its listener | 50051 |
LOG_LEVEL | - | Specify the centralized application log level, choosing from options such as debug , error , info , trace or warn | info |
OPEN_LINEAGE_CONFIGURATION_FILEPATH | - | Set the location of the configuration file | ~/.fd/open-lineage/config.json |
OTEL_EXPORTER_OTLP_ENDPOINT | - | The URL to a GRPC endpoint of an OpenTelemetry Collector. Specifying this value enables the support for OpenTelemetry tracing |
Config Map
The configuration of the service is handled by a JSON file whose location is defined by the OPEN_LINEAGE_CONFIGURATION_FILEPATH
. When instantiating
Data Catalog application, Open Lineage service configuration is generated with
a dedicated Config Map, named open-lineage-config
.
This file contains a template configuration that should help in configuring the service.
- Schema Viewer
- Raw JSON Schema
- Example
{
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "Configuration",
"description": "Open-Lineage Service configuration",
"type": "object",
"required": [
"cache",
"persistence"
],
"properties": {
"cache": {
"$ref": "#/definitions/Cache"
},
"persistence": {
"$ref": "#/definitions/Persistence"
},
"settings": {
"default": {
"apiPrefix": "",
"auditUserHeader": null
},
"allOf": [
{
"$ref": "#/definitions/Settings"
}
]
}
},
"definitions": {
"Cache": {
"oneOf": [
{
"type": "object",
"required": [
"configuration",
"type"
],
"properties": {
"configuration": {
"$ref": "#/definitions/RedisCache"
},
"type": {
"type": "string",
"enum": [
"redis"
]
}
}
}
]
},
"MongodbPersistence": {
"type": "object",
"required": [
"url"
],
"properties": {
"database": {
"description": "Optional database name. It selects which database to be employed for storing data (it overrides the one provided in the connection string when it is set)",
"default": null,
"type": [
"string",
"null"
]
},
"url": {
"description": "MongoDB connection string",
"allOf": [
{
"$ref": "#/definitions/secret"
}
]
}
}
},
"Persistence": {
"oneOf": [
{
"type": "object",
"required": [
"configuration",
"type"
],
"properties": {
"configuration": {
"$ref": "#/definitions/MongodbPersistence"
},
"type": {
"type": "string",
"enum": [
"mongodb"
]
}
}
}
]
},
"RedisCache": {
"type": "object",
"required": [
"url"
],
"properties": {
"url": {
"$ref": "#/definitions/secret"
}
}
},
"Settings": {
"description": "Service specific configurations",
"type": "object",
"properties": {
"apiPrefix": {
"description": "Prefix path that it is applied to all the exposed APIs, except from the status ones",
"default": "",
"type": "string"
},
"auditUserHeader": {
"description": "Header containing the user unique identifier",
"type": [
"string",
"null"
]
}
}
},
"secret": {
"oneOf": [
{
"type": "string"
},
{
"type": "object",
"required": [
"key",
"type"
],
"properties": {
"encoding": {
"type": "string",
"enum": [
"base64"
]
},
"key": {
"type": "string"
},
"type": {
"const": "env"
}
}
},
{
"type": "object",
"required": [
"path",
"type"
],
"properties": {
"encoding": {
"type": "string",
"enum": [
"base64"
]
},
"key": {
"type": "string"
},
"path": {
"type": "string"
},
"type": {
"const": "file"
}
}
}
]
}
}
}
{
"cache": {
"type": "redis",
"configuration": {
"url": {
"type": "file",
"path": "/run/secrets/data-fabric/open-lineage.ini",
"key": "REDIS_URL"
}
}
},
"persistence": {
"type": "mongodb",
"configuration": {
"url": {
"type": "file",
"path": "/run/secrets/data-fabric/open-lineage.ini",
"key": "MONGODB_URL"
},
"database": "data-fabric-db"
}
}
}
Search Cache
Currently only Redis is supported as search cache for storing relevant data, such as intermediate search results.
Open Lineage enables searching over datasets and their metadata, which may lead to a very large result set, considering that also table's columns can be returned. Consequently, when a user performs a search on the system, at first the system only provides a small result set while storing on a cache the query details that are useful to progress the search operation. Whenever more records are requested, the service loads on-demand the next batch of results leveraging the information stored in the cache.
To configure the cache connection for Open Lineage service there exists a dedicated key, named cache
, in the service configuration file.
Within it there is the possibility to choose the type
(currently only redis
) and provide the needed configuration
property, whose main fields are:
url
→ the connection string to your cache instance;
An example of cache configuration can be seen below:
{
"cache": {
"type": "redis",
"configuration": {
"url": "redis://<redis-instance-address>:6379"
}
},
// ...other control plane configurations
}
The following properties support secret resolution:
cache.configuration.url
When instantiating Data Catalog application, a small Redis instance is added to your project, ready for supporting Open Lineage operations. In case you would like to adopt your own Redis instance, please update the generated configuration accordingly.
Datasets Persistence Layer
Currently only MongoDB is supported as persistence layer for storing relevant data.
The MongoDB database selected for storing Data Catalog data must be configured to have replicaSet
enabled,
since Data Catalog exploits features that can be used only when a replicaSet
is available.
In order to carry out all its operations, Open Lineage requires a persistence layer where relevant information are stored. In particular, it stores data assets and their metadata.
The configuration of persistence layer can be added in the service configuration filer under the property persistence
, where it is possible to
select the type
of database and provide the expected configuration
in the dedicated field. Its main properties of the latter field are:
url
→ the connection string to your MongoDB instance withreplicaSet
enabled;database
→ the database name where to search for the collections relevant to Open Lineage service. Please notice that setting this property will override the database name potentially set in the connection string;
An example of persistence configuration can be seen below:
{
// ...other control plane configurations
"persistence": {
"type": "mongodb",
"configuration": {
"url": "mongodb://<server>:27017/<default-database>?replicaSet=local",
"database": "<data-fabric-database-name>"
}
},
// ...other control plane configurations
}
The following properties support secret resolution:
persistence.configuration.url
persistence.configuration.database
Service Settings
Additionally, the Open Lineage service itself has a set of properties for changing its behavior. Here are listed the available ones within settings
properties:
apiPrefix
→ the base path applied to all the exposed routes. It defaults to/
;auditUserHeader
→ specifies in which HTTP header can be found the user identifier set by the authentication system. The value of this header will be employed to correlate requests stored by the auditing system with the user that performed them. When using Mia-Platform Authentication and Authorization services this property can be set tomiauserid
.
In case it is not set the auditing system does not correlate users with requests;
Here can be found a configuration example:
{
// ...other control plane configurations
"settings": {
"apiPrefix": "/",
"auditUserHeader": "miauserid"
}
}
Enable gRPC communications
Some the request exchanged between Fabric BFF and Open Lineage services are performed through gRPC.
Thus, on Open Lineage service is necessary to advertise the port where the gRPC controller is exposed, which by default is the 50051
.
This operation can be achieved by adding the proper port to the list of Container Ports
that can be found in the Console Design area, under the specific microservice resource. The expected list,
based on default configuration, is shown below in the image.
When instantiating Data Catalog application, Container Ports are pre-filled with all the needed ports using their default value.
In case either the HTTP or the gRPC port chosen through environment variables has been edited, please change the Container Ports accordingly.
Endpoints
Since Open Lineage service only communicates internally with Fabric BFF within the same namespace, it is not necessary to publicly expose any endpoint on the service.
Routes
Here are described which routes Open Lineage service serves:
Endpoint | Type | Method | Description |
---|---|---|---|
/assets/search | REST | GET | Search for dataset assets and their metadata |
/assets/search-parents | REST | GET | Search for name of system of record or table name |
/tags/count | REST | GET | Count how many unique tags exists among all data assets |
/tags/items | REST | GET | List existing tags associated to data assets |
/tags/search | REST | GET | Search for a specific tag value |
/sors/:dataset-id | REST | GET | Retrieve selected System of Record details and metadata |
/sors/:dataset-id/custom-properties/:name | REST | POST | Create a custom property for selected System of Record |
/sors/:dataset-id/custom-properties/:name | REST | PATCH | Change a custom property value for selected System of Record |
/sors/:dataset-id/custom-properties/:name | REST | DELETE | Remove a custom property for selected System of Record |
/sors/:dataset-id/description | REST | PATCH | Change the description associated to selected System of Record |
/sors/:dataset-id/tags | REST | PATCH | Change the list of tags associated to selected System of Record |
/tables/:dataset-id | REST | GET | Retrieve selected Table details and metadata |
/tables/:dataset-id/custom-properties/:name | REST | POST | Create a custom property for selected Table |
/tables/:dataset-id/custom-properties/:name | REST | PATCH | Change a custom property value for selected Table |
/tables/:dataset-id/custom-properties/:name | REST | DELETE | Remove a custom property for selected Table |
/tables/:dataset-id/description | REST | PATCH | Change the description associated to selected Table |
/tables/:dataset-id/tags | REST | PATCH | Change the list of tags associated to selected Table |
/columns/:dataset-id/:field-name | REST | GET | Retrieve selected Column details and metadata |
/columns/:dataset-id/:field-name/custom-properties/:name | REST | POST | Create a custom property for selected Column |
/columns/:dataset-id/:field-name/custom-properties/:name | REST | PATCH | Change a custom property value for selected Column |
/columns/:dataset-id/:field-name/custom-properties/:name | REST | DELETE | Remove a custom property for selected Column |
/columns/:dataset-id/:field-name/description | REST | PATCH | Change the description associated to selected Column |
/columns/:dataset-id/:field-name/tags | REST | PATCH | Change the list of tags associated to selected Column |
/metadata-registry/count | REST | GET | Count how many custom properties definition exists |
/metadata-registry/items | REST | GET | List custom properties definition |
/metadata-registry/items | REST | POST | Create a custom property definition |
/metadata-registry/items/:name | REST | GET | Retrieve the definition details of a custom property |
/metadata-registry/items/:name | REST | PATCH | Change the definition of a custom property (e.g. type or description) |
/metadata-registry/items/:name | REST | DELETE | Remove a custom property definition |
/metadata-registry/search | REST | GET | Search for a specific metadata registry of interest |
Migration Guides
Here are the migration steps to perform between each version of the service.
From 0.1.0 to 0.1.1
Configure and execute the Fabric Sync cronjob before releasing the service.