Open Lineage
Features
The Open Lineage service provides Catalog and Data Lineage features to the Data Catalog UI.
In this documentation, Open Lineage mainly refers to the service we provide as a component of the Data Catalog application. Whenever we refer to the OpenLineage specification, we will provide a link to its documentation.
Data Cataloging
The service exposes APIs to:
- search and retrieve data assets
- visualize data assets details and metadata
- update data assets metadata
Under the hood, data assets are stored using an extended version of the OpenLineage Dataset format, from which this component takes its name.
Data Catalog is built upon the OpenLineage project, an open platform that provides an "open standard for lineage data collection", to allow future integration with lineage systems and other tools that support it.
We strongly believe that adopting a common standard for describing data products is fundamental to foster their exchange among parties and to make it easier to discover how these products are employed across different systems.
Data Lineage
The service exposes APIs to:
- create lineage between assets;
- create and manage virtual assets to define additional lineage definitions.
Under the hood, lineage relationships between assets follow the OpenLineage Job format, where a Job represents a process between two or more assets.
Assets like Tables or Systems of Record can be linked to other assets of the same type by defining a Job, which can be either ingoing or outgoing depending on whether the process produces or consumes the selected asset.
When a Job is created through the Open Lineage service API, it is marked as virtual since it represents a user-defined process.
Moreover, the service manages the creation and deletion of virtual assets, to extend the representation of jobs across assets that may not be defined within the Data Catalog connections.
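The relationship between a Job and the direction seen from a selected asset can be sketched as follows. This is only an illustrative model: the `Job` class, its field names, and the `direction` helper are hypothetical and are not part of the service API. It assumes that ingoing means the job produces the selected asset and outgoing means the job consumes it, following the order used in the description above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Job:
    """Hypothetical, simplified job: a process that reads inputs and writes outputs."""
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]
    # Jobs created through the service API are marked as virtual.
    virtual: bool = True


def direction(job: Job, asset_id: str) -> Optional[str]:
    """Return the direction of `job` as seen from `asset_id`.

    Assumption: "ingoing" = the job produces the asset,
                "outgoing" = the job consumes the asset.
    Returns None when the job does not involve the asset at all.
    """
    if asset_id in job.outputs:
        return "ingoing"
    if asset_id in job.inputs:
        return "outgoing"
    return None
```

For example, a job reading table `a` and writing table `b` is outgoing when `a` is selected and ingoing when `b` is selected.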
Configuration
The following paragraphs describe the Open Lineage service configuration and its communication interfaces.
Configuring Open Lineage involves setting up a ConfigMap and specifying a few environment variables.
Environment Variables
The Open Lineage service can be customized using the following environment variables:
Name | Required | Description | Default Value |
---|---|---|---|
HTTP_PORT | - | The TCP port where the HTTP controller binds its listener | 3000 |
GRPC_PORT | - | The TCP port where the gRPC controller binds its listener | 50051 |
LOG_LEVEL | - | The centralized application log level. Allowed values are debug, error, info, trace and warn | info |
OPEN_LINEAGE_CONFIGURATION_FILEPATH | - | The location of the configuration file | ~/.fd/open-lineage/config.json |
OTEL_EXPORTER_OTLP_ENDPOINT | - | The URL of the gRPC endpoint of an OpenTelemetry Collector. Specifying this value enables the support for OpenTelemetry tracing | - |
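As an illustration of how the defaults in the table combine with values set in the environment, here is a minimal sketch. It is not the service's actual startup code: the `load_settings` function and the keys of the returned dictionary are hypothetical, and only the variable names, defaults, and allowed log levels come from the table above.

```python
import os
from typing import Mapping, Optional

# Defaults taken from the table above.
DEFAULTS = {
    "HTTP_PORT": "3000",
    "GRPC_PORT": "50051",
    "LOG_LEVEL": "info",
    "OPEN_LINEAGE_CONFIGURATION_FILEPATH": "~/.fd/open-lineage/config.json",
}
LOG_LEVELS = {"debug", "error", "info", "trace", "warn"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve the documented variables, falling back to their defaults."""
    env = os.environ if env is None else env
    values = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    if values["LOG_LEVEL"] not in LOG_LEVELS:
        raise ValueError(f"unsupported LOG_LEVEL: {values['LOG_LEVEL']!r}")
    return {
        "http_port": int(values["HTTP_PORT"]),
        "grpc_port": int(values["GRPC_PORT"]),
        "log_level": values["LOG_LEVEL"],
        "config_path": os.path.expanduser(values["OPEN_LINEAGE_CONFIGURATION_FILEPATH"]),
        # No default: tracing is enabled only when the endpoint is provided.
        "otel_endpoint": env.get("OTEL_EXPORTER_OTLP_ENDPOINT"),
    }
```

Passing an explicit mapping instead of reading `os.environ` directly keeps the sketch easy to try out without touching the real environment.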
Config Map
The configuration of the service is handled by a JSON file whose location is defined by the OPEN_LINEAGE_CONFIGURATION_FILEPATH environment variable. When instantiating the Data Catalog application, the Open Lineage service configuration is generated with a dedicated Config Map, named open-lineage-config.
This file contains a template configuration that should help in configuring the service.
The raw JSON Schema of the configuration file is reported below, followed by an example configuration.
{
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "Configuration",
"description": "Open-Lineage Service configuration",
"type": "object",
"required": [
"cache",
"persistence"
],
"properties": {
"cache": {
"$ref": "#/definitions/Cache"
},
"persistence": {
"$ref": "#/definitions/Persistence"
},
"settings": {
"default": {
"apiPrefix": "",
"auditUserHeader": null
},
"allOf": [
{
"$ref": "#/definitions/Settings"
}
]
}
},
"definitions": {
"Cache": {
"oneOf": [
{
"type": "object",
"required": [
"configuration",
"type"
],
"properties": {
"configuration": {
"$ref": "#/definitions/RedisCache"
},
"type": {
"type": "string",
"enum": [
"redis"
]
}
}
}
]
},
"MongodbPersistence": {
"type": "object",
"required": [
"url"
],
"properties": {
"database": {
"description": "Optional database name. It selects which database to be employed for storing data (it overrides the one provided in the connection string when it is set)",
"default": null,
"type": [
"string",
"null"
]
},
"url": {
"description": "MongoDB connection string",
"allOf": [
{
"$ref": "#/definitions/secret"
}
]
}
}
},
"Persistence": {
"oneOf": [
{
"type": "object",
"required": [
"configuration",
"type"
],
"properties": {
"configuration": {
"$ref": "#/definitions/MongodbPersistence"
},
"type": {
"type": "string",
"enum": [
"mongodb"
]
}
}
}
]
},
"RedisCache": {
"type": "object",
"required": [
"url"
],
"properties": {
"url": {
"$ref": "#/definitions/secret"
}
}
},
"Settings": {
"description": "Service specific configurations",
"type": "object",
"properties": {
"apiPrefix": {
"description": "Prefix path that it is applied to all the exposed APIs, except from the status ones",
"default": "",
"type": "string"
},
"auditUserHeader": {
"description": "Header containing the user unique identifier",
"type": [
"string",
"null"
]
}
}
},
"secret": {
"oneOf": [
{
"type": "string"
},
{
"type": "object",
"required": [
"key",
"type"
],
"properties": {
"encoding": {
"type": "string",
"enum": [
"base64"
]
},
"key": {
"type": "string"
},
"type": {
"const": "env"
}
}
},
{
"type": "object",
"required": [
"path",
"type"
],
"properties": {
"encoding": {
"type": "string",
"enum": [
"base64"
]
},
"key": {
"type": "string"
},
"path": {
"type": "string"
},
"type": {
"const": "file"
}
}
}
]
}
}
}
{
"cache": {
"type": "redis",
"configuration": {
"url": {
"type": "file",
"path": "/run/secrets/data-fabric/open-lineage.ini",
"key": "REDIS_URL"
}
}
},
"persistence": {
"type": "mongodb",
"configuration": {
"url": {
"type": "file",
"path": "/run/secrets/data-fabric/open-lineage.ini",
"key": "MONGODB_URL"
},
"database": "data-fabric-db"
}
}
}
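Before deploying a configuration, it can be useful to check that it contains the sections the schema marks as required. The following is a minimal sketch of such a check, not a full JSON Schema validation: the `check_config` function is hypothetical and only verifies the required cache and persistence sections, their type value, and the presence of the url field.

```python
from typing import List


def check_config(config: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means the basic shape is fine."""
    errors = []
    # The schema requires both sections and currently allows a single type for each.
    for section, allowed_type in (("cache", "redis"), ("persistence", "mongodb")):
        value = config.get(section)
        if not isinstance(value, dict):
            errors.append(f"missing required section: {section}")
            continue
        if value.get("type") != allowed_type:
            errors.append(f"{section}.type must be {allowed_type!r}")
        configuration = value.get("configuration")
        if not isinstance(configuration, dict) or "url" not in configuration:
            errors.append(f"{section}.configuration.url is required")
    return errors
```

For a complete validation, the JSON Schema above can be fed to any draft-07 compliant validator.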
Search Cache
Currently only Redis is supported as search cache for storing relevant data, such as intermediate search results.
Open Lineage enables searching over datasets and their metadata, which may produce a very large result set, since table columns can be returned as well. Consequently, when a user performs a search, the system initially returns only a small result set and stores in the cache the query details needed to continue the search. Whenever more records are requested, the service loads the next batch of results on demand, leveraging the information stored in the cache.
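The mechanism described above can be sketched as follows. This is a simplified illustration and not the service's implementation: a plain dictionary stands in for Redis, the `CATALOG` list stands in for the persistence layer, and the function names and the shape of the cached state are hypothetical.

```python
import uuid
from typing import Dict, List, Optional, Tuple

# Stand-in for the data assets stored in the persistence layer.
CATALOG = ["orders", "orders.id", "orders.total", "customers", "customers.id"]


def _page(query: str, offset: int, page_size: int) -> Tuple[List[str], bool]:
    """Return one batch of matches and whether more results are available."""
    matches = [name for name in CATALOG if query in name]
    return matches[offset:offset + page_size], offset + page_size < len(matches)


def start_search(cache: Dict[str, dict], query: str, page_size: int = 2):
    """Return the first batch and, if needed, a token to request the next one."""
    results, has_more = _page(query, 0, page_size)
    token: Optional[str] = None
    if has_more:
        token = uuid.uuid4().hex
        # Only the details needed to progress the search are cached, not the results.
        cache[token] = {"query": query, "offset": page_size, "page_size": page_size}
    return results, token


def continue_search(cache: Dict[str, dict], token: str):
    """Load the next batch on demand using the state stored in the cache."""
    state = cache.pop(token)
    results, has_more = _page(state["query"], state["offset"], state["page_size"])
    next_token: Optional[str] = None
    if has_more:
        next_token = uuid.uuid4().hex
        cache[next_token] = {**state, "offset": state["offset"] + state["page_size"]}
    return results, next_token
```

Searching for `orders` with a page size of 2 returns two results and a token; passing that token to `continue_search` returns the remaining result and no further token.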
To configure the cache connection for the Open Lineage service, there is a dedicated key, named cache, in the service configuration file.
Within it you can choose the type (currently only redis) and provide the needed configuration property, whose main field is:
- url → the connection string to your cache instance.
An example of cache configuration can be seen below:
{
"cache": {
"type": "redis",
"configuration": {
"url": "redis://<redis-instance-address>:6379"
}
},
// ...other Open Lineage configurations
}
The following properties support secret resolution:
- cache.configuration.url
When instantiating the Data Catalog application, a small Redis instance is added to your project, ready to support Open Lineage operations. If you would like to use your own Redis instance, please update the generated configuration accordingly.
Datasets Persistence Layer
Currently only MongoDB is supported as persistence layer for storing relevant data.
The MongoDB database selected for storing Data Catalog data must be configured to have replicaSet enabled, since Data Catalog exploits features that can be used only when a replicaSet is available.
In order to carry out all its operations, Open Lineage requires a persistence layer where relevant information is stored. In particular, it stores data assets and their metadata.
The configuration of the persistence layer can be added to the service configuration file under the persistence property, where it is possible to select the type of database and provide the expected configuration in the dedicated field. The main properties of the latter field are:
- url → the connection string to your MongoDB instance with replicaSet enabled;
- database → the name of the database containing the collections relevant to the Open Lineage service. Please notice that setting this property overrides the database name potentially set in the connection string.
An example of persistence configuration can be seen below:
{
// ...other Open Lineage configurations
"persistence": {
"type": "mongodb",
"configuration": {
"url": "mongodb://<server>:27017/<default-database>?replicaSet=local",
"database": "<data-fabric-database-name>"
}
},
// ...other Open Lineage configurations
}
The following properties support secret resolution:
- persistence.configuration.url
- persistence.configuration.database
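The override rule for the database name can be illustrated with a short sketch. The `effective_database` helper is hypothetical and only mirrors the documented behavior: the database property, when set, takes precedence over the database name found in the connection string.

```python
from typing import Optional
from urllib.parse import urlsplit


def effective_database(url: str, database: Optional[str] = None) -> Optional[str]:
    """Return the database name that would be used for a given configuration."""
    if database:
        # The explicit property overrides the name in the connection string.
        return database
    # Otherwise fall back to the path component of the connection string, if any.
    path = urlsplit(url).path.lstrip("/")
    return path or None
```

With the connection string `mongodb://server:27017/default-db?replicaSet=local`, the result is `default-db` unless the database property is also provided.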
Service Settings
Additionally, the Open Lineage service has a set of properties for changing its behavior. The ones available within the settings property are:
- apiPrefix → the base path applied to all the exposed routes, except the status ones. It defaults to /;
- auditUserHeader → specifies which HTTP header contains the user identifier set by the authentication system. The value of this header is employed to correlate requests stored by the auditing system with the user that performed them. When using Mia-Platform Authentication and Authorization services, this property can be set to miauserid. In case it is not set, the auditing system does not correlate users with requests.
A configuration example is shown below:
{
// ...other Open Lineage configurations
"settings": {
"apiPrefix": "/",
"auditUserHeader": "miauserid"
}
}
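To make the effect of these two settings more concrete, here is a minimal sketch. Both helpers are hypothetical and do not reflect the service's internals; in particular, treating an empty prefix and / as equivalent and matching the header name case-insensitively are assumptions made for illustration.

```python
from typing import Mapping, Optional


def prefixed(api_prefix: str, route: str) -> str:
    """Join the configured prefix and a route, normalizing slashes.

    Assumption: an empty prefix and "/" both mean "no prefix".
    """
    prefix = api_prefix.strip("/")
    route = route.lstrip("/")
    return f"/{prefix}/{route}" if prefix else f"/{route}"


def audit_user(headers: Mapping[str, str], audit_user_header: Optional[str]) -> Optional[str]:
    """Extract the user identifier, or None when the setting or the header is missing."""
    if not audit_user_header:
        return None  # no correlation between users and requests
    wanted = audit_user_header.lower()
    for name, value in headers.items():
        if name.lower() == wanted:  # HTTP header names are case-insensitive
            return value
    return None
```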
Enable gRPC communications
Some of the requests exchanged between the Fabric BFF and Open Lineage services are performed through gRPC.
Thus, the Open Lineage service must advertise the port where the gRPC controller is exposed, which is 50051 by default.
This can be achieved by adding the proper port to the list of Container Ports, which can be found in the Console Design area under the specific microservice resource. The expected list, based on the default configuration (HTTP port 3000 and gRPC port 50051), is shown in the image below.
When instantiating the Data Catalog application, Container Ports are pre-filled with all the needed ports using their default values.
If either the HTTP or the gRPC port has been changed through environment variables, please update the Container Ports accordingly.
Endpoints
Since Open Lineage service only communicates internally with Fabric BFF within the same namespace, it is not necessary to publicly expose any endpoint on the service.
Routes
This section describes the routes that the Open Lineage service serves through the corresponding endpoints of the Fabric BFF microservice.
Data Catalog routes
The following routes are exposed over the /api/data-catalog endpoint.
Endpoint | Type | Method | Description |
---|---|---|---|
/assets/search | REST | GET | Search for dataset assets and their metadata |
/assets/search-parents | REST | GET | Search for the name of a System of Record or Table |
/bulk-actions | Websocket | GET | Route for handling async operations over multiple datasets records |
/tags/count | REST | GET | Count how many unique tags exist among all data assets |
/tags/items | REST | GET | List existing tags associated to data assets |
/tags/search | REST | GET | Search for a specific tag value |
/sors/:dataset-id | REST | GET | Retrieve selected System of Record details and metadata |
/sors/:dataset-id/custom-properties/:name | REST | POST | Create a custom property for selected System of Record |
/sors/:dataset-id/custom-properties/:name | REST | PATCH | Change a custom property value for selected System of Record |
/sors/:dataset-id/custom-properties/:name | REST | DELETE | Remove a custom property for selected System of Record |
/sors/:dataset-id/description | REST | PATCH | Change the description associated to selected System of Record |
/sors/:dataset-id/tags | REST | PATCH | Change the list of tags associated to selected System of Record |
/tables/:dataset-id | REST | GET | Retrieve selected Table details and metadata |
/tables/:dataset-id/custom-properties/:name | REST | POST | Create a custom property for selected Table |
/tables/:dataset-id/custom-properties/:name | REST | PATCH | Change a custom property value for selected Table |
/tables/:dataset-id/custom-properties/:name | REST | DELETE | Remove a custom property for selected Table |
/tables/:dataset-id/description | REST | PATCH | Change the description associated to selected Table |
/tables/:dataset-id/tags | REST | PATCH | Change the list of tags associated to selected Table |
/columns/:dataset-id/:field-name | REST | GET | Retrieve selected Column details and metadata |
/columns/:dataset-id/:field-name/custom-properties/:name | REST | POST | Create a custom property for selected Column |
/columns/:dataset-id/:field-name/custom-properties/:name | REST | PATCH | Change a custom property value for selected Column |
/columns/:dataset-id/:field-name/custom-properties/:name | REST | DELETE | Remove a custom property for selected Column |
/columns/:dataset-id/:field-name/description | REST | PATCH | Change the description associated to selected Column |
/columns/:dataset-id/:field-name/tags | REST | PATCH | Change the list of tags associated to selected Column |
/metadata-registry/count | REST | GET | Count how many custom property definitions exist |
/metadata-registry/items | REST | GET | List custom property definitions |
/metadata-registry/items | REST | POST | Create a custom property definition |
/metadata-registry/items/:name | REST | GET | Retrieve the definition details of a custom property |
/metadata-registry/items/:name | REST | PATCH | Change the definition of a custom property (e.g. type or description) |
/metadata-registry/items/:name | REST | DELETE | Remove a custom property definition |
/metadata-registry/search | REST | GET | Search for a specific metadata registry of interest |
Open Lineage routes
The following routes are exposed over the /api/open-lineage endpoint.
Endpoint | Type | Method | Description |
---|---|---|---|
/jobs/from/system/:id/direction/:direction | REST | GET | Retrieve the list of jobs of a given System of Record. Jobs direction can be either ingoing or outgoing |
/jobs/from/dataset/:id/direction/:direction | REST | GET | Retrieve the list of jobs of a given Table. Jobs direction can be either ingoing or outgoing |
/jobs/items/:id | REST | DELETE | Delete a job between assets |
/jobs/items/:id | REST | PATCH | Update the details of a job |
/jobs/items | REST | POST | Create a job between assets |
/datasets/items/:id | REST | GET | Retrieve the lineage details of a dataset |
/datasets/items/:id | REST | PATCH | Update the details of a virtual dataset |
/datasets/items/:id | REST | DELETE | Delete a virtual dataset |
/dataset/items | REST | POST | Create a virtual dataset |
/facets/storage/items/:id | REST | GET | Return the OpenLineage specification of a virtual System of Record |
/facets/storage/items/:id | REST | DELETE | Delete a virtual System of Record |
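As an example of how a client could compose the paths of the job listing routes in the table above, here is a small sketch. The `jobs_path` helper is hypothetical; only the route templates, the /api/open-lineage base path, and the two allowed directions come from this page.

```python
from urllib.parse import quote

OPEN_LINEAGE_BASE = "/api/open-lineage"
DIRECTIONS = ("ingoing", "outgoing")
SOURCES = ("system", "dataset")  # System of Record or Table


def jobs_path(source: str, asset_id: str, direction: str) -> str:
    """Build the path of the route listing the jobs of an asset in a given direction."""
    if source not in SOURCES:
        raise ValueError(f"source must be one of {SOURCES}")
    if direction not in DIRECTIONS:
        raise ValueError(f"direction must be one of {DIRECTIONS}")
    # Percent-encode the identifier so it is safe to use as a single path segment.
    encoded_id = quote(asset_id, safe="")
    return f"{OPEN_LINEAGE_BASE}/jobs/from/{source}/{encoded_id}/direction/{direction}"
```

Validating the direction on the client side avoids sending requests that the route definition cannot match.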
Migration Guides
Here are the migration steps to perform between each version of the service.
From 0.1.x to 0.2.0
The following steps need to be followed:
- Run the fabric_admin:0.2.0 cronjob to update the schemas of the persistence layer collections;
- Add the /open-lineage/* endpoints to your project's configuration, along with their permissions. See the Routes paragraph for more details. Remember also to update the users management microfrontend to support the new permission update:lineage.