# Configuration

Due to licensing, the Data Catalog Agent must be packaged with third-party software, unless your datasources are PostgreSQL databases, whose support is already included in the Data Catalog Agent docker image.
If you only need PostgreSQL support, you can skip to the cronjob configuration section.
## Datasource support

Upon creation, Mia's Data Catalog marketplace item also creates a git repository containing:

- a CI config file
- a `Dockerfile`

The `Dockerfile` provides several build steps to include the datasource support you need. Supported datasources are:

- Oracle Database from 11 to 21
- PostgreSQL from 9.6 to 16
- MySQL from 5.7 to 8.2
- MS SQL Server from 2017 to 2022
Upon inspection you'll notice that the `Dockerfile` contains a target for each database:

```dockerfile
FROM ... AS data-catalog-agent-oracle
# rest of the file

FROM ... AS data-catalog-agent-mysql
# and so on...
```
Each push on the default branch, or any tag starting with `v`, creates a new docker image including the selected support. The default support is `postgres`, but it can be changed:

- by editing the `.gitlab-ci.yml` file, overriding the initial value of the `DOCKER_TARGET` variable from `postgres` to the database you need to support
- as a one-off change, by manually triggering the CI and overriding the same `DOCKER_TARGET` variable

Your repository contains a `README.md` file which also lists the operations needed to include drivers for `oracle`, `mysql`, and `mssql` databases.
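As a sketch, the permanent override in `.gitlab-ci.yml` could look like the following (the exact layout of the variables block in your generated file may differ):

```yaml
variables:
  # build the image with Oracle support instead of the default `postgres`
  DOCKER_TARGET: "oracle"
```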
### Oracle Database: add support

In the template repository, the following actions are needed to enable Oracle Database support:

- edit the value of the `DOCKER_TARGET` variable in `.gitlab-ci.yml` to `oracle`, or alternatively select it on a manual run
- embed oracle drivers (see embed drivers)
- [optional] add `tnsnames.ora` and/or an `Oracle Wallet` (see add tns/wallet)
#### Embed drivers

In the template repository, download the drivers from the official page according to the version you need.
The following assets are needed:

- ODBC driver
- instantclient basiclite

```shell
# on unix
export ORACLE_CLIENT_VERSION="<VERSION>"
unzip -j instantclient-basiclite-linux.x64-${ORACLE_CLIENT_VERSION}dbru.zip "instantclient_*/lib*" -d lib/x86_64-unknown-linux-gnu/oracle
unzip -j instantclient-odbc-linux.x64-${ORACLE_CLIENT_VERSION}dbru.zip "instantclient_*/lib*" -d lib/x86_64-unknown-linux-gnu/oracle
```

then create a symlink:

```shell
export LIBSQORA_NAME=`ls lib/x86_64-unknown-linux-gnu/oracle | grep -E -i '^libsqora\.so\.'`
ln -s "${LIBSQORA_NAME}" lib/x86_64-unknown-linux-gnu/oracle/libsqora.so
```
#### Add TNS support

Oracle connectivity can be handled with `tnsnames.ora` files and/or wallets. The container expects such assets in the `/home/agent/oracle/admin` folder. The `TNS_ADMIN` environment variable is already set to `/home/agent/oracle/admin`, and the container will copy the content of the `oracle/admin` folder of this repository.
Drop your wallet assets or `tnsnames.ora` file in `oracle/admin` to include them.
In case you want to use a wallet, remember to tune the `sqlora.net` file, registering the `/home/agent/oracle/admin` path:

```
WALLET_LOCATION = (SOURCE = (METHOD = file) (METHOD_DATA = (DIRECTORY="/home/agent/oracle/admin")))
SSL_SERVER_DN_MATCH=yes
```
### MySQL: add support

In the template repository, the ODBC driver for MySQL must be downloaded from the official website, selecting:

- version `8.3.0`
- OS `Linux - Generic`
- OS version `glibc 2.28 x86_64`

then run:

```shell
export MYSQL_NAME=mysql-connector-odbc-8.3.0-linux-glibc2.28-x86-64bit
tar -xvf ${MYSQL_NAME}.tar.gz ${MYSQL_NAME}/lib/libmyodbc8w.so
mv ${MYSQL_NAME}/lib/* lib/x86_64-unknown-linux-gnu/mysql
```
### MS SQL Server: add support

No further actions besides editing `.gitlab-ci.yml` are required.
### Mia CRUD Service: add support

By default, running the CI of this project produces an artifact with Mia CRUD Service support, since it does not require any specific drivers.
### Salesforce SObjects API: add support

By default, running the CI of this project produces an artifact with Salesforce SObjects support, since it does not require any specific drivers.
## Cronjob Configuration

From your project select `CronJobs` > `Create new CronJob`, then insert a unique name for the job, an optional description, and the full name of the docker image created by your CI job, which comes in the form `<URL TO REGISTRY>/data-catalog/agent:<TAG>`. Also insert a `CronJob schedule` using the k8s cron syntax.
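For instance, a schedule running the agent every night at 2 AM would be expressed in k8s cron syntax as follows (the time is an arbitrary example):

```yaml
schedule: "0 2 * * *"
```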
By clicking on create, you then need to configure:

- the config map containing the configuration of the Data Catalog Agent
- the cronjob yaml file

Create a new config map, giving it a name of your choice (e.g., `data-catalog-agent-configuration`) and the mount path `/home/agent/.config/dc/agent/`. We can now add it to the `CronJob` yaml manifest.
The prompted yaml for the `CronJob` should be edited as follows:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  ...
spec:
  ...
  jobTemplate:
    spec:
      ...
      template:
        ...
        spec:
          ...
          containers:
            - name: ...
              ...
              env:
                - name: TNS_ADMIN # not required if it defaults to /home/agent/oracle/admin or no oracle support is needed
                  value: /path/to/tns/admin/folder
              args: ["scan"] # add args here
              volumeMounts:
                - name: data-catalog-agent-configuration # reference a volume mount
                  mountPath: /home/agent/.config/dc/agent/
          volumes:
            - name: data-catalog-agent-configuration # use the name given to the config map
              configMap:
                name: data-catalog-agent-configuration
```
Now create a new file in the config map called `configuration.json`. This file must be a valid Data Catalog Agent configuration.
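As an illustration, a minimal `configuration.json` could look like this (connection name, credentials, and database are placeholders; the available fields are described in the sections below):

```json
{
  "connections": {
    "my-postgres": {
      "type": "odbc",
      "configuration": {
        "vendor": "postgres",
        "params": {
          "uid": "test_user",
          "pwd": "password",
          "database": "my_database"
        }
      }
    }
  },
  "target": {
    "type": "stdout"
  }
}
```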
## Data Catalog Agent Configuration

The configuration is read from a file which must be located at `~/.config/dc/agent/configuration.json`, where `~` is the user's home folder (`$HOME`). Such location can be overridden via the environment variable `DC_AGENT_CONFIGURATION_FILEPATH`, which must contain the filepath to a valid configuration file.
The configuration has the following main sections:

- `connections`
- `target`
```json
{
  "connections": {
    ...
  },
  "target": {
    ...
  }
}
```
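The location override described above can be sketched as follows (the path is an arbitrary example):

```shell
# point the agent to a configuration file outside the default location
export DC_AGENT_CONFIGURATION_FILEPATH="/etc/dc-agent/configuration.json"
echo "${DC_AGENT_CONFIGURATION_FILEPATH}"
```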
Secretable fields are marked in the following sections.
### Connections

A list of entries to connect to datasources via supported integration methods. The general recipe is:

```json
{
  "connections": {
    "<CONNECTION_NAME>": {
      "type": "<INTEGRATION_METHOD>",
      "configuration": {
        // depends on the integration method
      },
      "settings": {
        // depends on the integration method
      }
    }
  }
}
```

Available integration methods are:

- `odbc`
- `http`

ODBC drivers often come in the form of a dynamic shared object and must be embedded with the binary and available at runtime (see the template repository README file).
A non-required `settings` object can be added to the connection configuration to define the following additional properties:

- `namespace`: string that overrides the default connection name;
- `batchSize`: number that defines the number of tables that can be paginated in an odbc query; if not specified, 1000 tables per query will be retrieved;
- `columnBatchSize`: number that defines the number of columns that are expected to be processed by each batch iteration; if not specified, 20 columns per batch will be used.
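For instance, a connection with all three settings explicitly set might look like this (values are illustrative):

```json
{
  "connections": {
    "<CONNECTION_NAME>": {
      "type": "odbc",
      "configuration": { /* ... */ },
      "settings": {
        "namespace": "my-datasource",
        "batchSize": 500,
        "columnBatchSize": 10
      }
    }
  }
}
```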
#### PostgreSQL

To configure a PostgreSQL ODBC connection use:

```json
{
  "connections": {
    "<CONNECTION_NAME>": { // 👈 pick a name
      "type": "odbc",
      "configuration": {
        "vendor": "postgres",
        "params": {
          "uid": "test_user",
          "pwd": "password",
          "database": "<DATABASE_NAME>"
        }
      }
    }
  }
}
```
or use an inline ODBC connection string:

```json
{
  "connections": {
    "<CONNECTION_NAME>": { // 👈 pick a name
      "type": "odbc",
      "configuration": {
        "vendor": "postgres",
        "params": "DRIVER=postgres;SERVER=0.0.0.0;PORT=5432;DATABASE=db;UID=user;PWD=pwd;"
      }
    }
  }
}
```
Other keys are `host` and `port`, which for a PostgreSQL connection default to `0.0.0.0` and `5432`. Any other configuration parameter can be appended using the key `flags`, which must be a semicolon-separated string.
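Putting these keys together, a connection overriding the defaults could be sketched as follows (a sketch only: the values are illustrative, and the placement of `host`, `port`, and `flags` alongside `uid`/`pwd` inside `params` is an assumption based on the surrounding examples):

```json
{
  "connections": {
    "<CONNECTION_NAME>": {
      "type": "odbc",
      "configuration": {
        "vendor": "postgres",
        "params": {
          "uid": "test_user",
          "pwd": "password",
          "database": "<DATABASE_NAME>",
          "host": "10.0.0.7",
          "port": "5433",
          "flags": "sslmode=require;connect_timeout=10;"
        }
      }
    }
  }
}
```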
##### Secretable fields

`uid`, `pwd` or `params` support secrets resolution.
#### Oracle

To configure an Oracle ODBC connection use:

```json
{
  "connections": {
    "<CONNECTION_NAME>": { // 👈 pick a name
      "type": "odbc",
      "configuration": {
        "vendor": "oracle",
        "params": {
          "uid": "test_user",
          "pwd": "password",
          "dbq": "<SERVICE_NAME>"
        }
      }
    }
  }
}
```
or use an inline ODBC connection string, also setting the `TNS_ADMIN` environment variable to access a `tnsnames.ora` file where the DBQ name can be resolved:

```json
{
  "connections": {
    "<CONNECTION_NAME>": { // 👈 pick a name
      "type": "odbc",
      "configuration": {
        "vendor": "oracle",
        // 👇 DATABASE_NAME must be defined in your tnsnames.ora file
        "params": "DRIVER=oracle;UID=user;PWD=p4ssw0rd;DBQ=DATABASE_NAME"
      }
    }
  }
}
```
`SERVICE_NAME` is an entry of your `tnsnames.ora` file. Such file MUST be available when running the binary, and its location MUST be exported via the `TNS_ADMIN` environment variable, unless it sits at the default location `/home/agent/oracle/admin`.
In case you don't want to add the environment variable, the content of one entry of the `tnsnames.ora` file can be inlined, such as:
```json
{
  "connections": {
    "<CONNECTION_NAME>": {
      "type": "odbc",
      "configuration": {
        "vendor": "oracle",
        "params": {
          ...,
          "dbq": "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=0.0.0.0)(PORT=5041))(CONN ..."
        }
      }
    }
  }
}
```
Other keys are:

- `version`: can be omitted UNLESS USING ORACLE 11g, since Oracle 11g does not support pagination via `OFFSET ? ROWS FETCH FIRST ? ROWS ONLY`.

  :::caution
  `version` MUST be explicit at `configuration` level.
  :::

  ```json
  {
    "type": "odbc",
    "configuration": {
      "vendor": "oracle",
      "version": 11,
      "params": { /** ... */ }
    }
  }
  ```

- `flags`: will be added, as a string, to the connection, like `"flags": "DBA=R;"`. For the available flags check the oracle documentation: 25.4.1 Format of the Connection String.

- `options`: this field defines the tables that will be used to retrieve the metadata. If not specified, the `user` option is used as default. Three options are available:

  - `user`: tables `user_tables`, `user_views` and `user_mviews` will be used. No filter needs to be specified since they default to the user's schema.

    ```json
    {
      "type": "odbc",
      "configuration": {
        "vendor": "oracle",
        "params": { /** ... */ },
        "options": { "user": {} }
      }
    }
    ```

  - `all`: tables `all_tables`, `all_views` and `all_mviews` will be used. Optionally, a list of schemas can be provided in the `owners` field to filter the desired schemas.

    ```json
    {
      "type": "odbc",
      "configuration": {
        "vendor": "oracle",
        "params": { /** ... */ },
        "options": {
          "all": {
            "owners": ["some_schema_name"]
          }
        }
      }
    }
    ```

  - `dba`: tables `dba_tables`, `dba_views` and `dba_mviews` will be used. Optionally, a list of schemas can be provided in the `owners` field to filter the desired schemas.

    ```json
    {
      "type": "odbc",
      "configuration": {
        "vendor": "oracle",
        "params": { /** ... */ },
        "options": {
          "dba": {
            "owners": ["some_schema_name"]
          }
        }
      }
    }
    ```

    :::caution
    In this case, remember that the password needs to be written in the form `<password> AS SYSDBA`.
    :::
Connection with an oracle wallet is also supported.

Preparation must be done on the template repository:

- The whole wallet content must be available under the same path as the `tnsnames.ora` file (they usually come together)
- The file `sqlora.net` inside the wallet must be updated to reflect the `TNS_ADMIN` path:

  ```
  WALLET_LOCATION = (SOURCE = (METHOD = file) (METHOD_DATA = (DIRECTORY="/path/to/folder")))
  SSL_SERVER_DN_MATCH=yes
  ```

Also the environment variable must be set:

- The `TNS_ADMIN` environment variable must be set explicitly, unless the default `/home/agent/oracle/admin` is used
##### Secretable fields

`uid`, `pwd` or `params` support secrets resolution.
#### MS SQL Server

To configure an MS SQL Server ODBC connection use:

```json
{
  "connections": {
    "<CONNECTION_NAME>": {
      "type": "odbc",
      "configuration": {
        "vendor": "mssql",
        "params": {
          "uid": "test_user",
          "pwd": "password",
          "database": "<DATABASE_NAME>"
        }
      }
    }
  }
}
```
or use an inline ODBC connection string:

```json
{
  "connections": {
    "<CONNECTION_NAME>": {
      "type": "odbc",
      "configuration": {
        "vendor": "mssql",
        "params": "Driver=mssql;Server=0.0.0.0,1433;Database=db;Uid=user;Pwd=p4ssw0rd;TrustServerCertificate=yes;"
      }
    }
  }
}
```
Other keys are `host` and `port`, which for an MS SQL Server connection default to `0.0.0.0` and `1433`. Any extra connection property can be added via the key `flags`, which will be added to the connection as a semicolon-separated string. For local development purposes it is quite useful to add the flag `"TrustServerCertificate=yes"`.
##### Secretable fields

`uid`, `pwd` or `params` support secrets resolution.
#### MySQL

To configure a MySQL ODBC connection use:

```json
{
  "connections": {
    "<CONNECTION_NAME>": {
      "type": "odbc",
      "configuration": {
        "vendor": "mysql",
        "params": {
          "uid": "test_user",
          "pwd": "password",
          "database": "<DATABASE_NAME>"
        }
      }
    }
  }
}
```
or use an inline ODBC connection string:

```json
{
  "connections": {
    "<CONNECTION_NAME>": {
      "type": "odbc",
      "configuration": {
        "vendor": "mysql",
        "params": "DRIVER=mysql;SERVER=0.0.0.0;PORT=3306;DATABASE=db;UID=user;PWD=p4ssw0rd;"
      }
    }
  }
}
```
Other keys are `host` and `port`, which for a MySQL connection default to `0.0.0.0` and `3306`. Any extra connection property can be added via the key `flags`, which will be added to the connection as a semicolon-separated string.
##### Secretable fields

`uid`, `pwd` or `params` support secrets resolution.
#### Mia CRUD Service

To configure a Mia CRUD Service connection use:

```json
{
  "connections": {
    "<CONNECTION_NAME>": {
      "type": "http",
      "configuration": {
        "vendor": "mia-crud-service",
        "params": {
          "baseUrl": "http://mia-crud-service:3000",
          "endpoint": "/-/schemas",
          "healthcheck": "/-/healthz",
          "headers": {
            "accept": "application/x-ndjson"
          }
        }
      }
    }
  }
}
```
The driver calls the `/-/schemas` endpoint to extract all the data models from MongoDB. Since the response is a JSON Schema, some information in the `ddl` object may be missing, like the `size` of the properties.
Additionally, the version of the CRUD Service is obtained from the `/-/healthz` endpoint.
In case custom TLS parameters are required, such as a custom root CA or an insecure HTTPS connection, use the parameter `tls` with keys `insecure` and/or `certificate`:
```json
{
  "connections": {
    "<CONNECTION_NAME>": {
      "type": "http",
      "configuration": {
        ...,
        "params": {
          "tls": {
            "insecure": true
          }
        }
      }
    }
  }
}
```
or
```json
{
  "connections": {
    "<CONNECTION_NAME>": {
      "type": "http",
      "configuration": {
        ...,
        "params": {
          "tls": {
            "certificate": {
              "type": "file",
              "path": "/path/to/root/CA"
            }
          }
        }
      }
    }
  }
}
```
#### Salesforce SObjects

To configure a Salesforce SObjects HTTP connection you can use two authentication methods:

- jwt-bearer:

  ```json
  {
    "connections": {
      "<CONNECTION_NAME>": {
        "type": "http",
        "configuration": {
          "vendor": "salesforce-sobjects",
          "params": {
            "authenticationFlow": "jwt-bearer",
            "baseUrl": "https://my-subdomain.my.salesforce.com",
            "clientId": "XXXXXXXXXXXXXXXXXXXXXXXXXX",
            "privateKey": "/path/to/private-key/key.pem",
            "username": "my-user@email.com",
            "apiVersion": "59.0"
          }
        }
      }
    }
  }
  ```

- username-password (deprecated):
  ```json
  {
    "connections": {
      "<CONNECTION_NAME>": {
        "type": "http",
        "configuration": {
          "vendor": "salesforce-sobjects",
          "params": {
            "authenticationFlow": "username-password",
            "baseUrl": "https://my-subdomain.my.salesforce.com",
            "apiVersion": "31.0",
            "clientId": "XXXXXXXXXXXXXXXXXXXXXXXXXX",
            "clientSecret": "XXXXXXXXXXXXXXXXXXXXXXXXXX",
            "username": "my-user@email.com",
            "password": "XXXXXXXXXXXXXXXXXXXXXXXXXX",
            "securityToken": "XXXXXXXXXXXXXXXXXXXXXXXXXX"
          }
        }
      }
    }
  }
  ```
The range of supported `apiVersion` goes from `31.0` to `59.0`.
The driver uses the Salesforce SObjects API to retrieve the schema of all the SObjects you have. Make sure the user you configure has the right permissions to retrieve all the SObjects you need.
If not all the SObjects are retrieved, you'll have to follow these steps to expose them through the API:
For custom SObjects:

- Go to your Salesforce dashboard
- Access Setup
- Access the "Permission Sets" section
- Create a new Permission Set (e.g. API for Custom Objects) with license type "Salesforce API Integration"
- Click on "Object Settings" inside the new Permission Set, then, for each custom SObject, update the visibility in the desired way (e.g. View All, Modify All)
- Assign the new Permission Set to the user with "Salesforce Integration" user license and "Salesforce API Only System Integrations" profile
For standard SObjects:

- Go to your Salesforce dashboard
- Access Setup
- Access the "Permission Sets" section
- Create a Permission Set (e.g. API for Standard Objects) without specifying a license type
- Click on "Object Settings" inside the new Permission Set, then, for each standard SObject, update the visibility in the desired way (e.g. View All, Modify All)
- Assign the new Permission Set to the user with "Salesforce Integration" user license and "Salesforce API Only System Integrations" profile
#### JWT Authorization

To be able to connect to the Salesforce API using the JWT Bearer flow, follow these steps:

1. Follow steps one and two from this guide: https://help.salesforce.com/s/articleView?id=sf.connected_app_create_api_integration.htm&type=5
2. Upload your self-signed certificate in PEM format (step 5 from the guide). If you don't have a certificate, follow these instructions:
   - Generate an RSA key with the command `openssl genrsa -out tls/key.pem`
   - Create a request to sign the certificate with `openssl req -new -key tls/key.pem -out tls/sf_connect.csr`
   - Sign the request with your private key with `openssl x509 -signkey tls/key.pem -in tls/sf_connect.csr -req -out tls/sf_connect.crt`
   - Upload the generated certificate (`tls/sf_connect.crt`)
3. Enable the OAuth2 mechanism following the 10th step from the guide, creating a connected app.
4. Finally, enable the JWT access with step 14 of the guide.
Now you should have everything you need to fill out the configuration parameters. If you are using a testing instance you'll need to set the param `loginUrl` to `https://test.salesforce.com/`.
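For instance, when targeting a testing instance, the extra parameter sits alongside the other `params` (a sketch; the remaining fields are elided):

```json
{
  "params": {
    ...,
    "loginUrl": "https://test.salesforce.com/"
  }
}
```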
##### Secretable fields

`clientId`, `username`, `clientSecret`, `password`, `securityToken` or `privateKey` support secrets resolution.
### Targets

There are 4 targets available:

- `stdout` [default]
- `mongodb`
- `file`
- `mia-console`

For each listed connection, after metadata is retrieved, the agent sequentially sends data to the target. The final content is an array of models, where the format of its records changes according to the target:

- `stdout`, `file` and `mia-console`: the models are written in the native agent format, which is defined in the following JSON schema;
- `mongodb`: the models are written in a format that is supported by the Data Catalog application, as defined in the following JSON schema.
#### Standard Output

To explicitly configure the `stdout` target use:

```json
{
  ...,
  "target": {
    "type": "stdout"
  }
}
```
#### MongoDB

The MongoDB target enables the Data Catalog Agent to feed data from external sources to the Data Catalog application.
To configure the `mongodb` target use:

```json
{
  // ...
  "target": {
    "type": "mongodb",
    "url": "mongodb://test:27017/?replicaSet=rs", // 👈 mongodb connection string: the database must be a replica set
    "database": "test_database" // 👈 if defined, it will be used as default database to store the models
  }
}
```

The target will write the content of the connections to a MongoDB replica set database, in a collection named `open-lineage-datasets`.
To enforce document validation on that collection, be sure to run Data Catalog Configuration Scripts before executing the agent.
#### File

To configure the `file` target use:

```json
{
  ...,
  "target": {
    "type": "file"
  }
}
```

which will save output files in the folder `./output`. To override this use:

```json
{
  ...,
  "target": {
    "type": "file",
    "dir": "/path/to/dir"
  }
}
```
#### MIA Console

This target has been deprecated in favour of `mongodb` to support the Data Catalog solution.

To configure the `mia-console` target use:

```json
{
  ...,
  "target": {
    "type": "mia-console",
    "baseUrl": "https://my-server-url", // 👈 mia console base url
    "projectId": "1234", // 👈 models are pushed towards the project with this id
    "apiKey": "1234", // 👈 mia console api key, may vary across test/preview/prod environments
    "credentials": {
      // machine to machine credentials
    }
  }
}
```
Credentials are used to obtain an access token. In order to do that you must provide:

```json
{
  ...,
  "target": {
    ...,
    "credentials": {
      "clientId": "1234",
      "clientKeyId": "123",
      "privateKey": {"type": "file", "path": "tls/key.pem"} // 👈 either a file system path or an rsa private key inlined with `\n`
    }
  }
}
```
On type `mia-console` the auth endpoint can be customized using `oauthTokenEndpoint`. The service endpoint hosted on `mia-console` can be overridden in 2 ways:

- customizing the revision using the field `revision`
- customizing the overall url using the field `dataCatalogEndpoint`
##### Secretable fields

`clientId`, `clientKeyId`, or `privateKey` support secrets resolution.