Configuration
Due to licensing Data Catalog Agent requires packaging with third-party software, unless your datasources are PostgreSQL databases whose support is already included in the docker image of Data Catalog Agent.
In case only PostgreSQL is needed you can skip to the cronjob configuration section.
Datasource support
Mia's Data Catalog marketplace item upon creation, creates also a git repository containing:
- a CI config file
- a docker
Dockerfile
the Dockerfile
provides several build steps to include the datasource support you need. Supported datasources are:
- Oracle Database from 11 to 21
- PostgreSQL from 9.6 to 16
- Mysql from 5.7 to 8.2
- Mssql from 2017 to 2022
Upon inspection you'll notice that the Dockerfile
contains targets for each database:
FROM ... AS data-catalog-agent-oracle
# rest of the file
FROM ... AS data-catalog-agent-mysql
# and so on...
Each push on the default branch or any tag starting with v
will create a new docker image including the selected support. The default support is postgres
but it can be changed:
- by editing the
.gitlab-ci.yml
file overriding the initial value of theDOCKER_TARGET
frompostgres
to the database you need to support - one-of change upon manually triggering the CI overriding the same variable
DOCKER_TARGET
Your repository contains a README.md
file which also contains a lists of operations needed to include drivers for oracle
, mysql
, mssql
databases.
PostgreSQL: add support
No further actions needed.
Oracle Database: add support
In the template repository, to enable Oracle Database support actions are needed:
- edit the value of
DOCKER_TARGET
variable in.gitlab-ci.yml
tooracle
or alternatively select it on manual run - embed oracle drivers (see embed drivers)
- [optional] add
tnsnames.ora
and/orOracle Wallet
(see add tns/wallet).
Embed drivers
In the template repository, download drivers from the official page according with the version you need.
The following assets are needed:
- ODBC driver
- instantclient basiclite
# on unix
export ORACLE_CLIENT_VERSION="<VERSION>"
unzip -j instantclient-basiclite-linux.x64-${ORACLE_CLIENT_VERSION}dbru.zip "instantclient_*/lib*" -d lib/x86_64-unknown-linux-gnu/oracle
unzip -j instantclient-odbc-linux.x64-${ORACLE_CLIENT_VERSION}dbru.zip "instantclient_*/lib*" -d lib/x86_64-unknown-linux-gnu/oracle
create a symlink
export LIBSQORA_NAME=`ls lib/x86_64-unknown-linux-gnu/oracle | grep -E -i '^libsqora\.so\.'`
ln -s "${LIBSQORA_NAME}" lib/x86_64-unknown-linux-gnu/oracle/libsqora.so
Add TNS support
Oracle connectivity can be handled with tnsnames.ora
files and/or wallets. The container expects such assets in the /home/agent/oracle/admin
folder. The TNS_ADMIN
is already set to /home/agent/oracle/admin
and the container will copy the content of the oracle/admin
folder in this repository.
Drop your wallet assets or tnsnames.ora
file in oracle/admin
to include them.
In case you want to use a wallet remind to [tune] the sqlora.net
file registering the /home/agent/oracle/admin
path:
WALLET_LOCATION = (SOURCE = (METHOD = file) (METHOD_DATA = (DIRECTORY="/home/agent/oracle/admin"))) SSL_SERVER_DN_MATCH=yes
MySQL: add support
In the template repository, ODBC driver for MySQL must be downloaded from the official website selecting:
- version
8.3.0
- OS
Linux - Generic
- OS version
glibc 2.28 x86_64
then run:
export MYSQL_NAME=mysql-connector-odbc-8.3.0-linux-glibc2.28-x86-64bit
tar -xvf ${MYSQL_NAME}.tar.gz ${MYSQL_NAME}/lib/libmyodbc8w.so
mv ${MYSQL_NAME}/lib/* lib/x86_64-unknown-linux-gnu/mysql
MS SQL Server: add support
No further actions beside editing .gitlab-ci.yml
are required.
Cronjob Configuration
From your project select CronJobs
> Create new CronJob
, insert a unique name for the job, an optional description, the full name of the docker image created by your CI job. It will come in the form <URL TO REGISTRY>/<NAME OF PROJECT>/data-catalog-agent:<TAG>
. Also insert a CronJob schedule
using the k8s syntax.
By clicking on create, you need to configure:
- the config map containing the configuration of the Data Catalog Agent
- the cronjob yaml file
Create a new config map giving it a name of your choice (e.g., data-catalog-agent-configuration
) and mount path /home/agent/.config/dc/agent/
. We can now add it to the CronJob
yaml manifest.
The prompted yaml for the CronJob
should be edited as follows:
apiVersion: batch/v1
kind: CronJob
metadata:
...
spec:
...
jobTemplate:
spec:
...
template:
...
spec:
...
containers:
containers:
- name: ...
...
env:
- name: TNS_ADMIN # not required if it defaults to /home/agent/oracle/admin or no oracle support is needed
value: /path/to/tns/admin/folder
args: ["scan"] # add args here
volumeMounts:
- name: data-catalog-agent-configuration # reference a volume mount
mountPath: /home/agent/.config/dc/agent/
volumes:
- name: data-catalog-agent-configuration # use the name given to the config map
configMap:
name: data-catalog-agent-configuration
Now create a new file in the config map called configuration.json
. This file must be a valid Data Catalog Agent configuration.
Data Catalog Agent Configuration
The configuration is read from a file which must be located ad ~/.config/dc/agent/configuration.json
where ~
is the user's home folder ($HOME
). Such location can be overridden via the environment variable DC_AGENT_CONFIGURATION_FILEPATH
which must contain a filepath to a valid configuration file.
The configuration has the following main sections:
- connections
- target
{
"connections": {
...
},
"target": {
...
}
}
Connections
A list of entries to connect to datasources via supported integration methods. The general recipe is:
{
"connections": {
"<CONNECTION_NAME>": {
"type": "<INTEGRATION_METHOD>",
"configuration": {
// depends on the integration method
}
}
}
}
Available integration methods are:
Such drivers come often in the form of a dynamic shared object
and must be embedded with the binary and available at runtime (see the template repository README file).
PostgreSQL
To configure a PostgreSQL connection use:
{
"connections": {
"<CONNECTION_NAME>": { // 👈 pick a name
"type": "odbc",
"configuration": {
"vendor": "postgres",
"params": {
"uid": "test_user",
"pwd": "password",
"database": "<DATABASE_NAME>"
}
}
}
}
}
Other keys are host
and port
which for a PostgreSQL connection are defaulted to 0.0.0.0
and 5432
. Any other configuration parameter can be appended using the key flags
which must be a semicolon separated string.
Oracle
To configure an Oracle connection use:
{
"connections": {
"<CONNECTION_NAME>": { // 👈 pick a name
"type": "odbc",
"configuration": {
"vendor": "oracle",
"params": {
"uid": "test_user",
"pwd": "password",
"dbq": "<SERVICE_NAME>"
}
}
}
}
}
SERVICE_NAME
is an entry of your tnsnames.ora
file. Such file MUST be available when running the binary and its location MUST be exported to an environment variable unless defaulted to /home/agent/oracle/admin
.
in case you don't want to add the environment variable, the content of one entry of the tnsnames.ora
file can be inlined such as:
{
"connections": {
"<CONNECTION_NAME>": {
"type": "odbc",
"configuration": {
"vendor": "oracle",
"params": {
...,
"dbq": "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=0.0.0.0)(PORT=5041))(CONN ..."
}
}
}
}
}
Other keys are:
version
which can be neglected UNLESS USING ORACLE 11g since it does not support pagination viaOFFSET ? ROWS FETCH FIRST ? ROWS ONLY
.flags
which will be added, as a string, to the connection like"flags": "DBA=R;"
, for the available flags check oracle documentation: 25.4.1 Format of the Connection String.
version
MUST be explicit if "11".
Connection with an oracle wallet is also supported:
Preparation must be done on the template repository:
- The whole wallet content must be available under the same path of the
tnsnames.ora
(they usually come together) - The file
sqlora.net
inside the wallet must be updated to reflect theTNS_ADMIN
pathWALLET_LOCATION = (SOURCE = (METHOD = file) (METHOD_DATA = (DIRECTORY="/path/to/folder"))) SSL_SERVER_DN_MATCH=yes
Also the environment variable must be set:
- The
TNS_ADMIN
environment variable must be set explicitly unless defaulted to/home/agent/oracle/admin
MS SQL server
To configure a MS SQL Server connection use:
{
"connections": {
"<CONNECTION_NAME>": {
"type": "odbc",
"configuration": {
"vendor": "mssql",
"params": {
"uid": "test_user",
"pwd": "password",
"database": "<DATABASE_NAME>"
}
}
}
}
}
Other keys are host
and port
which for a PostgreSQL connection are defaulted to 0.0.0.0
and 1433
. Any extra connection property can be added via the key flags
which will be added, as a string, to the connection, as a semicolon separated string. It is quite useful, for local development purposes to add the flag "TrustServerCertificate=yes"
.
MySQL
To configure a MySQL connection use:
{
"connections": {
"<CONNECTION_NAME>": {
"type": "odbc",
"configuration": {
"vendor": "mysql",
"params": {
"uid": "test_user",
"pwd": "password",
"database": "<DATABASE_NAME>"
}
}
}
}
}
Other keys are host
and port
which for a PostgreSQL connection are defaulted to 0.0.0.0
and 3306
. Any extra connection property can be added via the key flags
which will be added, as a string, to the connection, as a semicolon separated string.
Targets
There are 3 targets available:
- [default] stdout
- file
- mia-console
For each listed connection, after metadata is retrieved, agent
sequentially sends data to the target as:
json
forstdout
andfile
ndjson
formia-console
The final content is an array
of models. Model spec is given in the form of a JSON schema.
To explicitly configure the stdout
target use:
{
...,
"target": {
"type": "stdout"
}
}
To configure the file
target use:
{
...,
"target": {
"type": "file"
}
}
which will save output files in the folder ./output
. To override this use:
{
...,
"target": {
"type": "file",
"dir": "/path/to/dir"
}
}
To configure the mia-console
target use:
{
...,
"target": {
"type": "mia-console",
"baseUrl": "https://my-server-url", // 👈 mia console base url
"projectId": "1234", // 👈 models are pushed towards the project with this id
"apiKey": "1234", // 👈 mia console api key, may vary across test/preview/prod environments
"credentials": {
// machine to machine credentials
}
}
}
credentials are used to obtain an access token. In order to do that you must provide:
{
...,
"target": {
...,
"credentials": {
"clientId": "1234",
"clientKeyId": "123",
"privateKey": "tls/key.pem" // 👈 either a file system path or an rsa private key inlined with `\n`
}
}
}
on type
mia-console
the auth endpoint can be customized using oauthTokenEndpoint
. The service endpoint hosted on mia-console
can be overridden in 2 ways:
- customizing the revision using the field
revision
- or customizing the overall url using the field
dataCatalogEndpoint