# Configuration
Due to licensing restrictions, the Data Catalog Agent requires packaging with third-party software, unless your datasources are PostgreSQL databases, whose support is already included in the Data Catalog Agent docker image.
If you only need PostgreSQL, you can skip to the CronJob configuration section.
## Datasource support
Upon creation, Mia's Data Catalog marketplace item also creates a git repository containing:
- a CI config file (`.gitlab-ci.yml`)
- a `Dockerfile`
The Dockerfile provides several build steps to include the datasource support you need. Supported datasources are:
- Oracle Database, from 11 to 21
- PostgreSQL, from 9.6 to 16
- MySQL, from 5.7 to 8.2
- MS SQL Server, from 2017 to 2022
Upon inspection you'll notice that the Dockerfile contains targets for each database:
```dockerfile
FROM ... AS data-catalog-agent-oracle
# rest of the file

FROM ... AS data-catalog-agent-mysql
# and so on...
```
Each push on the default branch, or any tag starting with `v`, will create a new docker image including the selected support. The default support is `postgres`, but it can be changed:
- by editing the `.gitlab-ci.yml` file, overriding the initial value of `DOCKER_TARGET` from `postgres` to the database you need to support (as sketched below)
- as a one-off change, by manually triggering the CI and overriding the same `DOCKER_TARGET` variable
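A minimal sketch of that edit, assuming `DOCKER_TARGET` is declared in the top-level `variables` section of your `.gitlab-ci.yml` (the rest of the file is omitted):

```yaml
# .gitlab-ci.yml — only the variable override is shown
variables:
  DOCKER_TARGET: oracle # initial value was: postgres
```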
Your repository also contains a README.md file, which lists the operations needed to include drivers for Oracle, MySQL, and MSSQL databases.
### PostgreSQL: add support
No further actions needed.
### Oracle Database: add support
In the template repository, the following actions are needed to enable Oracle Database support:
- edit the value of the `DOCKER_TARGET` variable in `.gitlab-ci.yml` to `oracle`, or alternatively select it on a manual CI run
- embed the Oracle drivers (see embed drivers)
- [optional] add `tnsnames.ora` and/or an Oracle Wallet (see add TNS/wallet).
#### Embed drivers
In the template repository, download the drivers from the official page according to the version you need.
The following assets are needed:
- ODBC driver
- instantclient basiclite
```sh
# on unix
export ORACLE_CLIENT_VERSION="<VERSION>"
unzip -j instantclient-basiclite-linux.x64-${ORACLE_CLIENT_VERSION}dbru.zip "instantclient_*/lib*" -d lib/x86_64-unknown-linux-gnu/oracle
unzip -j instantclient-odbc-linux.x64-${ORACLE_CLIENT_VERSION}dbru.zip "instantclient_*/lib*" -d lib/x86_64-unknown-linux-gnu/oracle
```
Then create a symlink:
```sh
export LIBSQORA_NAME=`ls lib/x86_64-unknown-linux-gnu/oracle | grep -E -i '^libsqora\.so\.'`
ln -s "${LIBSQORA_NAME}" lib/x86_64-unknown-linux-gnu/oracle/libsqora.so
```
#### Add TNS support
Oracle connectivity can be handled with `tnsnames.ora` files and/or wallets. The container expects such assets in the `/home/agent/oracle/admin` folder. The `TNS_ADMIN` environment variable is already set to `/home/agent/oracle/admin`, and the container copies in the content of this repository's `oracle/admin` folder.
Drop your wallet assets or `tnsnames.ora` file in `oracle/admin` to include them.
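For instance, from the root of the template repository (the source paths are placeholders):

```sh
# include wallet assets and/or the tnsnames.ora file in the image
cp /path/to/wallet/* oracle/admin/
cp /path/to/tnsnames.ora oracle/admin/
```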
In case you want to use a wallet, remember to tune the `sqlnet.ora` file, registering the `/home/agent/oracle/admin` path:
```
WALLET_LOCATION = (SOURCE = (METHOD = file) (METHOD_DATA = (DIRECTORY="/home/agent/oracle/admin")))
SSL_SERVER_DN_MATCH=yes
```
### MySQL: add support
In the template repository, the ODBC driver for MySQL must be downloaded from the official website, selecting:
- version: `8.3.0`
- OS: `Linux - Generic`
- OS version: `glibc 2.28 x86_64`
then run:
```sh
export MYSQL_NAME=mysql-connector-odbc-8.3.0-linux-glibc2.28-x86-64bit
tar -xvf ${MYSQL_NAME}.tar.gz ${MYSQL_NAME}/lib/libmyodbc8w.so
mv ${MYSQL_NAME}/lib/* lib/x86_64-unknown-linux-gnu/mysql
```
### MS SQL Server: add support
No further actions besides editing `.gitlab-ci.yml` are required.
## CronJob Configuration
From your project, select CronJobs > Create new CronJob, then insert a unique name for the job, an optional description, and the full name of the docker image created by your CI job, which comes in the form `<URL TO REGISTRY>/<NAME OF PROJECT>/data-catalog-agent:<TAG>`. Also insert a CronJob schedule using the k8s cron syntax.
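For instance, a hypothetical schedule running a scan every day at 4 AM:

```
0 4 * * *
```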
By clicking on create, you then need to configure:
- the config map containing the configuration of the Data Catalog Agent
- the CronJob yaml manifest
Create a new config map, giving it a name of your choice (e.g., `data-catalog-agent-configuration`) and the mount path `/home/agent/.config/dc/agent/`. We can now add it to the CronJob yaml manifest.
The prompted yaml for the CronJob should be edited as follows:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  ...
spec:
  ...
  jobTemplate:
    spec:
      ...
      template:
        ...
        spec:
          ...
          containers:
            - name: ...
              ...
              env:
                - name: TNS_ADMIN # not required if it defaults to /home/agent/oracle/admin or no oracle support is needed
                  value: /path/to/tns/admin/folder
              args: ["scan"] # add args here
              volumeMounts:
                - name: data-catalog-agent-configuration # reference a volume mount
                  mountPath: /home/agent/.config/dc/agent/
          volumes:
            - name: data-catalog-agent-configuration # use the name given to the config map
              configMap:
                name: data-catalog-agent-configuration
```
Now create a new file in the config map called `configuration.json`. This file must be a valid Data Catalog Agent configuration.
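As a preview, a minimal sketch of such a configuration (a single PostgreSQL connection printed to `stdout`; the connection name and credentials are placeholders; see the next section for the full reference):

```json
{
  "connections": {
    "my-postgres-db": {
      "type": "odbc",
      "configuration": {
        "vendor": "postgres",
        "params": {
          "uid": "test_user",
          "pwd": "password",
          "database": "<DATABASE_NAME>"
        }
      }
    }
  },
  "target": {
    "type": "stdout"
  }
}
```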
## Data Catalog Agent Configuration
The configuration is read from a file which must be located at `~/.config/dc/agent/configuration.json`, where `~` is the user's home folder (`$HOME`). This location can be overridden via the environment variable `DC_AGENT_CONFIGURATION_FILEPATH`, which must contain the filepath to a valid configuration file.
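For instance (the path is a placeholder):

```sh
export DC_AGENT_CONFIGURATION_FILEPATH=/path/to/configuration.json
```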
The configuration has the following main sections:
- `connections`
- `target`
```json
{
  "connections": {
    ...
  },
  "target": {
    ...
  }
}
```
### Connections
A list of entries to connect to datasources via supported integration methods. The general recipe is:
```json
{
  "connections": {
    "<CONNECTION_NAME>": {
      "type": "<INTEGRATION_METHOD>",
      "configuration": {
        // depends on the integration method
      }
    }
  }
}
```
Available integration methods are:
- `odbc`, via vendor-specific ODBC drivers
Such drivers often come in the form of a dynamic shared object and must be embedded with the binary and available at runtime (see the template repository README file).
#### PostgreSQL
To configure a PostgreSQL connection use:
```json
{
  "connections": {
    "<CONNECTION_NAME>": { // 👈 pick a name
      "type": "odbc",
      "configuration": {
        "vendor": "postgres",
        "params": {
          "uid": "test_user",
          "pwd": "password",
          "database": "<DATABASE_NAME>"
        }
      }
    }
  }
}
```
Other keys are `host` and `port`, which for a PostgreSQL connection default to `0.0.0.0` and `5432`. Any other configuration parameter can be appended using the key `flags`, which must be a semicolon-separated string.
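For instance, a sketch of `params` overriding the defaults (the host, port, and `sslmode` flag are illustrative values, and the string type of `port` is an assumption):

```json
"params": {
  "uid": "test_user",
  "pwd": "password",
  "database": "<DATABASE_NAME>",
  "host": "my-postgres.example.com", // 👈 overrides 0.0.0.0
  "port": "5433",                    // 👈 overrides 5432
  "flags": "sslmode=require;"        // 👈 semicolon-separated extra parameters
}
```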
#### Oracle
To configure an Oracle connection use:
```json
{
  "connections": {
    "<CONNECTION_NAME>": { // 👈 pick a name
      "type": "odbc",
      "configuration": {
        "vendor": "oracle",
        "params": {
          "uid": "test_user",
          "pwd": "password",
          "dbq": "<SERVICE_NAME>"
        }
      }
    }
  }
}
```
`SERVICE_NAME` is an entry of your `tnsnames.ora` file. Such file MUST be available when running the binary, and its location MUST be exported via the `TNS_ADMIN` environment variable unless it matches the default `/home/agent/oracle/admin`.
In case you don't want to set the environment variable, the content of one entry of the `tnsnames.ora` file can be inlined, such as:
```json
{
  "connections": {
    "<CONNECTION_NAME>": {
      "type": "odbc",
      "configuration": {
        "vendor": "oracle",
        "params": {
          ...,
          "dbq": "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=0.0.0.0)(PORT=5041))(CONN ..."
        }
      }
    }
  }
}
```
Other keys are:
- `version`, which can be neglected UNLESS USING ORACLE 11g, since 11g does not support pagination via `OFFSET ? ROWS FETCH FIRST ? ROWS ONLY`; `version` MUST be explicit if `"11"`
- `flags`, which will be added, as a string, to the connection, like `"flags": "DBA=R;"`; for the available flags check the Oracle documentation: 25.4.1 Format of the Connection String
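For instance, a sketch of the `params` for an Oracle 11g connection carrying both keys (the values are the ones quoted above; placing these keys inside `params`, by analogy with `host` and `port` for other vendors, is an assumption):

```json
"params": {
  "uid": "test_user",
  "pwd": "password",
  "dbq": "<SERVICE_NAME>",
  "version": "11",  // 👈 mandatory on Oracle 11g
  "flags": "DBA=R;" // 👈 extra connection string flags
}
```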
Connection with an Oracle wallet is also supported.
Preparation must be done on the template repository:
- the whole wallet content must be available under the same path as the `tnsnames.ora` file (they usually come together)
- the `sqlnet.ora` file inside the wallet must be updated to reflect the `TNS_ADMIN` path:
```
WALLET_LOCATION = (SOURCE = (METHOD = file) (METHOD_DATA = (DIRECTORY="/path/to/folder")))
SSL_SERVER_DN_MATCH=yes
```
Also, the environment variable must be set:
- the `TNS_ADMIN` environment variable must be set explicitly unless the default `/home/agent/oracle/admin` is used
#### MS SQL Server
To configure a MS SQL Server connection use:
```json
{
  "connections": {
    "<CONNECTION_NAME>": {
      "type": "odbc",
      "configuration": {
        "vendor": "mssql",
        "params": {
          "uid": "test_user",
          "pwd": "password",
          "database": "<DATABASE_NAME>"
        }
      }
    }
  }
}
```
Other keys are `host` and `port`, which for a MS SQL Server connection default to `0.0.0.0` and `1433`. Any extra connection property can be added via the key `flags`, a semicolon-separated string appended, as-is, to the connection. For local development purposes, it is quite useful to add the flag `"TrustServerCertificate=yes"`.
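For instance, a sketch of `params` for a local-development connection (the host and port are illustrative):

```json
"params": {
  "uid": "test_user",
  "pwd": "password",
  "database": "<DATABASE_NAME>",
  "host": "127.0.0.1",
  "port": "1433",
  "flags": "TrustServerCertificate=yes;" // 👈 skip certificate validation, local development only
}
```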
#### MySQL
To configure a MySQL connection use:
```json
{
  "connections": {
    "<CONNECTION_NAME>": {
      "type": "odbc",
      "configuration": {
        "vendor": "mysql",
        "params": {
          "uid": "test_user",
          "pwd": "password",
          "database": "<DATABASE_NAME>"
        }
      }
    }
  }
}
```
Other keys are `host` and `port`, which for a MySQL connection default to `0.0.0.0` and `3306`. Any extra connection property can be added via the key `flags`, a semicolon-separated string appended, as-is, to the connection.
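The shape is the same as for the other vendors; for instance (the host, port, and `CHARSET` flag are only illustrations of the format):

```json
"params": {
  "uid": "test_user",
  "pwd": "password",
  "database": "<DATABASE_NAME>",
  "host": "my-mysql.example.com",
  "port": "3306",
  "flags": "CHARSET=utf8mb4;"
}
```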
### Targets
There are 3 targets available:
- [default] `stdout`
- `file`
- `mia-console`
For each listed connection, after metadata is retrieved, the agent sequentially sends data to the target as:
- `json` for `stdout` and `file`
- `ndjson` for `mia-console`
The final content is an array of models. The model spec is given in the form of a JSON schema.
To explicitly configure the `stdout` target use:
```json
{
  ...,
  "target": {
    "type": "stdout"
  }
}
```
To configure the `file` target use:
```json
{
  ...,
  "target": {
    "type": "file"
  }
}
```
which will save output files in the folder `./output`. To override this, use:
```json
{
  ...,
  "target": {
    "type": "file",
    "dir": "/path/to/dir"
  }
}
```
To configure the `mia-console` target use:
```json
{
  ...,
  "target": {
    "type": "mia-console",
    "baseUrl": "https://my-server-url", // 👈 mia console base url
    "projectId": "1234", // 👈 models are pushed towards the project with this id
    "apiKey": "1234", // 👈 mia console api key, may vary across test/preview/prod environments
    "credentials": {
      // machine to machine credentials
    }
  }
}
```
`credentials` are used to obtain an access token. In order to do that, you must provide:
```json
{
  ...,
  "target": {
    ...,
    "credentials": {
      "clientId": "1234",
      "clientKeyId": "123",
      "privateKey": "tls/key.pem" // 👈 either a file system path or an rsa private key inlined with `\n`
    }
  }
}
```
On type `mia-console`, the auth endpoint can be customized using `oauthTokenEndpoint`. The service endpoint hosted on mia-console can be overridden in 2 ways:
- by customizing the revision, using the field `revision`
- or by customizing the overall url, using the field `dataCatalogEndpoint`
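A sketch combining these customizations (all values are placeholders; per the list above, `revision` and `dataCatalogEndpoint` are alternative override styles):

```json
{
  ...,
  "target": {
    "type": "mia-console",
    "baseUrl": "https://my-server-url",
    "projectId": "1234",
    "apiKey": "1234",
    "oauthTokenEndpoint": "https://my-server-url/custom/token/endpoint", // 👈 custom auth endpoint
    "revision": "my-revision", // 👈 or: "dataCatalogEndpoint": "https://my-custom-url"
    "credentials": {
      ...
    }
  }
}
```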