# Configuration
Due to licensing restrictions, the Data Catalog Agent requires packaging with third-party software, unless your datasources are PostgreSQL databases, whose support is already included in the Data Catalog Agent docker image.
If you only need PostgreSQL, you can skip to the CronJob configuration section.
## Datasource support
Upon creation, Mia's Data Catalog marketplace item also creates a git repository containing:
- a CI config file (`.gitlab-ci.yml`)
- a `Dockerfile`
The Dockerfile provides several build steps to include the datasource support you need. Supported datasources are:
- Oracle Database, from 11 to 21
- PostgreSQL, from 9.6 to 16
- MySQL, from 5.7 to 8.2
- MS SQL Server, from 2017 to 2022
Upon inspection you'll notice that the Dockerfile contains targets for each database:
```dockerfile
FROM ... AS data-catalog-agent-oracle
# rest of the file

FROM ... AS data-catalog-agent-mysql
# and so on...
```
Each push on the default branch, or any tag starting with `v`, will create a new docker image including the selected support. The default support is `postgres`, but it can be changed:
- by editing the `.gitlab-ci.yml` file, overriding the initial value of `DOCKER_TARGET` from `postgres` to the database you need to support (as sketched below)
- as a one-off change, by manually triggering the CI and overriding the same `DOCKER_TARGET` variable
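A minimal sketch of that edit, assuming `DOCKER_TARGET` is declared in the top-level `variables` section of your `.gitlab-ci.yml` (the rest of the file is omitted):

```yaml
# .gitlab-ci.yml — only the variable override is shown
variables:
  DOCKER_TARGET: oracle # initial value was: postgres
```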
Your repository also contains a README.md file, which lists the operations needed to include drivers for Oracle, MySQL, and MSSQL databases.
### PostgreSQL: add support
No further actions needed.
### Oracle Database: add support
In the template repository, the following actions are needed to enable Oracle Database support:
- edit the value of the `DOCKER_TARGET` variable in `.gitlab-ci.yml` to `oracle`, or alternatively select it on a manual CI run
- embed the Oracle drivers (see embed drivers)
- [optional] add `tnsnames.ora` and/or an Oracle Wallet (see add TNS/wallet).
#### Embed drivers
In the template repository, download the drivers from the official page according to the version you need.
The following assets are needed:
- ODBC driver
- instantclient basiclite
```sh
# on unix
export ORACLE_CLIENT_VERSION="<VERSION>"
unzip -j instantclient-basiclite-linux.x64-${ORACLE_CLIENT_VERSION}dbru.zip "instantclient_*/lib*" -d lib/x86_64-unknown-linux-gnu/oracle
unzip -j instantclient-odbc-linux.x64-${ORACLE_CLIENT_VERSION}dbru.zip "instantclient_*/lib*" -d lib/x86_64-unknown-linux-gnu/oracle
```
Then create a symlink:
```sh
export LIBSQORA_NAME=`ls lib/x86_64-unknown-linux-gnu/oracle | grep -E -i '^libsqora\.so\.'`
ln -s "${LIBSQORA_NAME}" lib/x86_64-unknown-linux-gnu/oracle/libsqora.so
```
#### Add TNS support
Oracle connectivity can be handled with `tnsnames.ora` files and/or wallets. The container expects such assets in the `/home/agent/oracle/admin` folder. The `TNS_ADMIN` environment variable is already set to `/home/agent/oracle/admin`, and the container copies in the content of this repository's `oracle/admin` folder.
Drop your wallet assets or `tnsnames.ora` file in `oracle/admin` to include them.
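For instance, from the root of the template repository (the source paths are placeholders):

```sh
# include wallet assets and/or the tnsnames.ora file in the image
cp /path/to/wallet/* oracle/admin/
cp /path/to/tnsnames.ora oracle/admin/
```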
In case you want to use a wallet, remember to tune the `sqlnet.ora` file, registering the `/home/agent/oracle/admin` path:
```
WALLET_LOCATION = (SOURCE = (METHOD = file) (METHOD_DATA = (DIRECTORY="/home/agent/oracle/admin")))
SSL_SERVER_DN_MATCH=yes
```
### MySQL: add support
In the template repository, the ODBC driver for MySQL must be downloaded from the official website, selecting:
- version: `8.3.0`
- OS: `Linux - Generic`
- OS version: `glibc 2.28 x86_64`
then run:
```sh
export MYSQL_NAME=mysql-connector-odbc-8.3.0-linux-glibc2.28-x86-64bit
tar -xvf ${MYSQL_NAME}.tar.gz ${MYSQL_NAME}/lib/libmyodbc8w.so
mv ${MYSQL_NAME}/lib/* lib/x86_64-unknown-linux-gnu/mysql
```
### MS SQL Server: add support
No further actions besides editing `.gitlab-ci.yml` are required.
## CronJob Configuration
From your project, select CronJobs > Create new CronJob, then insert a unique name for the job, an optional description, and the full name of the docker image created by your CI job, which comes in the form `<URL TO REGISTRY>/<NAME OF PROJECT>/data-catalog-agent:<TAG>`. Also insert a CronJob schedule using the k8s cron syntax.
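For instance, a hypothetical schedule running a scan every day at 4 AM:

```
0 4 * * *
```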
By clicking on create, you then need to configure:
- the config map containing the configuration of the Data Catalog Agent
- the CronJob yaml manifest
Create a new config map, giving it a name of your choice (e.g., `data-catalog-agent-configuration`) and the mount path `/home/agent/.config/dc/agent/`. We can now add it to the CronJob yaml manifest.
The prompted yaml for the CronJob should be edited as follows:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  ...
spec:
  ...
  jobTemplate:
    spec:
      ...
      template:
        ...
        spec:
          ...
          containers:
            - name: ...
              ...
              env:
                - name: TNS_ADMIN # not required if it defaults to /home/agent/oracle/admin or no oracle support is needed
                  value: /path/to/tns/admin/folder
              args: ["scan"] # add args here
              volumeMounts:
                - name: data-catalog-agent-configuration # reference a volume mount
                  mountPath: /home/agent/.config/dc/agent/
          volumes:
            - name: data-catalog-agent-configuration # use the name given to the config map
              configMap:
                name: data-catalog-agent-configuration
```
Now create a new file in the config map called `configuration.json`. This file must be a valid Data Catalog Agent configuration.
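As a preview, a minimal sketch of such a configuration (a single PostgreSQL connection printed to `stdout`; the connection name and credentials are placeholders; see the next section for the full reference):

```json
{
  "connections": {
    "my-postgres-db": {
      "type": "odbc",
      "configuration": {
        "vendor": "postgres",
        "params": {
          "uid": "test_user",
          "pwd": "password",
          "database": "<DATABASE_NAME>"
        }
      }
    }
  },
  "target": {
    "type": "stdout"
  }
}
```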
## Data Catalog Agent Configuration
The configuration is read from a file which must be located at `~/.config/dc/agent/configuration.json`, where `~` is the user's home folder (`$HOME`). This location can be overridden via the environment variable `DC_AGENT_CONFIGURATION_FILEPATH`, which must contain the filepath to a valid configuration file.
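For instance (the path is a placeholder):

```sh
export DC_AGENT_CONFIGURATION_FILEPATH=/path/to/configuration.json
```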
The configuration has the following main sections:
- `connections`
- `target`
```json
{
  "connections": {
    ...
  },
  "target": {
    ...
  }
}
```
### Connections
A list of entries to connect to datasources via supported integration methods. The general recipe is:
```json
{
  "connections": {
    "<CONNECTION_NAME>": {
      "type": "<INTEGRATION_METHOD>",
      "configuration": {
        // depends on the integration method
      }
    }
  }
}
```
Available integration methods are:
- `odbc`, via vendor-specific ODBC drivers
Such drivers often come in the form of a dynamic shared object and must be embedded with the binary and available at runtime (see the template repository README file).
#### PostgreSQL
To configure a PostgreSQL connection use:
```json
{
  "connections": {
    "<CONNECTION_NAME>": { // 👈 pick a name
      "type": "odbc",
      "configuration": {
        "vendor": "postgres",
        "params": {
          "uid": "test_user",
          "pwd": "password",
          "database": "<DATABASE_NAME>"
        }
      }
    }
  }
}
```
Other keys are `host` and `port`, which for a PostgreSQL connection default to `0.0.0.0` and `5432`. Any other configuration parameter can be appended using the key `flags`, which must be a semicolon-separated string.
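For instance, a sketch of `params` overriding the defaults (the host, port, and `sslmode` flag are illustrative values, and the string type of `port` is an assumption):

```json
"params": {
  "uid": "test_user",
  "pwd": "password",
  "database": "<DATABASE_NAME>",
  "host": "my-postgres.example.com", // 👈 overrides 0.0.0.0
  "port": "5433",                    // 👈 overrides 5432
  "flags": "sslmode=require;"        // 👈 semicolon-separated extra parameters
}
```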
#### Oracle
To configure an Oracle connection use:
```json
{
  "connections": {
    "<CONNECTION_NAME>": { // 👈 pick a name
      "type": "odbc",
      "configuration": {
        "vendor": "oracle",
        "params": {
          "uid": "test_user",
          "pwd": "password",
          "dbq": "<SERVICE_NAME>"
        }
      }
    }
  }
}
```
`SERVICE_NAME` is an entry of your `tnsnames.ora` file. Such file MUST be available when running the binary, and its location MUST be exported via the `TNS_ADMIN` environment variable unless it matches the default `/home/agent/oracle/admin`.
In case you don't want to set the environment variable, the content of one entry of the `tnsnames.ora` file can be inlined, such as:
```json
{
  "connections": {
    "<CONNECTION_NAME>": {
      "type": "odbc",
      "configuration": {
        "vendor": "oracle",
        "params": {
          ...,
          "dbq": "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=0.0.0.0)(PORT=5041))(CONN ..."
        }
      }
    }
  }
}
```
Other keys are:
- `version`, which can be neglected UNLESS USING ORACLE 11g, since 11g does not support pagination via `OFFSET ? ROWS FETCH FIRST ? ROWS ONLY`; `version` MUST be explicit if `"11"`
- `flags`, which will be added, as a string, to the connection, like `"flags": "DBA=R;"`; for the available flags check the Oracle documentation: 25.4.1 Format of the Connection String
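For instance, a sketch of the `params` for an Oracle 11g connection carrying both keys (the values are the ones quoted above; placing these keys inside `params`, by analogy with `host` and `port` for other vendors, is an assumption):

```json
"params": {
  "uid": "test_user",
  "pwd": "password",
  "dbq": "<SERVICE_NAME>",
  "version": "11",  // 👈 mandatory on Oracle 11g
  "flags": "DBA=R;" // 👈 extra connection string flags
}
```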
Connection with an Oracle wallet is also supported.
Preparation must be done on the template repository:
- the whole wallet content must be available under the same path as the `tnsnames.ora` file (they usually come together)
- the `sqlnet.ora` file inside the wallet must be updated to reflect the `TNS_ADMIN` path:
```
WALLET_LOCATION = (SOURCE = (METHOD = file) (METHOD_DATA = (DIRECTORY="/path/to/folder")))
SSL_SERVER_DN_MATCH=yes
```
Also, the environment variable must be set:
- the `TNS_ADMIN` environment variable must be set explicitly unless the default `/home/agent/oracle/admin` is used
#### MS SQL Server
To configure a MS SQL Server connection use:
```json
{
  "connections": {
    "<CONNECTION_NAME>": {
      "type": "odbc",
      "configuration": {
        "vendor": "mssql",
        "params": {
          "uid": "test_user",
          "pwd": "password",
          "database": "<DATABASE_NAME>"
        }
      }
    }
  }
}
```
Other keys are `host` and `port`, which for a MS SQL Server connection default to `0.0.0.0` and `1433`. Any extra connection property can be added via the key `flags`, a semicolon-separated string appended, as-is, to the connection. For local development purposes, it is quite useful to add the flag `"TrustServerCertificate=yes"`.
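For instance, a sketch of `params` for a local-development connection (the host and port are illustrative):

```json
"params": {
  "uid": "test_user",
  "pwd": "password",
  "database": "<DATABASE_NAME>",
  "host": "127.0.0.1",
  "port": "1433",
  "flags": "TrustServerCertificate=yes;" // 👈 skip certificate validation, local development only
}
```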
#### MySQL
To configure a MySQL connection use:
```json
{
  "connections": {
    "<CONNECTION_NAME>": {
      "type": "odbc",
      "configuration": {
        "vendor": "mysql",
        "params": {
          "uid": "test_user",
          "pwd": "password",
          "database": "<DATABASE_NAME>"
        }
      }
    }
  }
}
```
Other keys are `host` and `port`, which for a MySQL connection default to `0.0.0.0` and `3306`. Any extra connection property can be added via the key `flags`, a semicolon-separated string appended, as-is, to the connection.
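The shape is the same as for the other vendors; for instance (the host, port, and `CHARSET` flag are only illustrations of the format):

```json
"params": {
  "uid": "test_user",
  "pwd": "password",
  "database": "<DATABASE_NAME>",
  "host": "my-mysql.example.com",
  "port": "3306",
  "flags": "CHARSET=utf8mb4;"
}
```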
### Targets
There are 3 targets available:
- [default] `stdout`
- `file`
- `mia-console`
For each listed connection, after metadata is retrieved, the agent sequentially sends data to the target as:
- `json` for `stdout` and `file`
- `ndjson` for `mia-console`
The final content is an array of models. The model spec is given in the form of a JSON schema.
To explicitly configure the `stdout` target use:
```json
{
  ...,
  "target": {
    "type": "stdout"
  }
}
```
To configure the `file` target use:
```json
{
  ...,
  "target": {
    "type": "file"
  }
}
```
which will save output files in the folder `./output`. To override this, use:
```json
{
  ...,
  "target": {
    "type": "file",
    "dir": "/path/to/dir"
  }
}
```
To configure the `mia-console` target use:
```json
{
  ...,
  "target": {
    "type": "mia-console",
    "baseUrl": "https://my-server-url", // 👈 mia console base url
    "projectId": "1234", // 👈 models are pushed towards the project with this id
    "apiKey": "1234", // 👈 mia console api key, may vary across test/preview/prod environments
    "credentials": {
      // machine to machine credentials
    }
  }
}
```
`credentials` are used to obtain an access token. In order to do that, you must provide:
```json
{
  ...,
  "target": {
    ...,
    "credentials": {
      "clientId": "1234",
      "clientKeyId": "123",
      "privateKey": "tls/key.pem" // 👈 either a file system path or an rsa private key inlined with `\n`
    }
  }
}
```
On type `mia-console`, the auth endpoint can be customized using `oauthTokenEndpoint`. The service endpoint hosted on mia-console can be overridden in 2 ways:
- by customizing the revision, using the field `revision`
- or by customizing the overall url, using the field `dataCatalogEndpoint`
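A sketch combining these customizations (all values are placeholders; per the list above, `revision` and `dataCatalogEndpoint` are alternative override styles):

```json
{
  ...,
  "target": {
    "type": "mia-console",
    "baseUrl": "https://my-server-url",
    "projectId": "1234",
    "apiKey": "1234",
    "oauthTokenEndpoint": "https://my-server-url/custom/token/endpoint", // 👈 custom auth endpoint
    "revision": "my-revision", // 👈 or: "dataCatalogEndpoint": "https://my-custom-url"
    "credentials": {
      ...
    }
  }
}
```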