Skip to main content
Version: 13.x (Current)

Job Runner

Data Catalog Job Runner can trigger the execution of different jobs that can update the state of the Data Catalog solution.

The service keeps an internal queue where different requests sent by the Data Catalog UI are processed and executed, while collecting feedback that can be sent back to clients that need to obtain information about the status of the current jobs that are running.

Available Jobs

Here are listed the main jobs that the service can execute.

Agent Dataset Scan

Agent Dataset Scan is a procedure that queries datasources for the data schemas of the resources they own and stores them into the Data Catalog as documents compliant to the Open Lineage specification. The job provides facilities to:

  1. query tables schemas;
  2. aggregate them in a unique JSON format;
  3. store the resulting datasets into the MongoDb database of the Data Catalog application.

This job can be triggered by configuring a Connection from its section in the Data Catalog UI.

Odbc Connections

In the context of databases, it requires Open Database Connectivity (ODBC) drivers to connect to external DBMS. The service comes already equipped with the following ODBC drivers:

  • ODBC Driver 18 for SQL Server,
  • MySQL ODBC 9.1 Unicode Driver,
  • MySQL ODBC 9.1 ANSI Driver,
  • MySQL ODBC 8.0 Unicode Driver,
  • MySQL ODBC 8.0 ANSI Driver,
  • PostgreSQL ANSI,
  • PostgreSQL Unicode,
  • Oracle 23 ODBC driver,
  • Saphana HDBODBC

Data Catalog Sync

Data Catalog Sync is a procedure that is automatically scheduled after an Agent Dataset Scan job: its duty is to align the datasets retrieved previously by the Agent and convert theminto assets that can be managed by the Data Catalog application.

Connection Cleanup

Connection Cleanup is a set of procedures that are launched automatically when a user deletes a connection from the Data Catalog UI. These jobs will delete all assets related to a connection. In particular:

  • datasetsCleanupJob: will delete Open Lineage datasets linked to a connection;
  • jobsCleanupJob: will delete all Jobs defined for the connection's datasets.

Fast Data Jobs Sync

Fast Data Jobs Sync is a procedure that generates Open Lineage Jobs from existing Fast Data pipelines of the Runtime Management.

This jobs can be linked to existing Mia Platform CRUD connections where the proper namespace details have been defined, so that they can be accessible from the Data Lineage section.

Trigger Job

To launch the Fast Data Jobs Sync procedure you have to manually invoke the gRPC method.

Data Catalog Application is shipped with an already pre-configured grpcui, that is exposed under proper permissions as part of the Secure Access microfrontend.

From there, you can pick the JobRunner service, choose the Run method and then pick the openLineageFastDataJobSync option as request. You need also to specify the name of the producer as a string parameter: this information will be displayed in the Data Catalog UI Job Details section.

Users Management UI

tip

If you've already set up a Data Catalog Application that didn't have a grpcui microservice, you can create one with the following parameters:

  • Docker Image Name: fullstorydev/grpcui:v1.4.1;
  • Command Line Arguments:
    -plaintext
    -base-path=/ui/job-runner
    -bind=0.0.0.0
    -connect-fail-fast=false
    job-runner:50051
  • Container Ports:
    • Port Name: http
    • Port: 80
    • Target Port: 8080
    • Protocol: TCP
  • Probe Types: TCP

Finally, you can add the endpoint /ui/job-runner (with rewrite base path as /ui/job-runner) having the permissions defined in the Secure Access chapter. Remember also to add admin:producers to the set of permissions.

Configuration

Environment Variables

Fabric BFF can be customized using the following environment variables:

NameRequiredDescriptionDefault Value
GRPC_PORT-This variable determines the TCP port where the GRPC controller binds its listener50051
LOG_LEVEL-Specify the centralized application log level, choosing from options such as debug, error, info, trace or warninfo
JOB_RUNNER_FOLDER-Set the location of the configuration files~/.fd/job-runner
FAST_DATA_JOBS_FOLDER-Set the location of Fast Data Jobs Sync configuration file.~/.fd/fast-data-jobs
ODBCINI-Optional full path where a custom .odbc.ini can be used to retrieve user defined data sources~/.fd/job-runner/.odbc.ini

Main Configuration

The main configuration of the service is handled by a config.json, located at the path defined by the JOB_RUNNER_FOLDER.

When instantiating Data Catalog application, Job Runner service configuration is generated with a dedicated Config Map, named job-runner-config. This file contains a template configuration that should help in configuring the service.

Loading ....

In the paragraph below is explained how to configure the persistence layer.

Persistence Layer

info

Currently only MongoDB is supported as persistence layer for storing relevant data.

caution

The MongoDB database selected for storing Data Catalog data must be configured to have replicaSet enabled, since Job Runner exploits features that can be used only when a replicaSet is available.

In order to carry out all its operations, Job Runner requires a persistence layer where relevant information, such as auditing details, are stored. This configuration can be set under the persistence.configuration key of the configuration file. The main properties are:

  • url the connection string to your MongoDB instance;
  • database the database name where to search for the collections relevant to Job Runner service. Please notice that setting this property will override the database name potentially set in the connection string.
tip

The following properties support secrets resolution:

  • persistence.configuration.url
  • persistence.configuration.database

Fast Data Jobs Sync Configuration

Since the Fast Data Job Sync may need to retrieve Control Plane pipelines from a different database, it's also possible to provide an additional persistence configuration to specify the databases to retrieve data. This optional configuration can be read from a config.json file located at the path defined by the FAST_DATA_JOBS_FOLDER environment variable.

Loading ....
tip

The following properties support secrets resolution:

  • controlPlane.configuration.url
  • controlPlane.configuration.database

Secrets Names Configuration

To let the Data Catalog UI be aware of possible secret that can be used within connection definitions, the service can be additionally configured with a secret.json file located at the JOB_RUNNER_FOLDER.

The configuration is a key-value pair JSON representation where:

Loading ....
caution

When the service receives a request to trigger a job, it will first try to resolve the secrets contained in the request by looking at the entries of this file: if no secret can be found, the service will try to read the corresponding secret from the process environment.

ODBC Configuration

An ODBC connection can configured either by the means of Connection Strings or Data Source Names (DSN).

A Connection String is a string of key-value pairs where each of them represents a parameter of your connection.

Data Source Names (DSN) are symbolic names that can be linked to an ODBC connection. Job Runner service can read into a user-defined .odbc.ini file to load at runtime the ODBC connections that can be executed through DSN. The list of available DSN then can be used by the Data Catalog UI while configuring an ODBC connection.

Job Runner reads DSN in its user-defined .odbc.ini located by default in the main configuration folder. To learn more about DSN and their connection parameters, please refer to the official documentation.

Here are listed some example for the supported ODBC drivers:

Oracle

Job Runner needs additional configuration to run jobs towards Oracle databases.

With Oracle databases, connectivity can be handled with tnsnames.ora files and/or wallets.

TS Names

The service expects such assets in the Config Map located at /opt/oracle/instantclient/network/admin/, which must be checked to not overwrite the content of the whole folder.

Wallets

Oracle Wallet contains authentication and signal credentials of a database, which needs to be built with the whole service.

To add an oracle wallet to Job Runner service, you have to __replace the microservice container with a custom container located in your registry. This custom container consists of a Dockerfile, which uses the Job Runner service as base layer and adds on top of it the desired wallets.

Once the custom wallets are copied into the container's /opt/oracle/instantclient/network/admin/, you can tune the sqlora.net file registering the path /opt/oracle/instantclient/network/admin/. This path can be overridden by the means of the environment variable TNS_ADMIN.

Dockerfile
FROM nexus.mia-platform.eu/data-fabric/job-runner:<latest version>
COPY <wallet location> /opt/oracle/instantclient/network/admin/

COPY <<EOF /opt/oracle/instantclient/network/admin/sqlora.net
WALLET_LOCATION = (SOURCE = (METHOD = file) (METHOD_DATA = (DIRECTORY="/opt/oracle/instantclient/network/admin"))) SSL_SERVER_DN_MATCH=yes

EOF

The list of all the available connection parameters can be found in the official Oracle documentation.

driver=Oracle 23 ODBC driver;dbq=<SERVERNAME>;uid=<UID>;pwd=<PWD>;

Please note that dbq field can either be the an entry of your tsnames.ora file or an inline entry such as:

(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=0.0.0.0)(PORT=5041))(CONN ...))

PostgreSQL

The list of all the available connection parameters can be found in the official documentation.

driver={PostgreSQL Unicode};server=<HOSTNAME>;port=5432;uid=<UID>;pwd=<PWD>;database=<DATABASE>;

MySQL

The list of all the available connection parameters can be found in the official documentation.

driver={MySQL ODBC 9.1 Unicode Driver};server=<HOSTNAME>;port=5432;uid=<UID>;pwd=<PWD>;database=<DATABASE>;

MSSQL

You can refer to the official documentation for a list of all the supported parameters.

Driver={ODBC Driver 18 for SQL Server};Server=<HOSTNAME>;Database=<DBNAME>;Uid=<USERNAME>;Pwd=<PWD>;TrustServerCertificate=yes

SAP HANA

You can refer to the official documentation for a list of all the supported parameters.

Driver={Saphana HDBODBC};ServerNode=<HOSTNAME>:<PORT>;DatabaseName=<DBNAME>;UID=<USERNAME>;PWD=<PWD>;

gRPC Services

The request exchanged between Fabric BFF and Job Runner services are performed through gRPC.

Thus, on Job Runner service is necessary to advertise the port where the gRPC controller is exposed, which by default is the 50051. This operation can be achieved by setting the proper port to the list of Container Ports that can be found in the Console Design area, under the specific microservice resource.

tip

When instantiating Data Catalog application, Container Ports are pre-filled with all the needed ports using their default value.
In case the gRPC port chosen through environment variables has been edited, please change the Container Ports accordingly.

Here are listed the main services that are exposed by the service.

ODBC Client

Service responsible to retrieve information about the status of the ODBC driver. It has the following methods:

  • ListDrivers: unary RPC returning the list of drivers configured within the service;
  • ListDataSources: unary RPC returning the list of data sources configured within the service.

Configuration

Service responsible to retrieve information about the main configuration of the service. It has the following methods:

Job Runner

Service responsible to interact with the JobRunner service. It has the following methods:

  • Run: unary RPC that receive the context of a job and executes it, returning the corresponding identifier;
  • Abort: unary RPC that, given a job id, stops the execution of a job currently executed by the service;
  • List: unary RPC returning the list of jobs that are currently executed by the service;
  • History: unary RPC returning the list of jobs that have been executed by the service;
  • Status: server streaming RPC returning feedback messages from the active jobs.

Migration Guides

From Data Catalog Agent 0.6.x or 0.7.x

Job Runner allows to run Data Catalog Agent via scheduling instead of a kubernetes cronjob.

Connections, previously configured on Data Catalog Agent, have a 1-to-1 mapping on Job Runner data catalog agent sync jobs.

The following features are NOT supported anymore:

  • connection via custom Data Catalog Agent configuration (only connection-string and DSN allowed).
  • settings.connectionName parameter.
  • target.

Migrate a custom configuration

A Data Catalog Agent connection is a key/value pair

{
"my-connection": {
"type": "odbc",
"configuration": {
"vendor": "oracle",
...
}
}
}

and the Connections UI maps the key to the connection's name while the rest of the configuration is identical in case of a connection string.

When not using connection strings but Data Catalog Agent custom, per-vendor, configuration, the connection string must be extracted from those parameters.

MS SQL Server

The connection string must be in the form

Driver={};Server={},{};Database={};Uid={};Pwd={};{}

Driver must be one of the supported MS SQL Server drivers. Server is a comma separated string formatted as <IP>,<PORT>. Database is the name of the remote catalog/schema. Uid and Pwd are the user credentials. Any additional flag must be a trailing semi-colon separated list of attributes.

caution

We recommend to store the connection string as a secret in the Job Runner and reference it via UI

The other supported connection is via DSN names

MySQL

The connection string must be in the form

DRIVER={};SERVER={};PORT={};DATABASE={};UID={};PWD={};{}

DRIVER must be one of the supported MySQL drivers as reported by the compatibility matrix. SERVER is the IP or Domain name of the database while PORT the socket port. DATABASE is the name of the remote catalog/schema. UID and PWD are the user credentials. Any additional flag must be a trailing semi-colon separated list of attributes.

caution

We recommend to store the connection string as a secret in the Job Runner and reference it via UI

The other supported connection is via DSN names

Oracle

The connection string must be in the form

DRIVER={};UID={};PWD={};DBQ={};{}

DRIVER must be one of the supported Oracle drivers. UID and PWD are the user credentials. DBQ is the reference of a name specified in the tnsnames.ora file (or inline). Any additional flag must be a trailing semi-colon separated list of attributes.

caution

We recommend to store the connection string as a secret in the Job Runner and reference it via UI

The other supported connection is via DSN names

PostgreSQL

The connection string must be in the form

DRIVER={};SERVER={};PORT={};DATABASE={};UID={};PWD={};{}

DRIVER must be one of the supported PostgreSQL drivers. SERVER is the IP or Domain name of the database while PORT the socket port. DATABASE is the name of the remote catalog/schema. UID and PWD are the user credentials. Any additional flag must be a trailing semi-colon separated list of attributes.

caution

We recommend to store the connection string as a secret in the Job Runner and reference it via UI

The other supported connection is via DSN names

PostgreSQL

The connection string must be in the form

driver={};serverNode={}:{};databaseName={};UID={};PWD={};{}

DRIVER must be one of the supported PostgreSQL drivers. serverNode is in the format <IP>:<PORT>. databaseName is the name of the remote catalog/schema. UID and PWD are the user credentials. Any additional flag must be a trailing semi-colon separated list of attributes.

caution

We recommend to store the connection string as a secret in the Job Runner and reference it via UI

The other supported connection is via DSN names

Migrate settings.connectionName

Use the value directly in the Connections UI connection name.

Targets

Within Data Catalog the mia-open-lineage target is the only one allowed and auto-configured from the Job Runner persistence layer.