Database Preparation
Currently only MongoDB is supported as underlying database for storing the application data.
It is highly recommended to employ a dedicated database / namespace to store Data Catalog data. This allows restricting which applications have access to the management collections and permits to keep the necessary collections all under the same location without additional clutter from other projects.
Data Catalog solution relies on a set of collections to carry out most of its tasks, that is storing data assets definitions and their associated metadata. In order to provide a performant and reliable system, the database must be configured accordingly to support application execution.
This means that required collections must be created, necessary index added and additional schema validation set on them, so that only valid data can be read and written to those collections.
To reach this objective in a straightforward manner we provide an administration command-line interface, which takes care of creating all the necessary resources on selected database. This CLI is offered within as a Docker image, which can be run within your namespace every time a new application version is released.
CLI Configuration
Running the CLI inside your project can be done exploiting a K8s cronjob. In this documentation page it is described how it can be configured, while here are highlighted which specific configurations of the K8s manifest:
- <URL_TO_CONTAINER_REGISTRY>→ is the url to your Container Registry of reference;
- data-fabric/fabric-admin→ image is employed, which should be found within your Container Registry;
- suspend→ the cronjob must be released as suspended, since it should be launched manually only once;
- <NAME_CONTAINER_REGISTRY_SECRET>→ name of the secret where are contained the credentials for connecting to your Container Registry;
- MONGODB_ADMIN_URL→ connection string to your MongoDB database.
Please, beware that the credentials assigned to MongoDB connection string must have the admin permission
to modify collections (collMod), so that the CLI can add a schema validation on the collection. This connection string
might be different from the one employed within Data Catalog application, where only reading and writing is sufficient;
Here, is provided an example of cronjob definition.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: fabric-data-catalog-admin
  labels:
    app: fabric-data-catalog-admin
    # ...
  # ...
spec:
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  schedule: '* * * * *'
  # this job MUST be deployed as suspended, since it just run once
  suspend: true
  jobTemplate:
    spec:
      backoffLimit: 1
      template:
        metadata:
          name: fabric-data-catalog-admin
          labels:
            app: fabric-data-catalog-admin
            # ...
          # ...
        spec:
          imagePullSecrets:
            - name: <NAME_CONTAINER_REGISTRY_SECRET>
          containers:
            - name: fabric-data-catalog-admin
              # current latest stable version of the image is 0.5.2
              image: <URL_TO_CONTAINER_REGISTRY>/data-fabric/fabric-admin:0.5.2
              args: [
                "all",
                "init",
                "--url",
                "{{MONGODB_ADMIN_URL}}",
              ]
          restartPolicy: Never
This cronjob can then be manually triggered as described here
every time a new version of data-fabric/fabric-admin image is available, which it should roughly correspond with a new release of
Data Catalog application.
In case in your Console Project it is possible to enable Custom Resources, then instead of
creating a cronjob manifest it would be possible to define a job one, which does start as soon as released and does not need to be
manually launched.
Cronjob is an alternative better suited in case manually launching the job for administering your database is preferred or enforced by any internal policy (e.g. you need to be granted the permission to edit your database).