
Grafana

Mia-Platform PaaS offers its users a Grafana instance where they can monitor the metrics and logs of their own applications.

To access the Mia-Platform PaaS Grafana instance, click here. Use your Mia-Platform Console credentials to access your Grafana organization. If you cannot access Grafana, request access through our Customer Portal.

Monitoring dashboards

After logging into your Grafana organization, you will have access to a collection of ready-to-use dashboards, enabling you to monitor your applications instantly. Your Mia-Platform PaaS Company is already configured and connected to Grafana: all metrics from your namespaces are automatically collected and available for immediate visualization and analysis.

An overview of Grafana Dashboards

Out-of-the-box dashboards are:

  • Api Gateway APIs
  • Kubernetes Cluster resources
  • Kubernetes Cluster resources pods by namespace
  • Kubernetes Cluster resources workloads by namespace
  • Kubernetes Cluster resources pods
  • Kubernetes Cluster resources workloads
  • Kubernetes Cluster networking
  • Kubernetes Cluster networking pods by namespace
  • Kubernetes Cluster networking workloads by namespace
  • Kubernetes Cluster networking pods
  • Kubernetes Cluster networking workloads
  • Mia-Platform PaaS License Review

Application logs

Grafana's integration with our logging services enables you to access and visualize your application logs. By correlating logs with metrics, you gain deeper insights into application behavior and can efficiently troubleshoot any issues that may arise.

To explore your application logs, navigate to the Explore section within Grafana and select the Loki data source. Then enter your Loki query (e.g. {namespace="mia-platform"}) and click Run query to fetch the relevant log data.
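
For example, here are a few LogQL queries you could start from in Explore; the namespace and label values below are illustrative assumptions, replace them with your own:

    # all logs from a namespace
    {namespace="mia-platform"}

    # only log lines containing the word "error"
    {namespace="mia-platform"} |= "error"

    # logs of a single workload, parsed as JSON (the "app" label value is a hypothetical example)
    {namespace="mia-platform", app="my-service"} | json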

An overview of Grafana Logs

Retention policy

We enforce retention policies within our monitoring and logging stack in order to determine how long data is stored in our system. These policies ensure that you can effectively monitor and troubleshoot your applications while maintaining optimal performance levels.

By default, Mia-Platform PaaS applies the following retention policies:

  • Metrics data is retained for 45 days.
  • Logs data is retained for 15 days.

However, we understand that different customers may have specific requirements. If you need longer retention periods, our service allows for easy customization based on your specific needs.

Alerting

Grafana Alerting allows you to learn about problems in your systems moments after they occur [1].

But how does the Grafana Alerting system work?

An overview of Grafana Alerting

  • Alert Rules: The evaluation criteria that fire an alert instance (a small sketch follows this list). A rule is formed by:
    • one or more queries and expressions
    • a condition
    • the frequency of evaluation
    • (optional) the duration over which the condition is met
  • Labels: Key-value pairs that match an alert rule to its notification policies and silences
  • Notification policies: Define where, when, and how the alerts get routed
  • Contact points: Define how the alerts are delivered to your team
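
To make these pieces concrete, here is a minimal, hypothetical sketch in PromQL: the expression is the query, the comparison against 0 is the condition, and the evaluation frequency and optional 'for' duration are then set on the rule itself (the metric comes from kube-state-metrics; the 15-minute window is an assumption):

    # query/expression: container restarts observed per pod in the last 15 minutes
    # condition: the result must be greater than 0
    sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[15m])) > 0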

The main page of Grafana Alerting

Create An Alert

To create an alert, you need to create the necessary components.

Create an Alert Rule

  • Select 'New Alert Rule' from the side menu (in the 'Alerting' section)
  • Step 0: Select 'Grafana managed alert' or 'Mimir or Loki alert'

If you selected 'Grafana managed alert':

  • Step 1: Use a query or expression to create your rule:
    • If you need to use a query, select the data source;
    • Add queries and expressions;
    • Click 'Run queries' to check the correctness of the query;
  • Step 2: Set an evaluation behavior (the interval and duration of query evaluation);
  • Step 3: Add details:
    • The name, where to store the rule, and annotations;
  • Step 4: Specify custom labels to enable contact points.

If you selected 'Mimir or Loki alert':

  • Step 1: Select a compatible data source (Loki/Prometheus) and write your LogQL/PromQL expression (a minimal sketch follows this list);
  • Step 2: Set an evaluation behavior (the duration for which the query must evaluate to 'true' before the alert fires);
  • Step 3: Add details:
    • The name, where to store the rule, and annotations;
  • Step 4: Specify custom labels to enable contact points.
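
As a minimal sketch of such an expression, a Loki-managed rule could alert on the error rate of a namespace; the namespace, the "error" filter, and the threshold of 10 lines per 5 minutes are assumptions to adapt to your own services:

    # fire when more than 10 log lines containing "error" appear within 5 minutes
    sum by (namespace) (count_over_time({namespace="mia-platform"} |= "error" [5m])) > 10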

Suggested alert rules

We suggest adding the following rules to your infrastructure to gain basic control over failing services.

CrashLoopBackOff pods

To create an alert that detects whether one or more pods in your system are in a CrashLoopBackOff state:

  • Create a new Alert Rule (Grafana managed alert)
  • Insert kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} as Query A
  • Insert an Expression B with:
    • Operation: Classic condition
    • When: last()
    • OF: A
    • IS ABOVE: 0
  • Set alert condition: B-expression
  • Manage the rest of the configuration (details and contact points) as needed. A single-expression variant of this check is sketched below.
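
If you prefer to express the same check as a single Mimir/Prometheus-managed expression, a rough equivalent could look like the following (the grouping labels are a suggestion, not part of the rule described above):

    # any container currently reported as waiting with reason CrashLoopBackOff
    sum by (namespace, pod) (kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}) > 0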

Throttling pods

  • Create a new Alert Rule (Grafana managed alert)
  • Insert container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total as Query A
  • Insert an Expression B with:
    • Operation: Math
    • Expression: $A > 0.8
  • Set alert condition: B-expression
  • Manage the rest of the configuration (details and contact points) as needed. You can use {{ $labels.pod }} to refer to the pod that is being throttled. A rate-based variant of this query is sketched after the note below.
note

The alert fires when more than 80% of the container's CPU periods are throttled (the 0.8 value in Expression B). This threshold can be changed at will.
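
As an alternative sketch (not the rule described above), the throttled fraction can also be computed over a time window with rate(), which smooths out short spikes; the 5-minute window and 0.8 threshold are assumptions:

    # fraction of CPU periods throttled over the last 5 minutes, per container
    sum by (namespace, pod, container) (rate(container_cpu_cfs_throttled_periods_total[5m]))
      / sum by (namespace, pod, container) (rate(container_cpu_cfs_periods_total[5m])) > 0.8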

CPU limits percentage

  • Create a new Alert Rule (Grafana managed alert)
  • Insert sum(rate(container_cpu_usage_seconds_total{image!="", pod !="", container!="POD"}[5m])) by (pod, container) / sum(container_spec_cpu_quota{image!="", container!="POD"}/container_spec_cpu_period{image!="", pod !="", container!="POD"}) by (pod, container) as Query A
  • Insert an Expression B with:
    • Operation: Reduce
    • Function: Mean
    • Input: A
    • Mode: Strict
  • Insert an Expression C with:
    • Operation: Math
    • Expression: $B > 0.8
  • Set alert condition: C-expression
  • Manage the rest of the configuration (details and contact points) as needed. You can use {{ $labels.pod }} and {{ $labels.container }} to refer to the pod and container that are approaching their CPU limit.
note

The alert fires when average CPU usage exceeds 80% of the configured limit (the 0.8 value in Expression C). This threshold can be changed at will.

Manage contact points (ChatOps)

This section describes how to set up and manage the services to which you want to send your alert notifications.

  • In the Alerting section of Grafana, click 'Contact points' and then 'New contact point';
  • Select an alertmanager ('Grafana managed alert' or 'Mimir or Loki alert');
  • Choose a name for the contact point and fill out the mandatory fields of your selected 'Contact point type'.

Example: set up Google chat contact point

To set up a connection with Google Chat, in the 'New contact point' menu:

  • Select Google Hangouts Chat
  • Generate a webhook URL and fill out the corresponding field:
    • Go to the Google Chat space where you want to receive the alerts.
    • Click on the downward triangle next to the chat name and select 'Manage Webhooks'.
    • Add another webhook, copy the URL provided by Google Chat, and paste it into the Grafana field.

Default alerts

Mia-Platform PaaS Monitoring defines a set of alerting rules by default. Currently, you cannot add custom alerting rules.

These rules refer to AlertManager, Kafka, the Kubernetes cluster, and Prometheus. Each rule has one of the following three severity levels:

  1. Info: An informative message. No action is required;
  2. Warning: Action must be taken to prevent a severe error from occurring in the near future;
  3. Critical: A severe error that might cause the loss or corruption of unsaved data. Immediate action must be taken to prevent losing data.

Below is a list of all the alerts you may receive. For each alert, you will find a brief description and its severity level.

AlertManager

Alert | Description | Severity
AlertmanagerClusterCrashlooping | Half or more of the Alertmanager instances within the same cluster are crashlooping. Alerts could be notified multiple times, unless pods are crashing too fast and no alerts can be sent. | critical
AlertmanagerClusterDown | Half or more of the Alertmanager instances within the same cluster are down. | critical
AlertmanagerClusterFailedToSendAlerts | All Alertmanager instances in a cluster failed to send notifications to a critical integration. | warning
AlertmanagerConfigInconsistent | The configuration of the instances of the Alertmanager cluster namespace/services are out of sync. | critical
AlertmanagerFailedReload | Reloading Alertmanager's configuration has failed for namespace/pods. | critical
AlertmanagerFailedToSendAlerts | At least one instance is unable to route alerts to the corresponding integration. | warning
AlertmanagerMembersInconsistent | Alertmanager has not found all other members of the cluster. | critical

Kafka

Alert | Description | Severity
KafkaNoActiveController | There is no active Kafka controller. | critical
KafkaOfflinePartitions | Kafka partitions are offline. | critical
KafkaPartitionOver80 | Kafka partition disk usage is over 80%. | critical
KafkaUnderReplicatedPartitions | Kafka partitions are under-replicated. | critical

Kubernetes Cluster

Alert | Description | Severity
ConfigReloaderSidecarErrors | Errors encountered in a pod while the config-reloader sidecar attempts to sync config in namespace. As a result, configuration for the service running in the pod may be stale and cannot be updated anymore. | warning
CPUThrottlingHigh | Kubernetes container processes experience elevated CPU throttling. | info
KubeClientErrors | Kubernetes API server client is experiencing errors. | warning
KubeContainerWaiting | Pod container is waiting longer than 1 hour. | warning
KubeCPUOvercommit | Cluster has overcommitted CPU resource requests. | warning
KubeCPUQuotaOvercommit | Cluster has overcommitted CPU resource requests. | warning
KubeDaemonSetMisScheduled | DaemonSet pods are not properly scheduled. | warning
KubeDaemonSetNotScheduled | DaemonSet pods are not scheduled. | warning
KubeDaemonSetRolloutStuck | DaemonSet rollout is stuck. | warning
KubeDeploymentGenerationMismatch | Deployment generation mismatch due to possible roll-back. | warning
KubeDeploymentReplicasMismatch | Deployment has not matched the expected number of replicas. | warning
KubeHpaMaxedOut | HPA is running at max replicas. | warning
KubeHpaReplicasMismatch | HPA has not matched the desired number of replicas. | warning
KubeJobCompletion | Job did not complete in time. | warning
KubeJobFailed | Job failed to complete. | warning
KubeletClientCertificateExpiration | Kubelet client certificate is about to expire. | warning
KubeletClientCertificateRenewalErrors | Kubelet has failed to renew its client certificate. | warning
KubeletDown | Target disappeared from Prometheus target discovery. | critical
KubeletPlegDurationHigh | Kubelet Pod Lifecycle Event Generator is taking too long to relist. | warning
KubeletPodStartUpLatencyHigh | Kubelet Pod startup latency is too high. | warning
KubeletServerCertificateExpiration | Kubelet server certificate is about to expire. | warning
KubeletTooManyPods | The alert fires when a specific node is running >95% of its capacity of pods. | info
KubeletServerCertificateRenewalErrors | Kubelet has failed to renew its server certificate. | warning
KubeMemoryOvercommit | Cluster has overcommitted memory resource requests. | warning
KubeMemoryQuotaOvercommit | Cluster has overcommitted memory resource requests. | warning
KubeNodeNotReady | Node is not ready. | warning
KubeNodeReadinessFlapping | Node readiness status is flapping. | warning
KubeNodeUnreachable | Node is unreachable. | warning
KubePersistentVolumeErrors | PersistentVolume is having issues with provisioning. | critical
KubePersistentVolumeFillingUp | PersistentVolume is filling up. | warning
KubePodCrashLooping | Pod is crash looping. | warning
KubePodNotReady | Pod has been in a non-ready state for more than 15 minutes. | warning
KubeQuotaExceeded | Namespace quota has exceeded the limits. | warning
KubeQuotaAlmostFull | Namespace quota is going to be full. | info
KubeQuotaFullyUsed | Namespace quota is fully used. | info
KubeStatefulSetGenerationMismatch | StatefulSet generation mismatch due to possible roll-back. | warning
KubeStatefulSetReplicasMismatch | StatefulSet has not matched the expected number of replicas. | warning
KubeStatefulSetUpdateNotRolledOut | StatefulSet update has not been rolled out. | warning
KubeStateMetricsListErrors | kube-state-metrics is experiencing errors in list operations. | critical
KubeStateMetricsShardingMismatch | kube-state-metrics pods are running with different --total-shards configuration; some Kubernetes objects may be exposed multiple times or not exposed at all. | critical
KubeStateMetricsShardsMissing | kube-state-metrics shards are missing; some Kubernetes objects are not being exposed. | critical
KubeStateMetricsWatchErrors | kube-state-metrics is experiencing errors in watch operations. | critical
KubeVersionMismatch | Different semantic versions of Kubernetes components running. | warning
NodeClockNotSynchronising | Clock not synchronizing. | warning
NodeClockSkewDetected | Clock skew detected. | warning
NodeFileDescriptorLimit | Kernel is predicted to exhaust file descriptors limit soon. | warning
NodeFilesystemAlmostOutOfFiles | Filesystem has less than 5% inodes left. | warning
NodeFilesystemAlmostOutOfSpace | Filesystem has less than 5% space left. | warning
NodeFilesystemFilesFillingUp | Filesystem is predicted to run out of inodes within the next 24 hours. | warning
NodeFilesystemSpaceFillingUp | Filesystem is predicted to run out of space within the next 24 hours. | warning
NodeHighNumberConntrackEntriesUsed | Number of conntrack entries is getting close to the limit. | warning
NodeNetworkInterfaceFlapping | Network interface device changing its up status often on node-exporter namespace/pods. | warning
NodeNetworkReceiveErrs | Network interface is reporting many receive errors. | warning
NodeNetworkTransmitErrs | Network interface is reporting many transmit errors. | warning
NodeRAIDDegraded | RAID array is degraded. | critical
NodeRAIDDiskFailure | Failed device in RAID array. | warning
NodeTextFileCollectorScrapeError | Node Exporter text file collector failed to scrape. | warning
PodCrashOOM | Pod is crashing due to OOM. | critical

Prometheus

Alert | Description | Severity
PrometheusBadConfig | Failed Prometheus configuration reload. | critical
PrometheusDuplicateTimestamps | Prometheus is dropping samples with duplicate timestamps. | warning
PrometheusErrorSendingAlertsToAnyAlertmanager | Prometheus encounters more than 3% errors sending alerts to any Alertmanager. | critical
PrometheusErrorSendingAlertsToSomeAlertmanagers | Prometheus has encountered more than 1% errors sending alerts to a specific Alertmanager. | warning
PrometheusMissingRuleEvaluations | Prometheus is missing rule evaluations due to slow rule group evaluation. | warning
PrometheusNotConnectedToAlertmanagers | Prometheus is not connected to any Alertmanagers. | warning
PrometheusNotificationQueueRunningFull | Prometheus alert notification queue predicted to run full in less than 30m. | warning
PrometheusNotIngestingSamples | Prometheus is not ingesting samples. | warning
PrometheusOperatorListErrors | Errors while performing list operations in controller. | warning
PrometheusOperatorNodeLookupErrors | Errors while reconciling Prometheus. | warning
PrometheusOperatorNotReady | Prometheus operator is not ready. | warning
PrometheusOperatorReconcileErrors | Errors while reconciling controller. | warning
PrometheusOperatorRejectedResources | Resources rejected by Prometheus operator. | warning
PrometheusOperatorSyncFailed | Last controller reconciliation failed. | warning
PrometheusOperatorWatchErrors | Errors while performing watch operations in controller. | warning
PrometheusOutOfOrderTimestamps | Prometheus drops samples with out-of-order timestamps. | warning
PrometheusRemoteStorageFailures | Prometheus fails to send samples to remote storage. | warning
PrometheusRemoteWriteBehind | Prometheus remote write is behind. | critical
PrometheusRemoteWriteDesiredShards | Prometheus remote write desired shards calculation wants to run more than configured max shards. | warning
PrometheusRuleFailures | Prometheus is failing rule evaluations. | critical
PrometheusTargetLimitHit | Prometheus has dropped targets because some scrape configs have exceeded the targets limit. | warning
PrometheusTargetSyncFailure | This alert is triggered when at least one of the Prometheus instances has consistently failed to sync its configuration. | warning
PrometheusTSDBCompactionsFailing | Prometheus has issues compacting blocks. | warning
PrometheusTSDBReloadsFailing | Prometheus has issues reloading blocks from disk. | warning
TargetDown | Percentage of jobs and services down in the target namespace above some threshold. | warning