Version: 10.x (Current)

Alerting rules

Mia-Platform PaaS Monitoring defines a set of alerting rules by default. Currently, you cannot add custom alerting rules.

These rules cover AlertManager, Kafka, the Kubernetes cluster, and Prometheus. Each rule has one of the following three severity levels:

  1. Info: An informative message. No action is required;
  2. Warning: Action must be taken to prevent a severe error from occurring in the near future;
  3. Critical: A severe error that might cause the loss or corruption of unsaved data. Immediate action must be taken to prevent losing data.
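As a rough illustration of how the three severity levels might drive triage, the sketch below groups firing alerts by severity and orders them most urgent first. It assumes alerts arrive as a payload shaped like Alertmanager's webhook format (an `alerts` list whose items carry `status` and `labels.alertname` / `labels.severity`); whether your notifications are delivered this way is an assumption, and the function names and the example payload are hypothetical.

```python
from collections import defaultdict

# Severity levels described above, most urgent first.
SEVERITY_ORDER = ["critical", "warning", "info"]


def group_by_severity(payload: dict) -> dict:
    """Group the names of firing alerts by their severity label."""
    grouped = defaultdict(list)
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts need no action
        labels = alert.get("labels", {})
        severity = labels.get("severity", "unknown")
        grouped[severity].append(labels.get("alertname", "<unnamed>"))
    return dict(grouped)


def triage_order(grouped: dict) -> list:
    """Return (severity, alert names) pairs, most urgent severity first."""
    known = [(s, sorted(grouped[s])) for s in SEVERITY_ORDER if s in grouped]
    # Severities outside the three documented levels go last.
    other = [
        (s, sorted(names))
        for s, names in sorted(grouped.items())
        if s not in SEVERITY_ORDER
    ]
    return known + other


# Hypothetical payload, hand-written for illustration only.
EXAMPLE_PAYLOAD = {
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "KubePodNotReady", "severity": "warning"}},
        {"status": "firing",
         "labels": {"alertname": "KafkaOfflinePartitions", "severity": "critical"}},
        {"status": "resolved",
         "labels": {"alertname": "CPUThrottlingHigh", "severity": "info"}},
        {"status": "firing",
         "labels": {"alertname": "KubeQuotaFullyUsed", "severity": "info"}},
    ]
}

if __name__ == "__main__":
    for severity, names in triage_order(group_by_severity(EXAMPLE_PAYLOAD)):
        print(f"{severity}: {', '.join(names)}")
```

Keeping the severity order in a single list means the triage order can be adjusted in one place if your team treats the levels differently.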

The sections below list every alert you may receive, each with a brief description and its severity level.

AlertManager

| Alert | Description | Severity |
| --- | --- | --- |
| AlertmanagerClusterCrashlooping | Half or more of the Alertmanager instances within the same cluster are crashlooping. Alerts could be notified multiple times unless pods are crashing too fast and no alerts can be sent. | critical |
| AlertmanagerClusterDown | Half or more of the Alertmanager instances within the same cluster are down. | critical |
| AlertmanagerClusterFailedToSendAlerts | All Alertmanager instances in a cluster failed to send notifications to a critical integration. | warning |
| AlertmanagerConfigInconsistent | The configuration of the instances of the Alertmanager cluster namespace/services are out of sync. | critical |
| AlertmanagerFailedReload | Reloading Alertmanager's configuration has failed for namespace/pods. | critical |
| AlertmanagerFailedToSendAlerts | At least one instance is unable to route alerts to the corresponding integration. | warning |
| AlertmanagerMembersInconsistent | Alertmanager has not found all other members of the cluster. | critical |

Kafka

| Alert | Description | Severity |
| --- | --- | --- |
| KafkaNoActiveController | There is no active Kafka controller. | critical |
| KafkaOfflinePartitions | Kafka partitions are offline. | critical |
| KafkaPartitionOver80 | Kafka partition disk usage is over 80%. | critical |
| KafkaUnderReplicatedPartitions | Kafka partitions are under-replicated. | critical |

Kubernetes Cluster

| Alert | Description | Severity |
| --- | --- | --- |
| ConfigReloaderSidecarErrors | Errors encountered in a pod while the config-reloader sidecar attempts to sync config in namespace. As a result, the configuration for the service running in the pod may be stale and cannot be updated anymore. | warning |
| CPUThrottlingHigh | Processes in Kubernetes containers experience elevated CPU throttling. | info |
| KubeClientErrors | Kubernetes API server client is experiencing errors. | warning |
| KubeContainerWaiting | Pod container has been waiting for longer than 1 hour. | warning |
| KubeCPUOvercommit | Cluster has overcommitted CPU resource requests. | warning |
| KubeCPUQuotaOvercommit | Cluster has overcommitted CPU resource requests. | warning |
| KubeDaemonSetMisScheduled | DaemonSet pods are not properly scheduled. | warning |
| KubeDaemonSetNotScheduled | DaemonSet pods are not scheduled. | warning |
| KubeDaemonSetRolloutStuck | DaemonSet rollout is stuck. | warning |
| KubeDeploymentGenerationMismatch | Deployment generation mismatch due to possible roll-back. | warning |
| KubeDeploymentReplicasMismatch | Deployment has not matched the expected number of replicas. | warning |
| KubeHpaMaxedOut | HPA is running at max replicas. | warning |
| KubeHpaReplicasMismatch | HPA has not matched the desired number of replicas. | warning |
| KubeJobCompletion | Job did not complete in time. | warning |
| KubeJobFailed | Job failed to complete. | warning |
| KubeletClientCertificateExpiration | Kubelet client certificate is about to expire. | warning |
| KubeletClientCertificateRenewalErrors | Kubelet has failed to renew its client certificate. | warning |
| KubeletDown | Target disappeared from Prometheus target discovery. | critical |
| KubeletPlegDurationHigh | Kubelet Pod Lifecycle Event Generator is taking too long to relist. | warning |
| KubeletPodStartUpLatencyHigh | Kubelet Pod startup latency is too high. | warning |
| KubeletServerCertificateExpiration | Kubelet server certificate is about to expire. | warning |
| KubeletTooManyPods | Fires when a node is running at more than 95% of its pod capacity. | info |
| KubeletServerCertificateRenewalErrors | Kubelet has failed to renew its server certificate. | warning |
| KubeMemoryOvercommit | Cluster has overcommitted memory resource requests. | warning |
| KubeMemoryQuotaOvercommit | Cluster has overcommitted memory resource requests. | warning |
| KubeNodeNotReady | Node is not ready. | warning |
| KubeNodeReadinessFlapping | Node readiness status is flapping. | warning |
| KubeNodeUnreachable | Node is unreachable. | warning |
| KubePersistentVolumeErrors | PersistentVolume is having issues with provisioning. | critical |
| KubePersistentVolumeFillingUp | PersistentVolume is filling up. | warning |
| KubePodCrashLooping | Pod is crash looping. | warning |
| KubePodNotReady | Pod has been in a non-ready state for more than 15 minutes. | warning |
| KubeQuotaExceeded | Namespace quota has exceeded the limits. | warning |
| KubeQuotaAlmostFull | Namespace quota is going to be full. | info |
| KubeQuotaFullyUsed | Namespace quota is fully used. | info |
| KubeStatefulSetGenerationMismatch | StatefulSet generation mismatch due to possible roll-back. | warning |
| KubeStatefulSetReplicasMismatch | StatefulSet has not matched the expected number of replicas. | warning |
| KubeStatefulSetUpdateNotRolledOut | StatefulSet update has not been rolled out. | warning |
| KubeStateMetricsListErrors | kube-state-metrics is experiencing errors in list operations. | critical |
| KubeStateMetricsShardingMismatch | kube-state-metrics pods are running with different --total-shards configuration; some Kubernetes objects may be exposed multiple times or not exposed at all. | critical |
| KubeStateMetricsShardsMissing | kube-state-metrics shards are missing; some Kubernetes objects are not being exposed. | critical |
| KubeStateMetricsWatchErrors | kube-state-metrics is experiencing errors in watch operations. | critical |
| KubeVersionMismatch | Different semantic versions of Kubernetes components are running. | warning |
| NodeClockNotSynchronising | Clock is not synchronizing. | warning |
| NodeClockSkewDetected | Clock skew detected. | warning |
| NodeFileDescriptorLimit | Kernel is predicted to exhaust its file descriptor limit soon. | warning |
| NodeFilesystemAlmostOutOfFiles | Filesystem has less than 5% inodes left. | warning |
| NodeFilesystemAlmostOutOfSpace | Filesystem has less than 5% space left. | warning |
| NodeFilesystemFilesFillingUp | Filesystem is predicted to run out of inodes within the next 24 hours. | warning |
| NodeFilesystemSpaceFillingUp | Filesystem is predicted to run out of space within the next 24 hours. | warning |
| NodeHighNumberConntrackEntriesUsed | Number of conntrack entries is getting close to the limit. | warning |
| NodeNetworkInterfaceFlapping | Network interface device is changing its up status often on node-exporter namespace/pods. | warning |
| NodeNetworkReceiveErrs | Network interface is reporting many receive errors. | warning |
| NodeNetworkTransmitErrs | Network interface is reporting many transmit errors. | warning |
| NodeRAIDDegraded | RAID array is degraded. | critical |
| NodeRAIDDiskFailure | Failed device in RAID array. | warning |
| NodeTextFileCollectorScrapeError | Node Exporter text file collector failed to scrape. | warning |
| PodCrashOOM | Pod is crashing because it ran out of memory (OOM). | critical |

Prometheus

| Alert | Description | Severity |
| --- | --- | --- |
| PrometheusBadConfig | Failed Prometheus configuration reload. | critical |
| PrometheusDuplicateTimestamps | Prometheus is dropping samples with duplicate timestamps. | warning |
| PrometheusErrorSendingAlertsToAnyAlertmanager | Prometheus encounters more than 3% errors sending alerts to any Alertmanager. | critical |
| PrometheusErrorSendingAlertsToSomeAlertmanagers | Prometheus has encountered more than 1% errors sending alerts to a specific Alertmanager. | warning |
| PrometheusMissingRuleEvaluations | Prometheus is missing rule evaluations due to slow rule group evaluation. | warning |
| PrometheusNotConnectedToAlertmanagers | Prometheus is not connected to any Alertmanagers. | warning |
| PrometheusNotificationQueueRunningFull | Prometheus alert notification queue predicted to run full in less than 30m. | warning |
| PrometheusNotIngestingSamples | Prometheus is not ingesting samples. | warning |
| PrometheusOperatorListErrors | Errors while performing list operations in controller. | warning |
| PrometheusOperatorNodeLookupErrors | Errors while reconciling Prometheus. | warning |
| PrometheusOperatorNotReady | Prometheus operator is not ready. | warning |
| PrometheusOperatorReconcileErrors | Errors while reconciling controller. | warning |
| PrometheusOperatorRejectedResources | Resources rejected by Prometheus operator. | warning |
| PrometheusOperatorSyncFailed | Last controller reconciliation failed. | warning |
| PrometheusOperatorWatchErrors | Errors while performing watch operations in controller. | warning |
| PrometheusOutOfOrderTimestamps | Prometheus drops samples with out-of-order timestamps. | warning |
| PrometheusRemoteStorageFailures | Prometheus fails to send samples to remote storage. | warning |
| PrometheusRemoteWriteBehind | Prometheus remote write is behind. | critical |
| PrometheusRemoteWriteDesiredShards | Prometheus remote write desired shards calculation wants to run more than configured max shards. | warning |
| PrometheusRuleFailures | Prometheus is failing rule evaluations. | critical |
| PrometheusTargetLimitHit | Prometheus has dropped targets because some scrape configs have exceeded the targets limit. | warning |
| PrometheusTargetSyncFailure | This alert is triggered when at least one of the Prometheus instances has consistently failed to sync its configuration. | warning |
| PrometheusTSDBCompactionsFailing | Prometheus has issues compacting blocks. | warning |
| PrometheusTSDBReloadsFailing | Prometheus has issues reloading blocks from disk. | warning |
| TargetDown | Percentage of jobs and services down in the target namespace above some threshold. | warning |