Alarms & notifications
CORE-M models alarms as first-class, tenant-scoped entities — not just log lines. An alarm has a lifecycle, an owner, a history of comments and state changes, and it drives notifications. This page covers the alarm entity and its lifecycle, then the notification system that routes alarms to people and systems.
Alarms are created by rule chains, device state
changes, protocol adapters, or system monitors, and are stored in Aerospike
namespace rules, set alarms, keyed by {tenant_id}:{alarm_id}.
The alarm entity
An alarm record carries alarm_id, tenant_id, originator_type, originator_id,
type, severity, status, assignee_user_id, start_ts, end_ts,
acknowledged_at, cleared_at, details_json, latest_event_hash, created_at,
updated_at, and version. Secondary indexes support filtering by tenant_id,
originator_id, severity, status, type, and the timestamps — which powers the
alarm-center filters.
When a create-alarm node fires (say device D1 reports temperature 95), an alarm is
created with status="active_unack", the configured severity, originator_id="D1",
and type from the rule. The platform then:
- Publishes alarm.created.T1.{alarm_id}.
- Pushes alarm_update to WebSocket clients subscribed to tenant alarms.
- Updates corem_alarms_total{severity,status}.
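As a rough illustration of the record and naming conventions above, the following sketch models a subset of the alarm fields and derives the storage key and the creation subject. The `Alarm` class, the helper names, and the sample values `"device"` and `"high_temperature"` are hypothetical; it also assumes that the `T1` segment in the documented subject stands for the tenant id.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Alarm:
    # Subset of the documented alarm fields; the remaining ones are omitted here.
    alarm_id: str
    tenant_id: str
    originator_type: str
    originator_id: str
    type: str
    severity: str
    status: str = "active_unack"  # a new alarm starts active and unacknowledged
    start_ts: float = field(default_factory=time.time)
    version: int = 1


def record_key(alarm: Alarm) -> str:
    """Build the documented {tenant_id}:{alarm_id} record key."""
    return f"{alarm.tenant_id}:{alarm.alarm_id}"


def created_subject(alarm: Alarm) -> str:
    """Build the creation subject (assumes the T1 segment is the tenant id)."""
    return f"alarm.created.{alarm.tenant_id}.{alarm.alarm_id}"


alarm = Alarm("A1", "T1", "device", "D1", "high_temperature", "critical")
print(record_key(alarm))       # T1:A1
print(created_subject(alarm))  # alarm.created.T1.A1
```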
Lifecycle
An alarm has two orthogonal dimensions: active vs cleared (the condition) and unacknowledged vs acknowledged (operator attention). The four status values combine them:
| Status | Condition | Operator |
|---|---|---|
| active_unack | Still active | Not yet acknowledged |
| active_ack | Still active | Acknowledged |
| cleared_unack | Cleared | Not yet acknowledged |
| cleared_ack | Cleared | Acknowledged |
Conceptually this is the familiar triggered → acknowledged → cleared flow, with the twist that a cleared alarm can reopen if the condition recurs.
stateDiagram-v2
    [*] --> active_unack: alarm created
    active_unack --> active_ack: acknowledge
    active_unack --> cleared_unack: clear
    active_ack --> cleared_ack: clear
    cleared_unack --> cleared_ack: acknowledge
    cleared_ack --> active_unack: condition recurs (reopen)
    cleared_unack --> active_unack: condition recurs (reopen)
- Acknowledge. A user with the alarms permission acknowledges the alarm, optionally with a comment (“Investigating”). Status moves to active_ack; acknowledged_by and acknowledged_at are set; the comment is appended to history; audit event alarm.acknowledged is written.
- Clear. Clearing with a resolution (“Fan replaced”) moves status to cleared_ack (or to cleared_unack if the alarm was never acknowledged), records cleared_by, cleared_at, and resolution, and publishes alarm.cleared.T1.{alarm_id}.
- Reopen. If the same alarm condition triggers again within the deduplication window, the alarm reopens to active_unack, reopened_count increments, and the previous clear metadata remains in history.
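The lifecycle can be expressed as a small transition table. The sketch below encodes exactly the edges of the state diagram; the action names (`acknowledge`, `clear`, `reopen`), the `next_status` helper, and the `InvalidTransition` exception are illustrative names, not part of the CORE-M API.

```python
# Allowed (status, action) -> next status pairs, mirroring the state diagram.
TRANSITIONS = {
    ("active_unack", "acknowledge"): "active_ack",
    ("active_unack", "clear"): "cleared_unack",
    ("active_ack", "clear"): "cleared_ack",
    ("cleared_unack", "acknowledge"): "cleared_ack",
    ("cleared_ack", "reopen"): "active_unack",
    ("cleared_unack", "reopen"): "active_unack",
}


class InvalidTransition(Exception):
    """Raised when an action is not allowed from the current status."""


def next_status(status: str, action: str) -> str:
    try:
        return TRANSITIONS[(status, action)]
    except KeyError:
        raise InvalidTransition(f"cannot {action} from {status}") from None


# Walk one alarm through acknowledge -> clear -> reopen.
status = "active_unack"
for action in ("acknowledge", "clear", "reopen"):
    status = next_status(status, action)
    print(action, "->", status)
```

Keeping the edges in one table makes it easy to reject actions the diagram does not allow, such as acknowledging an alarm that is already acknowledged.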
Deduplication and severity escalation
Alarms are deduplicated by originator and type. If an alarm is already active for
(tenant, device, type), a repeat trigger updates the existing alarm rather than
spawning a new one: latest_event_hash and updated_at change, repeat_count
increments, and alarm.updated.T1.{alarm_id} is published. This keeps a noisy
sensor from flooding the alarm center with duplicates.
Severity escalates on repeat. If an active warning alarm receives a new trigger
at critical severity, the alarm’s severity becomes critical, escalation_count
increments, and the critical-alarm notification rules are re-evaluated.
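A minimal in-memory sketch of this behaviour follows. The `on_trigger` function and the dictionary-based store are hypothetical, and the severity ranking is an assumption that covers only the two levels named above; event publishing and notification re-evaluation are left out.

```python
# Assumed ordering for the two severities mentioned in the text.
SEVERITY_RANK = {"warning": 1, "critical": 2}


def on_trigger(active, tenant_id, originator_id, alarm_type, severity, event_hash):
    """Create a new alarm or fold a repeat trigger into the existing one."""
    key = (tenant_id, originator_id, alarm_type)
    alarm = active.get(key)
    if alarm is None:
        alarm = {
            "severity": severity,
            "status": "active_unack",
            "latest_event_hash": event_hash,
            "repeat_count": 0,
            "escalation_count": 0,
        }
        active[key] = alarm
        return "created", alarm

    # Repeat trigger: update the existing alarm instead of creating a duplicate.
    alarm["latest_event_hash"] = event_hash
    alarm["repeat_count"] += 1
    if SEVERITY_RANK[severity] > SEVERITY_RANK[alarm["severity"]]:
        alarm["severity"] = severity
        alarm["escalation_count"] += 1
    return "updated", alarm


active = {}
on_trigger(active, "T1", "D1", "high_temperature", "warning", "h1")
outcome, alarm = on_trigger(active, "T1", "D1", "high_temperature", "critical", "h2")
print(outcome, alarm["severity"], alarm["repeat_count"], alarm["escalation_count"])
```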
Comments, assignment, and history
Every lifecycle transition (acknowledge, clear, reopen, comment, assignment) is recorded in the alarm’s history, and notable transitions emit audit events. This gives operators a complete, append-only record of who did what and when.
Concurrent acknowledgement (CAS)
Lifecycle transitions use Aerospike generation checks (compare-and-swap) to
prevent lost updates. If two operators load alarm A1 at version 4 and both try to
acknowledge:
- The first write succeeds.
- The second, still holding stale version 4, is rejected with ABORTED / HTTP 409, and the response returns the current version and status so the UI can refresh.
The same per-record CAS applies to bulk actions: acknowledging 25 alarms at once
applies CAS per alarm and reports failures per alarm_id, so the UI can show partial
success when some alarms changed concurrently.
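The sketch below illustrates the compare-and-swap pattern with an in-memory stand-in rather than the Aerospike client. `InMemoryAlarmStore`, `VersionConflict`, `bulk_acknowledge`, and the shape of the per-alarm result are all hypothetical; for brevity, acknowledgement is modelled only for alarms in active_unack.

```python
class VersionConflict(Exception):
    """The caller's version is stale; carries the current version and status."""

    def __init__(self, current_version, current_status):
        super().__init__("version conflict")
        self.current_version = current_version
        self.current_status = current_status


class InMemoryAlarmStore:
    def __init__(self):
        self._records = {}

    def put(self, alarm_id, status, version):
        self._records[alarm_id] = {"status": status, "version": version}

    def acknowledge(self, alarm_id, expected_version):
        rec = self._records[alarm_id]
        # Compare: reject the write if someone else changed the record first.
        if rec["version"] != expected_version:
            raise VersionConflict(rec["version"], rec["status"])
        # Swap: apply the transition and bump the version.
        rec["status"] = "active_ack"
        rec["version"] += 1
        return rec["version"]


def bulk_acknowledge(store, expected_versions):
    """Apply CAS per alarm and report the outcome per alarm_id."""
    results = {}
    for alarm_id, version in expected_versions.items():
        try:
            results[alarm_id] = {"ok": True, "version": store.acknowledge(alarm_id, version)}
        except VersionConflict as exc:
            results[alarm_id] = {
                "ok": False,
                "http_status": 409,
                "current_version": exc.current_version,
                "current_status": exc.current_status,
            }
    return results


store = InMemoryAlarmStore()
store.put("A1", "active_unack", 4)
store.put("A2", "active_unack", 1)
store.acknowledge("A1", 4)  # the first operator wins; A1 is now at version 5
print(bulk_acknowledge(store, {"A1": 4, "A2": 1}))  # A1 conflicts, A2 succeeds
```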
Notifications
Alarms drive notifications: messages delivered to people or systems. Notification
targets, templates, and rules are tenant-scoped and stored in Aerospike namespace
rules, set notifications, keyed by {tenant_id}:{notification_id}.
Targets
| Target type | Notes |
|---|---|
| email | SMTP delivery |
| webhook | Generic HTTP POST |
| Slack-compatible webhook | Incoming-webhook style |
| SMS provider | Via configured SMS gateway |
| in-app notification | Surfaced inside CORE-M |
| Redpanda event | Published for downstream consumers |
Templates
Templates render the message body with variable substitution from alarm and
device fields. Templates are versioned: editing a template versions the previous
content and writes an audit event with template_id, old_version, new_version,
and actor_user_id.
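To make the substitution and versioning behaviour concrete, here is a small sketch. The placeholder syntax is not specified here, so the example assumes Python's `string.Template` style (`$name`) and hypothetical `alarm_`/`device_` prefixes; `render`, `edit_template`, and the in-memory audit list are illustrative only.

```python
from string import Template


def render(body, alarm, device):
    """Substitute alarm and device fields into the template body."""
    fields = {f"alarm_{k}": v for k, v in alarm.items()}
    fields.update({f"device_{k}": v for k, v in device.items()})
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    return Template(body).safe_substitute(fields)


def edit_template(template, new_body, actor_user_id, audit_log):
    """Keep the previous content as a version and record an audit event."""
    old_version = template["version"]
    template["history"].append({"version": old_version, "body": template["body"]})
    template["body"] = new_body
    template["version"] = old_version + 1
    audit_log.append({
        "template_id": template["template_id"],
        "old_version": old_version,
        "new_version": template["version"],
        "actor_user_id": actor_user_id,
    })


template = {"template_id": "tpl-1", "version": 1, "history": [],
            "body": "Alarm $alarm_type on $device_name"}
audit_log = []
edit_template(template, "Alarm $alarm_type on $device_name is $alarm_severity", "u-42", audit_log)
print(render(template["body"],
             {"type": "high_temperature", "severity": "critical"},
             {"name": "Boiler 3"}))
# Alarm high_temperature on Boiler 3 is critical
```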
A notification rule ties it together: a matcher (e.g. severity="critical" and
status="active_unack"), one or more targets, a template, plus quiet hours,
escalation, suppression, and a deduplication window. When an alarm matches a rule, a
notification record is created with status="scheduled",
notification.scheduled.T1.{notification_id} is published, and the worker renders
the template.
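A rule matcher of the kind described can be sketched as a simple field-equality check. The rule dictionary layout, the `matches` and `schedule` helpers, and the sample ids are hypothetical; the subject again assumes that `T1` stands for the tenant id.

```python
def matches(matcher, alarm):
    """True when every field named in the matcher equals the alarm's value."""
    return all(alarm.get(name) == value for name, value in matcher.items())


def schedule(rule, alarm, notification_id):
    """Return a scheduled notification record, or None if the rule does not match."""
    if not matches(rule["matcher"], alarm):
        return None
    return {
        "notification_id": notification_id,
        "tenant_id": alarm["tenant_id"],
        "alarm_id": alarm["alarm_id"],
        "targets": list(rule["targets"]),
        "template_id": rule["template_id"],
        "status": "scheduled",
        "subject": f"notification.scheduled.{alarm['tenant_id']}.{notification_id}",
    }


rule = {
    "matcher": {"severity": "critical", "status": "active_unack"},
    "targets": ["in-app notification"],
    "template_id": "tpl-1",
}
alarm = {"tenant_id": "T1", "alarm_id": "A1", "severity": "critical", "status": "active_unack"}
print(schedule(rule, alarm, "N1"))
```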
Delivery and retry
flowchart TD
match([Alarm matches rule]) --> sched["status = scheduled"]
sched --> send{Deliver to target}
send -->|accepted| delivered["status = delivered<br/>record delivered_at,<br/>provider_message_id"]
send -->|error| retry{Retries left?}
retry -->|yes| backoff["Exponential backoff,<br/>retry per rule policy"]
backoff --> send
retry -->|no| failed["status = failed<br/>store failure_reason"]
On success the notification becomes delivered, recording delivered_at and the
provider_message_id, and publishing notification.delivered.T1.{notification_id}.
On failure the worker retries with exponential backoff per the rule’s policy; after
max attempts the status becomes failed, failure_reason stores the final provider
error, and corem_notification_delivery_failures_total{target_type} increments.
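The delivery loop can be sketched as follows. This is a simplified illustration, not the worker's actual implementation: `deliver_with_retry`, its parameters, and the doubling schedule starting at `base_delay` are assumptions, and event publishing and metrics are omitted. The `sleep` argument is injectable so the example runs without waiting.

```python
import time


def deliver_with_retry(send, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call send() until it succeeds or the attempts run out."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            provider_message_id = send()
            return {
                "status": "delivered",
                "provider_message_id": provider_message_id,
                "attempts": attempt + 1,
            }
        except Exception as exc:  # any provider error counts as a failed attempt
            last_error = str(exc)
            if attempt + 1 < max_attempts:
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** attempt)
    return {"status": "failed", "failure_reason": last_error, "attempts": max_attempts}


# A fake provider that fails twice, then accepts the message.
calls = {"n": 0}


def flaky_send():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("provider timeout")
    return "msg-123"


delays = []
print(deliver_with_retry(flaky_send, max_attempts=4, sleep=delays.append))
print(delays)  # [1.0, 2.0]
```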
Quiet hours, escalation, and suppression
Notification rules support several controls to keep alerting useful rather than noisy:
- Quiet hours. Defer non-critical notifications during a window (e.g. 22:00–07:00 tenant local time). A warning alarm at 23:00 produces a deferred notification that is held until 07:00, with no provider call made in the meantime.
- Critical bypass. A rule can allow critical alarms to bypass quiet hours: a critical alarm during the quiet window is scheduled immediately.
- Escalation. If an alarm stays active_unack past a threshold (e.g. 15 minutes), the escalation scheduler schedules the next level’s notification (e.g. level 2 → “oncall-manager”) and records escalation_level in the alarm history.
- Suppression and dedup windows. Per-target suppression and deduplication windows keep repeated triggers from generating repeated notifications.
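The quiet-hours and critical-bypass rules above can be sketched with a window check that handles wrap-around past midnight. The function names and parameters are hypothetical, and `now` is assumed to be a naive datetime already expressed in tenant local time.

```python
from datetime import datetime, time, timedelta


def in_quiet_hours(t, start, end):
    """True if t falls in [start, end), including windows that cross midnight."""
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def plan_notification(now, severity, quiet_start=time(22, 0), quiet_end=time(7, 0),
                      critical_bypass=True):
    """Return (status, send_at) for a notification raised at `now`."""
    if severity == "critical" and critical_bypass:
        return "scheduled", now
    if not in_quiet_hours(now.time(), quiet_start, quiet_end):
        return "scheduled", now
    # Hold the notification until the quiet window next ends.
    release = datetime.combine(now.date(), quiet_end)
    if release <= now:
        release += timedelta(days=1)
    return "deferred", release


at_2300 = datetime(2024, 1, 1, 23, 0)
print(plan_notification(at_2300, "warning"))   # deferred until 2024-01-02 07:00
print(plan_notification(at_2300, "critical"))  # scheduled immediately
```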
Preferences and audit
Where tenant policy allows, users can set personal notification preferences, for
example opting out of email while keeping in-app notifications, and the change is
audited (notification.preference.updated). Every notification rule change and
delivery state change produces an audit event, so the full alerting configuration and
its delivery outcomes are traceable.