Metric Alerts

The Alerts feature will send a notification when heartbeat events satisfy a given threshold for a specific set of filters. To find the feature, click the Alerts item in the sidebar.

Fleet-threshold Alerts

Fleet-threshold Alerts operate on aggregate Metrics conditions. For example, you could configure a Fleet-threshold Alert that triggers when your production Cohort surpasses a mean count of 5 connection failures in a time window of one hour.

Only Metrics marked as Timeseries can be used in the condition for Fleet-threshold Alerts.

Fleet-threshold Alerts are evaluated every hour. You will receive a notification if the aggregate of the data points from the preceding hour meets the condition.
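As a rough illustration of the hourly evaluation, the sketch below averages the data points captured in the preceding hour and compares the result to a threshold. The function name, the input shape, and the choice to average all points across devices are assumptions made for this example; they do not describe how Memfault computes aggregates internally.

```python
from datetime import datetime, timedelta
from statistics import mean


def fleet_alert_triggers(points, now, threshold, window=timedelta(hours=1)):
    """Return True if the mean of recent data points exceeds the threshold.

    points: list of (captured_at, value) tuples for one Cohort (hypothetical shape).
    now: the time at which the hourly evaluation runs.
    """
    # Keep only data points captured within the preceding window.
    recent = [value for captured_at, value in points if now - window <= captured_at < now]
    if not recent:
        return False  # assumption: no data in the window means no trigger
    return mean(recent) > threshold


now = datetime(2024, 1, 1, 12, 0)
points = [
    (datetime(2024, 1, 1, 11, 10), 4),
    (datetime(2024, 1, 1, 11, 40), 8),
    (datetime(2024, 1, 1, 10, 30), 100),  # older than one hour, ignored
]
print(fleet_alert_triggers(points, now, threshold=5))  # True (mean of 4 and 8 is 6)
```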

Device-threshold Alerts

Device-threshold Alerts trigger when any Device reports a certain value or range of values for a Metric.

Any type of Metric can be used in the condition for Device-threshold Alerts.

Whenever the Memfault cloud receives a heartbeat event with metrics from a device, it will evaluate the applicable Device-threshold Alerts and create Incidents, if triggered. Depending on the configuration of the Alert, notifications can be sent immediately as an Incident starts, or as a scheduled summary of new Incidents.

Incident creation and resolution

The creation and resolution of an incident is controlled by the creation (incident start) and resolution (incident end) delays in the alert configuration. You can use these values to reduce incident noise.

  • The incident start delay is the time that must pass before an incident is declared after the metric condition is met.

    For example, if a value sometimes briefly goes above the threshold but returns to normal within an hour, you can set the start delay to 1 hour 30 minutes to avoid creating an incident for a brief spike.

    An incident will only be created if either another matching value comes in after the start delay has passed, or if no data is received for at least the start delay period.

    When an incident is created, it will appear as ALERTING in the incident list.

  • The incident end delay is the time that must pass before an incident is resolved after the metric condition is no longer met.

    For example, if your alert frequently gets resolved and then starts firing again shortly after, you can set an end delay. The incident then stays in the ALERTING state until a new below-threshold value comes in after the end delay has passed.

    Unlike in the case of the start delay, if a device reports no new data, the incident will stay in the ALERTING state until a new value comes in.

    Once the incident is resolved, it will appear as RESOLVED in the incident list.

info

For alerts that were created before the availability of the delay feature, you should see "No delay" for the start delay, and an end delay of 1 day. This is to match previous behaviour where all alerts had a maximum of 1 incident per 24 hour period. Please set this to a reasonable value for your specific alerts.

Incident start and end delays illustrated

Basic example

Here's an example of an alert with a start delay of 15 minutes and an end delay of 30 minutes. The metric threshold is set to battery_discharge > 50.

Graph of a metric with the incident start and end illustrated
  • At (a) 0:30 h, a value above the threshold is captured, but the incident is not created yet. A new value below the threshold is captured at 0:45 h, so this momentary spike is ignored.

  • At (b) 2:15 h, another value above the threshold is captured. Another above-threshold value comes in at 2:30 h, which is when the incident is created. This is because the start delay of 15 minutes has passed.

    If you enable "notify: when an incident starts", the configured targets will be notified when this value is received by Memfault.

  • At (c) 3:00 h, a value below the threshold is captured. The incident is not resolved yet, because the end delay of 30 minutes has not passed, and a new value above the threshold is captured shortly after, keeping the incident in the ALERTING state.

  • At (d) 3:30 h, another value below the threshold is captured. The incident is resolved when an additional below-threshold value is captured at 4:00 h, because the end delay of 30 minutes has passed.

    If you enable "notify: when an incident is resolved", the configured targets will be notified when this value is received by Memfault.
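The walkthrough above can be replayed with a small state machine. This is a minimal sketch under stated assumptions, not Memfault's implementation: it assumes each delay is measured from the first value that flips the condition, times are minutes since the start of the graph, and the metric values and the 3:15 h sample ("shortly after" 3:00 h) are made up for illustration. It does not model the case where no data is received.

```python
def evaluate(samples, threshold, start_delay, end_delay):
    """Replay samples and return a list of (minute, new_state) transitions.

    samples: list of (capture_minute, value), sorted by capture time.
    start_delay / end_delay: delays in minutes.
    """
    events = []
    alerting = False
    breach_since = None  # first above-threshold value of the current run
    clear_since = None   # first below-threshold value while alerting
    for minute, value in samples:
        breached = value > threshold
        if not alerting:
            if breached:
                if breach_since is None:
                    breach_since = minute
                if minute - breach_since >= start_delay:
                    alerting = True
                    clear_since = None
                    events.append((minute, "ALERTING"))
            else:
                breach_since = None  # momentary spike, ignored
        else:
            if breached:
                clear_since = None  # condition met again, stay ALERTING
            else:
                if clear_since is None:
                    clear_since = minute
                if minute - clear_since >= end_delay:
                    alerting = False
                    breach_since = None
                    events.append((minute, "RESOLVED"))
    return events


# battery_discharge samples, (minutes, value); values are illustrative only.
samples = [
    (30, 60),   # (a) spike
    (45, 40),   # back to normal, spike ignored
    (135, 55),  # (b) above threshold
    (150, 58),  # still above after 15 minutes: incident created
    (180, 45),  # (c) below threshold
    (195, 52),  # above again: incident stays ALERTING
    (210, 48),  # (d) below threshold
    (240, 42),  # still below after 30 minutes: incident resolved
]
print(evaluate(samples, threshold=50, start_delay=15, end_delay=30))
# [(150, 'ALERTING'), (240, 'RESOLVED')]
```

With both delays set to 0, the same function creates an incident on the first above-threshold value and resolves it on the first below-threshold value.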

When no data is received

In this next example, we illustrate what happens when no data is received by Memfault.

Graph of an incomplete data capture with the incident start and end illustrated
  • This is similar to the previous example at first, but after the event at (b) 2:15 h is captured, the device stops sending data for a while (for example, because it was shut down for a while).

  • In this case, an incident is automatically created after the start delay has passed since the receipt of the last value, i.e. 15 minutes for this alert (the difference between capture and receive time is explained below).

  • Note that the incident is not automatically resolved after the end delay of 30 minutes has passed. This is because no new data has been received confirming that the device was behaving as expected.

  • When the device is seen again at (c) 4:15 h with a value below the threshold, the incident is resolved as it has been more than 30 minutes since the (b) 2:15 h value.
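The asymmetry between the two delays when a device goes silent can be summarized in a few lines. This is only a sketch of the behavior described above: the "PENDING" label (threshold crossed, start delay not yet elapsed) and the function name are hypothetical and do not appear in the Memfault UI or API.

```python
from datetime import timedelta

START_DELAY = timedelta(minutes=15)  # start delay used in this example


def status_while_silent(status, silent_for):
    """Incident status after the device has sent no data for `silent_for`.

    status: "PENDING" (hypothetical label), "ALERTING", or "RESOLVED".
    """
    if status == "PENDING" and silent_for >= START_DELAY:
        # The start delay can elapse without new data: the incident is created.
        return "ALERTING"
    # Silence alone never resolves an incident; a new value is required.
    return status


print(status_while_silent("PENDING", timedelta(minutes=15)))  # ALERTING
print(status_while_silent("ALERTING", timedelta(hours=2)))    # ALERTING
```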

Capture vs. Receive Time

Your devices may report data in realtime or near-realtime, in batches, or with significant latency. To handle a diverse set of device heartbeat intervals, incident start and end delays are calculated based on the capture timestamp and not the received timestamp, except in the following cases:

  • If no data is received after a value that crosses the threshold (like in the example above), the start delay is evaluated based on the last received timestamp, and an incident is automatically created once it has passed.

  • If devices send timestamps in the future, they are treated as if they were captured at the time of receipt.

  • If no capture timestamp is set, the received timestamp is used instead.

  • Additionally, if data is received out of order, any values that are received with a capture timestamp before the last received timestamp are ignored.

In any case, a start or end delay shorter than the heartbeat or batch interval of the device behaves much like no delay at all. A good rule of thumb to prevent noise is to use a delay that is at least slightly longer than the heartbeat interval.
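The timestamp fallbacks listed above (missing capture timestamp, capture timestamp in the future) can be sketched as a single helper. The function name and signature are assumptions for illustration only; the out-of-order rule and the no-data rule are not modeled here.

```python
from datetime import datetime, timedelta


def effective_timestamp(captured_at, received_at):
    """Pick the timestamp used for evaluating start and end delays.

    captured_at: capture time reported by the device, or None if not set.
    received_at: time the event was received.
    """
    if captured_at is None:
        return received_at  # no capture timestamp: use the received timestamp
    if captured_at > received_at:
        return received_at  # timestamp in the future: treat as captured on receipt
    return captured_at      # normal case: use the capture timestamp


received = datetime(2024, 1, 1, 12, 0)
print(effective_timestamp(received - timedelta(hours=3), received))  # 2024-01-01 09:00:00
print(effective_timestamp(received + timedelta(days=1), received))   # 2024-01-01 12:00:00
print(effective_timestamp(None, received))                           # 2024-01-01 12:00:00
```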

info

By default, events captured by the MCU SDK are timestamped when they're uploaded to Memfault (on Linux and AOSP, events are always timestamped on-device using the device's RTC). This works well for frequently-connected devices, but for devices that batch-upload many heartbeats at once, every event in a batch ends up with the same timestamp. To avoid this, you can set an event timestamp using the MCU SDK.

The default MCU heartbeat interval is 3600 seconds (1 hour). Look at the MCU metrics docs for more context.

Creating a Device-threshold Alert

Let's create an alert when the battery_perc (battery percentage) is below 5%, for devices with serial numbers starting with MFLT.

  • Click Alerts in the sidebar, and Create Alert in the top right.
  • Provide a Title and an optional (but highly advised) Description.
  • Set a Condition value from a list of heartbeat metrics, or type the string key to find it.
    • We set the condition value to is less than 5.
    • We also wanted to make sure that the Device Serial values start with MFLT.
    • To do this we click (+) and condition, and specify these values.
  • Next choose which targets should receive these Alerts. Optionally, click "Manage targets" to navigate to the settings page to manage your targets.
  • Next, optionally set up incident start and end delays. See the "Incident creation and resolution" section above for more details.
  • Finally, specify when to receive notifications for these alerts.
    • To be notified as soon as an incident is created, select "when a new incident starts". Note that incident creation depends on the start delay.
    • To be notified as soon as an incident is resolved, select "when an incident is resolved". Note that incident resolution depends on the end delay.
    • To receive a summary of new & active incidents, select "a scheduled summary of incidents..." and set up your preferred schedule.
      • The timezone is prefilled according to your browser or custom timezone, but you can modify it as you need.
      • You can have a minimum of 1 and maximum of 4 summary schedules configured per alert.
Creating an Alert in the Memfault app
  • With the Alert configured as shown above, click OK to create it.
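To make the condition from this walkthrough concrete, the sketch below checks a single heartbeat against it: battery_perc is less than 5 and the Device Serial starts with MFLT. The function name and the heartbeat shape (a dict of metric keys to values) are hypothetical and only illustrate the logic; the Alert itself is configured in the UI as described above.

```python
def matches_alert(device_serial, metrics):
    """Return True if a heartbeat satisfies the example Alert's conditions.

    device_serial: the Device Serial as a string.
    metrics: dict mapping metric keys to values (hypothetical shape).
    """
    value = metrics.get("battery_perc")
    if value is None:
        return False  # heartbeat does not carry the metric
    return device_serial.startswith("MFLT") and value < 5


print(matches_alert("MFLT0042", {"battery_perc": 3}))   # True
print(matches_alert("MFLT0042", {"battery_perc": 80}))  # False
print(matches_alert("DEMO0001", {"battery_perc": 3}))   # False
```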

Incident Details UI

Once the Alert is set up, Memfault will watch for Events matching the criteria specified. If it finds any, it will create Incidents, which can be found by clicking on the Alert entry in the Alerts table.

Expect alerts to be delivered in the UI almost immediately under Incidents. Navigate to Alerts and select the Alert you want. Incidents are listed per device.

Screenshot of the Memfault web app showing the details of a specific incident and the metric payload that last triggered it