Skip to main content

Linux Reboot Reason Tracking

Introduction

There are many reasons a device may reboot in the field — whether it be due to a kernel panic, a user reset, or a firmware update.

Within the Memfault UI, reboot events are displayed for each device as well as summarized in the main "Overview" dashboard:

Reboots chart in the 'Overview' dashboard of the Memfault Web App

Reboots chart in the "Overview" dashboard of the Memfault Web App.

In this guide we will walk through how to use the reboot reason tracking feature from the Memfault Linux SDK to collect this data.

tip

Keep meta-memfault-example open as a reference implementation. Your integration should look similar to it once you're done following the steps in this tutorial.

Prerequisites

The memfaultd daemon

Follow the integration guide to learn how to set this up for your device. A key function of memfaultd is to detect, classify and upload reboot events to the Memfault platform. This is enabled through the reboot feature.

Linux kernel pstore / ramoops configuration

The detection of kernel panics as a reboot reason depends on the so called ramoops subsystem and pstore filesystem. From the Linux kernel admin guide:

Ramoops is an oops/panic logger that writes its logs to RAM before the system crashes. It works by logging oopses and panics in a circular buffer. Ramoops needs a system with persistent RAM so that the content of that area can survive after a restart.

The pstore is a RAM-backed filesystem that persists across reboots.

The easiest way to enable the pstore in your kernel when using Yocto is via the KERNEL_FEATURES bitbake variable and add the pstore kernel feature. To get finer-grained control over how the pstore is configured, you can instead use its Kconfig options directly.

We also recommend adding the debug-panic-oops kernel feature to enable a kernel panic when an "oops" is encountered:

KERNEL_FEATURES:append = " cgl/features/pstore/pstore.scc cfg/debug/misc/debug-panic-oops.scc"

In the meta-memfault-example QEMU integration, the KERNEL_FEATURES approach is taken, see these lines of code.

Next, you will need to specify what region of RAM to use. There are several ways of doing this. The recommended way is to use a Device Tree binding. Please consult this section on ramoops parameters in the Linux kernel admin guide for more details.

Lastly, it's possible to configure the kernel to always dump the kmsg logs using printk.always_kmsg_dump. This is expected to be disabled (the default).

note

Note that in the meta-memfault-example QEMU integration, we are deviating from the recommendation of using a Device Tree binding. The ramoops.* kernel command line arguments are used instead. The reason for this is that QEMU typically auto-generates a Device Tree on-the-fly and extending it is more complicated.

Systemd configuration

The memfaultd daemon will take care of cleaning up /sys/fs/pstore after a reboot of the system.

Often, systemd-pstore.service is configured to carry out this task. This would conflict with memfaultd performing this task. Therefore, systemd-pstore.service has to be disabled. This service is automatically excluded when including the meta-memfault layer.

Note that memfaultd does not provide functionality (yet) to archive pstore files (like systemd-pstore.service can). If this is necessary for you, the work-around is to create a service that performs the archiving and runs before memfaultd.service starts up.

Detecting reboot due to Over The Air updates

The OTA agent you are using (for example swupdate) should inform the Memfault SDK before rebooting after installing an update.

In our meta-memfault-example this is accomplished by configuring swupdate to run the command memfaultctl reboot --reason 3 after installing the update.

Reboot Reason Classification

Automatically detected reboot reasons

There are a few reboot reasons that memfaultd can determine with confidence for all embedded Linux devices out-of-the-box:

ReasonDescriptionInformation Source
Kernel PanicThe Linux kernel crashedpstore / ramoops
User Reset"Graceful" shutdown of the systemsystemd state

Providing a reboot reason to Memfault

There are many more types of reboots for which the detection can only be implemented in device-specific ways.

For example, an SoC may have special hardware to detect power brownouts. There is no "standard" way to read brownout detection state in Linux.

Another example: a device's software may contain logic to decide that the device needs to shut down or reset, e.g. because the battery has dropped below a certain point, or because the user pressed a button.

To track reboot reasons which the Memfault SDK cannot possibly know how to detect, the SDK provides a way to extend the reboot reason classification, via the last_reboot_reason_file and the memfaultctl reboot command.

Reboot Reason API

The last_reboot_reason_file API

The location of this file can be configured using the reboot.last_reboot_reason_file key in memfaultd.conf. Take a look at the /etc/memfaultd.conf reference for more information.

During the startup of the system, memfaultd will attempt to read this file. If the last_reboot_reason_file file exists, it is expected to contain the decimal reboot reason ID (or a custom reboot reason, see below) that corresponds to the last reboot that occurred. If the file does not exist, memfaultd will interpret this as "no device specific reason known". After reading the file, memfaultd will immediately unlink the file.

There are 2 points during a boot cycle where your software may write to this file:

  • During system start-up, before memfaultd starts: this is the time to check external reason sources, for example, the brownout detection registers of the SoC. Make sure to clear the original source (e.g. the SoC's detection registers), after having written the reboot reason to the last_reboot_reason_file. You also need to make sure that this code runs before memfaultd starts. The recommended way to do is to add Before=memfaultd.service to your systemd unit configuration. In the flowchart below, the yellow box indicate when this code may be run.
  • During system shut-down: for controlled, "expected" shutdowns and resets (such as the earlier example of a device's software deciding to shut down due to a low battery level), it usually makes most sense to write to the last_reboot_reason_file during the shut-down sequence. At this point, it does not matter whether the last_reboot_reason_file gets written before or after memfaultd shuts down, because it does not touch nor read the file during shutdown. In the flowchart below, the green boxes indicate when this code may be run (pick one).

Rebooting with a reason

To save a device-specific reboot reason and immediately reboot you can use memfaultctl.

# memfaultctl reboot --reason <reboot reason (integer or string)>

This command simply writes the reboot reason in the last_reboot_reason_file and then runs the reboot command.

Standardized Reboot Reasons

Memfault provides a standard list of reboot reasons. We recommend using them when possible.

Custom Reboot Reasons

You can also define your own reboot reasons. To do so, pass a string to the --reason flag of memfaultctl reboot:

# memfaultctl reboot --reason DarkSideOfTheMoon

You can also write this custom reboot reason directly into the last_reboot_reason_file.

tip

If the reboot reason string starts with a ! prefix, it will be treated by Memfault as an "unexpected reboot". Memfault can use this information to distinguish between expected and unexpected reboots in some of the dashboards.

Example

The example below shows what a snippet of a Python application may look like which handles shutting down the device when the battery level goes below 10%. When this happens, "4", the reboot reason ID for "Low Power", is written to the /media/last_reboot_reason file before it shuts down the device:

# This function is run periodically to check the battery level:
def check_battery_level():
level_percent = battery.read_level()
if level_percent < 10:
with open("/media/last_reboot_reason", "w") as file:
file.write("4") # 4 == "Low Power"
shutdown_gracefully()
# ...

Reboot Reason Precedence

As described in the previous sections, there are multiple sources of information where reboot reasons may come from. In case there are multiple reasons found, this order of precedence is used:

  1. pstore/ramoops (Kernel Panic)
  2. last_reboot_reason_file
  3. systemd shutdown state (User Reset)

If none of the sources determined a reason for a reboot, "Unspecified" will be used as the reason.

note

To aid debugging, memfaultd will log the source of the reboot reason that was used as well as any sources found that provided a reason, but got discarded because another source took precedence over it:

$ journalctl -u memfaultd
Nov 15 09:15:36 qemuarm64 systemd[1]: Starting memfaultd daemon...
Nov 15 09:15:36 qemuarm64 memfaultd[184]: reboot:: Using reboot reason ButtonReset (0x0006) from custom source for boot_id 847c0502-9f47-43d0-b58d-7029b30f9cd0
Nov 15 09:15:36 qemuarm64 memfaultd[184]: reboot:: Discarded reboot reason UserReset (0x0002) from internal source for boot_id 847c0502-9f47-43d0-b58d-7029b30f9cd0
...

Enqueuing & Upload

Once memfaultd determines the reboot reason, it will enqueue a reboot event for upload to the Memfault cloud. Note that uploading does not happen immediately. Instead, the queue of events is uploaded periodically. See the memfaultd.conf configuration key upload_interval_seconds for more information.

Rate Limiting

Ingestion of Reboots Events may be rate-limited.

Process Flow

The flowchart below documents the process of reboot reason tracking and puts all the topics discussed together.

Flowchart of the Reboot Reason Classification by memfaultd

Flowchart of the Reboot Reason Classification by memfaultd.

Test your integration

You can easily test your integration with the memfaultctl reboot --reason <reason> command. It accepts an integer reboot reason (defaults to 0 UNKNOWN_REASON), will save the reason and reboot the system (using the rebootcommand).

# memfaultctl reboot --reason 4    # Low Power reboot

After reboot, to immediately flush the reboot event to Memfault, you can use:

# memfaultctl sync