Linux Reboot Reason Tracking
Introduction
A device may reboot in the field for many reasons, such as a kernel panic, a user reset, or a firmware update.
Within the Memfault UI, reboot events are displayed for each device and summarized in the main "Overview" dashboard.
In this guide we will walk through how to use the reboot reason tracking feature from the Memfault Linux SDK to collect this data.
Keep meta-memfault-example open as a reference implementation. Your integration should look similar to it once you're done following the steps in this tutorial.
Prerequisites
The memfaultd daemon
Follow the integration guide to learn how to set this up for your device. A key function of memfaultd is to detect, classify and upload reboot events to the Memfault platform. This is enabled through the reboot feature.
Linux kernel pstore/ramoops configuration
The detection of kernel panics as a reboot reason depends on the so-called ramoops subsystem and the pstore filesystem. From the Linux kernel admin guide:
Ramoops is an oops/panic logger that writes its logs to RAM before the system crashes. It works by logging oopses and panics in a circular buffer. Ramoops needs a system with persistent RAM so that the content of that area can survive after a restart.
The pstore is a RAM-backed filesystem that persists across reboots.
When using Yocto, the easiest way to enable the pstore in your kernel is to add the pstore kernel feature to the KERNEL_FEATURES bitbake variable. To get finer-grained control over how the pstore is configured, you can instead use its Kconfig options directly.
We also recommend adding the debug-panic-oops kernel feature to enable a kernel panic when an "oops" is encountered:
KERNEL_FEATURES:append = " cgl/features/pstore/pstore.scc cfg/debug/misc/debug-panic-oops.scc"
In the meta-memfault-example QEMU integration, the KERNEL_FEATURES approach is taken; see these lines of code.
Next, you will need to specify what region of RAM to use. There are several ways of doing this. The recommended way is to use a Device Tree binding. Please consult this section on ramoops parameters in the Linux kernel admin guide for more details.
Lastly, it's possible to configure the kernel to always dump the kmsg logs using printk.always_kmsg_dump. This option is expected to remain disabled, which is the default.
Note that in the meta-memfault-example QEMU integration, we are deviating from the recommendation of using a Device Tree binding. The ramoops.* kernel command line arguments are used instead. The reason for this is that QEMU typically auto-generates a Device Tree on-the-fly and extending it is more complicated.
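Once the kernel is configured, a quick sanity check on the running device is to confirm that a pstore filesystem is actually mounted. The sketch below is one way to do that in Python by parsing mount-table text in the /proc/mounts format. The function name pstore_mount_points is hypothetical and not part of the Memfault SDK.

```python
def pstore_mount_points(mounts_text: str) -> list:
    """Return the mount points whose filesystem type is "pstore".

    Expects text in the /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "pstore":
            points.append(fields[1])
    return points


# Sample input for illustration; on a device, read /proc/mounts instead.
sample = (
    "proc /proc proc rw,relatime 0 0\n"
    "pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0\n"
)
print(pstore_mount_points(sample))
```

On a device you would pass the contents of /proc/mounts to the function. An empty result suggests that the pstore kernel feature is not enabled or the filesystem has not been mounted.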
Systemd configuration
The memfaultd daemon will take care of cleaning up /sys/fs/pstore after a reboot of the system.
Often, systemd-pstore.service is configured to carry out this task, which would conflict with memfaultd doing the same. Therefore, systemd-pstore.service has to be disabled. This service is automatically excluded when including the meta-memfault layer.
Note that memfaultd does not provide functionality (yet) to archive pstore files (like systemd-pstore.service can). If this is necessary for you, the work-around is to create a service that performs the archiving and runs before memfaultd.service starts up.
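As a rough illustration of the work-around above, the archiving step could be a small script started by a systemd unit that is ordered before memfaultd.service. This is a minimal sketch, not a Memfault-provided tool: the function name archive_pstore, the boot_label argument, and any archive directory you choose are assumptions made for the example.

```python
import shutil
from pathlib import Path


def archive_pstore(pstore_dir: Path, archive_root: Path, boot_label: str) -> list:
    """Copy every regular file in pstore_dir into archive_root/boot_label.

    Files are copied rather than moved, so memfaultd can still read them
    and perform its own clean-up afterwards.
    """
    copied = []
    if not pstore_dir.is_dir():
        return copied
    target = archive_root / boot_label
    for entry in sorted(pstore_dir.iterdir()):
        if entry.is_file():
            target.mkdir(parents=True, exist_ok=True)
            # copyfile copies contents only, without permissions or metadata.
            shutil.copyfile(entry, target / entry.name)
            copied.append(target / entry.name)
    return copied
```

A unit running this script would pass /sys/fs/pstore as pstore_dir and a directory on persistent storage as archive_root. Leaving the original files in place keeps the clean-up responsibility with memfaultd, as described above.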
Detecting reboot due to Over The Air updates
The OTA agent you are using (for example swupdate) should inform the Memfault SDK before rebooting after installing an update.
In our meta-memfault-example this is accomplished by configuring swupdate to run the command memfaultctl reboot --reason 3 after installing the update.
Reboot Reason Classification
Automatically detected reboot reasons
There are a few reboot reasons that memfaultd can determine with confidence for all embedded Linux devices out-of-the-box:
| Reason | Description | Information Source |
|---|---|---|
| Kernel Panic | The Linux kernel crashed | pstore / ramoops |
| User Reset | "Graceful" shutdown of the system | systemd state |
Providing a reboot reason to Memfault
There are many more types of reboots for which the detection can only be implemented in device-specific ways.
For example, an SoC may have special hardware to detect power brownouts. There is no "standard" way to read brownout detection state in Linux.
Another example: a device's software may contain logic to decide that the device needs to shut down or reset, e.g. because the battery has dropped below a certain point, or because the user pressed a button.
To track reboot reasons which the Memfault SDK cannot possibly know how to detect, the SDK provides a way to extend the reboot reason classification, via the last_reboot_reason_file and the memfaultctl reboot command.
Reboot Reason API
The last_reboot_reason_file API
The location of this file can be configured using the reboot.last_reboot_reason_file key in memfaultd.conf. Take a look at the /etc/memfaultd.conf reference for more information.
During the startup of the system, memfaultd
will attempt to read this file. If
the last_reboot_reason_file
file exists, it is expected to contain the decimal
reboot reason ID (or a custom reboot reason,
see below) that corresponds to the last reboot that
occurred. If the file does not exist, memfaultd
will interpret this as "no
device specific reason known". After reading the file, memfaultd
will
immediately unlink the file.
There are two points during a boot cycle where your software may write to this file:
- During system start-up, before memfaultd starts: this is the time to check external reason sources, for example, the brownout detection registers of the SoC. Make sure to clear the original source (e.g. the SoC's detection registers) after having written the reboot reason to the last_reboot_reason_file. You also need to make sure that this code runs before memfaultd starts. The recommended way to do this is to add Before=memfaultd.service to your systemd unit configuration. In the flowchart below, the yellow box indicates when this code may be run.
- During system shut-down: for controlled, "expected" shutdowns and resets (such as the earlier example of a device's software deciding to shut down due to a low battery level), it usually makes most sense to write to the last_reboot_reason_file during the shut-down sequence. At this point, it does not matter whether the last_reboot_reason_file gets written before or after memfaultd shuts down, because memfaultd neither reads nor modifies the file during shutdown. In the flowchart below, the green boxes indicate when this code may be run (pick one).
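The start-up case from the list above could look roughly like the following. This is a sketch under several assumptions: the function name record_boot_reason and the custom reason string "!BrownoutDetected" are made up for illustration, and how you read and clear the SoC's brownout flag is entirely device-specific, so it is represented here by a plain boolean argument.

```python
import os
from pathlib import Path

# Hypothetical custom reason. The leading "!" marks it as an unexpected reboot.
BROWNOUT_REASON = "!BrownoutDetected"


def record_boot_reason(reason_file: Path, brownout_detected: bool) -> bool:
    """Write a reboot reason if the hardware reported a brownout.

    Returns True when a reason was written, so the caller knows it is now
    safe to clear the hardware flag.
    """
    if not brownout_detected:
        return False
    # Write to a temporary file and rename it into place, so a reader
    # never observes a partially written file.
    tmp = reason_file.with_name(reason_file.name + ".tmp")
    tmp.write_text(BROWNOUT_REASON)
    os.replace(tmp, reason_file)
    return True
```

The script would be run from a systemd unit that declares Before=memfaultd.service, with reason_file set to the path configured in reboot.last_reboot_reason_file. Only clear the SoC's detection registers once the function has returned True.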
Rebooting with a reason
To save a device-specific reboot reason and immediately reboot, you can use memfaultctl.
# memfaultctl reboot --reason <reboot reason (integer or string)>
This command simply writes the reboot reason to the last_reboot_reason_file and then runs the reboot command.
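If your application is written in Python, it may be convenient to wrap this command in a small helper. The sketch below only builds and runs the memfaultctl reboot --reason invocation shown above. The helper names reboot_command and reboot_with_reason are hypothetical.

```python
import subprocess


def reboot_command(reason) -> list:
    """Build the argument list for `memfaultctl reboot --reason <reason>`.

    `reason` may be an integer reboot reason ID or a custom string.
    """
    return ["memfaultctl", "reboot", "--reason", str(reason)]


def reboot_with_reason(reason) -> None:
    """Record the reason and reboot. This does not return on success."""
    # Passing a list (not a shell string) avoids any shell quoting issues.
    subprocess.run(reboot_command(reason), check=True)
```

Keeping the command construction separate from its execution makes the helper easy to unit test without actually rebooting the machine.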
Standardized Reboot Reasons
Memfault provides a standard list of reboot reasons. We recommend using them when possible.
Custom Reboot Reasons
You can also define your own reboot reasons. To do so, pass a string to the --reason flag of memfaultctl reboot:
# memfaultctl reboot --reason DarkSideOfTheMoon
You can also write this custom reboot reason directly into the last_reboot_reason_file.
If the reboot reason string starts with a ! prefix, it will be treated by Memfault as an "unexpected reboot". Memfault can use this information to distinguish between expected and unexpected reboots in some of the dashboards.
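To make the rules above concrete, the following sketch classifies the content of a reason file the way this guide describes it: a decimal number is a reboot reason ID, anything else is a custom reason, and a leading ! flags a custom reason as unexpected. It is an illustration only and not memfaultd's actual parser. The names ParsedReason and parse_reason, and the choice to strip surrounding whitespace, are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class ParsedReason:
    kind: str                    # "id" for a decimal ID, "custom" for a string
    value: Union[int, str]       # the ID, or the custom string as written
    unexpected: Optional[bool]   # None for IDs: not modelled in this sketch


def parse_reason(text: str) -> ParsedReason:
    text = text.strip()  # assumption: ignore surrounding whitespace
    if not text:
        raise ValueError("empty reboot reason")
    if text.isascii() and text.isdigit():
        return ParsedReason("id", int(text), None)
    return ParsedReason("custom", text, text.startswith("!"))


print(parse_reason("4"))
print(parse_reason("!DarkSideOfTheMoon"))
```

A helper like this can be useful in tests of your own code that writes the file, for example to check that an unexpected shutdown path really produces a reason starting with !.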
Example
The example below shows what a snippet of a Python application may look like which handles shutting down the device when the battery level goes below 10%. When this happens, "4", the reboot reason ID for "Low Power", is written to the /media/last_reboot_reason file before it shuts down the device:
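# This function is run periodically to check the battery level.
# `battery` and `shutdown_gracefully` stand in for application-specific code.
def check_battery_level():
    level_percent = battery.read_level()
    if level_percent < 10:
        # Record the reason first, so memfaultd finds it on the next boot.
        with open("/media/last_reboot_reason", "w") as file:
            file.write("4")  # 4 == "Low Power"
        shutdown_gracefully()

# ...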
Reboot Reason Precedence
As described in the previous sections, there are multiple sources of information where reboot reasons may come from. In case there are multiple reasons found, this order of precedence is used:
1. pstore/ramoops (Kernel Panic)
2. last_reboot_reason_file
3. systemd shutdown state (User Reset)
If none of the sources determined a reason for a reboot, "Unspecified" will be used as the reason.
To aid debugging, memfaultd logs which source provided the reboot reason that was used, as well as any reasons from other sources that were discarded because a higher-precedence source was found:
$ journalctl -u memfaultd
Nov 15 09:15:36 qemuarm64 systemd[1]: Starting memfaultd daemon...
Nov 15 09:15:36 qemuarm64 memfaultd[184]: reboot:: Using reboot reason ButtonReset (0x0006) from custom source for boot_id 847c0502-9f47-43d0-b58d-7029b30f9cd0
Nov 15 09:15:36 qemuarm64 memfaultd[184]: reboot:: Discarded reboot reason UserReset (0x0002) from internal source for boot_id 847c0502-9f47-43d0-b58d-7029b30f9cd0
...
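The precedence rule and the "used" versus "discarded" distinction seen in the log can be modelled in a few lines. This is a simplified sketch of the behaviour described in this section, not memfaultd's implementation. The source labels and the function name resolve_reboot_reason are hypothetical.

```python
# Sources in order of precedence, highest first, as listed above.
PRECEDENCE = ("pstore", "last_reboot_reason_file", "systemd")


def resolve_reboot_reason(found: dict):
    """Pick the reboot reason to report.

    `found` maps a source name to the reason that source reported.
    Returns (reason, source, discarded), where `discarded` lists the
    (source, reason) pairs that lost out to a higher-precedence source.
    """
    candidates = [(src, found[src]) for src in PRECEDENCE if src in found]
    if not candidates:
        return "Unspecified", None, []
    source, reason = candidates[0]
    return reason, source, candidates[1:]


# Mirrors the log excerpt: a reason from the file wins over the systemd state.
print(resolve_reboot_reason({
    "systemd": "UserReset",
    "last_reboot_reason_file": "ButtonReset",
}))
```

In the example call, ButtonReset from the reason file is used and UserReset from the systemd state is reported as discarded, matching the two log lines shown above.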
Enqueuing & Upload
Once memfaultd determines the reboot reason, it will enqueue a reboot event for upload to the Memfault cloud. Note that uploading does not happen immediately. Instead, the queue of events is uploaded periodically. See the memfaultd.conf configuration key upload_interval_seconds for more information.
Ingestion of reboot events may be rate-limited.
Process Flow
The flowchart below documents the process of reboot reason tracking and puts all the topics discussed together.
Test your integration
You can easily test your integration with the memfaultctl reboot --reason <reason> command. It accepts an integer or string reboot reason (defaulting to 0, UNKNOWN_REASON), saves the reason, and reboots the system using the reboot command.
$ memfaultctl reboot --reason 4 # Low Power reboot
After reboot, to immediately flush the reboot event to Memfault, you can use:
$ memfaultctl sync