
Watchdog Instrumentation

Prerequisite

This guide assumes you have already completed the minimal integration of the Memfault SDK. If you have not, please complete the appropriate Getting Started guide.

This document describes the necessary integration details to capture coredumps when a Watchdog occurs.

An example of a captured Watchdog crash is shown below:

Enabling a Watchdog Timer

This guide assumes a Watchdog Timer is already present and enabled on the system. For some general background on Watchdog Timers and recommendations on how to use them, see this article:

https://interrupt.memfault.com/blog/firmware-watchdog-best-practices

Instrumenting Watchdogs with Memfault

When using the Memfault SDK, we recommend instrumenting the watchdog timer so a coredump will be captured when the watchdog trips.

The fundamental technique is to use Memfault's Watchdog interrupt handler to trap the interrupt when the watchdog fires. This provides the best backtrace, since it uses the Memfault interrupt shim.

To use it, set the MEMFAULT_EXC_HANDLER_WATCHDOG Memfault SDK config to the name of the Watchdog interrupt handler registered in your interrupt vector table, and make sure there is no competing implementation of that handler. This config is usually set in the memfault_platform_config.h header file.
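As a minimal sketch, the setting might look like the following. The handler name `WDT_IRQHandler` is a placeholder chosen for illustration; the real name is whichever symbol your chip's startup file places in the watchdog slot of the vector table.

```c
// memfault_platform_config.h (excerpt)
//
// WDT_IRQHandler is a hypothetical name. Replace it with the watchdog
// interrupt handler symbol used by your chip's vector table.
#define MEMFAULT_EXC_HANDLER_WATCHDOG WDT_IRQHandler
```

With this in place, the application should not define its own handler with the same name, since that would be the competing implementation mentioned above.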

For the coredump save to complete, the chip must be configured so that the Watchdog Timer fires its interrupt without immediately resetting the chip.

Configuring Watchdog Timer Warning Threshold

Depending on the particular chip, there's often a "warning" threshold that can be configured into the watchdog, which will trigger an interrupt some period before the watchdog reset occurs.

For example, on the NXP RT1021 chip (from the RT1020 Reference Manual):

If available, set the warning threshold far enough ahead of the reset that the coredump save has time to finish (for example, accounting for long flash page erase times).

We typically recommend a relatively long timeout, such as 10 seconds, but be sure to consult the flash chip documentation to ensure enough time is reserved.
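One way to sanity-check the chosen timeout is to estimate the worst-case save time from the flash datasheet figures. The helper below is a rough sketch under simplifying assumptions (the whole coredump region is erased and then written sequentially). The struct, the function name, and every number used with it are hypothetical and are not part of the Memfault SDK.

```c
#include <stdint.h>

// All fields are illustrative; take real worst-case values from the flash datasheet.
typedef struct {
  uint32_t coredump_size_bytes;  // size of the coredump storage region
  uint32_t sector_size_bytes;    // flash erase granularity
  uint32_t sector_erase_max_ms;  // worst-case erase time for one sector
  uint32_t write_bytes_per_ms;   // sustained program throughput
} CoredumpStorageTiming;

// Returns a worst-case estimate, in milliseconds, for erasing and then writing
// the whole coredump region, with margin_percent added on top.
uint32_t coredump_save_budget_ms(const CoredumpStorageTiming *t,
                                 uint32_t margin_percent) {
  // Round up: a partially used sector still needs a full erase
  const uint32_t sectors =
      (t->coredump_size_bytes + t->sector_size_bytes - 1) / t->sector_size_bytes;
  const uint32_t erase_ms = sectors * t->sector_erase_max_ms;
  // Round up the write time as well
  const uint32_t write_ms =
      (t->coredump_size_bytes + t->write_bytes_per_ms - 1) / t->write_bytes_per_ms;
  const uint32_t total_ms = erase_ms + write_ms;
  return total_ms + (total_ms * margin_percent) / 100;
}
```

For instance, a 64 KiB region on flash with 4 KiB sectors, a 400 ms worst-case sector erase, and 256 bytes/ms write throughput gives 16 erases (6400 ms) plus 256 ms of writes. With a 25% margin that comes to 8320 ms, which fits inside a 10 second warning threshold.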

If the hardware watchdog does not permit a "warning" threshold of sufficient time, it may instead make sense to set the hardware watchdog to a very long period, and use a timer peripheral interrupt to trigger the Memfault software watchdog.
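The arrangement can be sketched as a single feed function that reloads both the hardware watchdog and a shorter one-shot timer. On real hardware, the timer's interrupt vector would be the handler named by MEMFAULT_EXC_HANDLER_WATCHDOG. The code below is a host-side model only: the `board_*` functions, the timeout values, and the `app_*` names are hypothetical stand-ins for chip-specific driver calls.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical timeouts: the timer expires 10 s before the hardware watchdog,
// leaving that window for the coredump save.
#define HW_WATCHDOG_TIMEOUT_MS 30000u
#define SW_WATCHDOG_TIMEOUT_MS 20000u

// Stand-ins for board drivers (hypothetical). Real versions would reload the
// watchdog counter and restart a one-shot timer peripheral.
static uint32_t s_hw_feed_count;
static uint32_t s_timer_deadline_ms;
static bool s_timer_armed;

static void board_hw_watchdog_feed(void) { s_hw_feed_count++; }

static void board_timer_restart(uint32_t now_ms, uint32_t timeout_ms) {
  s_timer_deadline_ms = now_ms + timeout_ms;
  s_timer_armed = true;
}

// Call wherever the application already feeds its watchdog
void app_watchdog_feed(uint32_t now_ms) {
  board_hw_watchdog_feed();
  board_timer_restart(now_ms, SW_WATCHDOG_TIMEOUT_MS);
}

// Models the point at which the timer interrupt would fire. The signed
// difference keeps the comparison correct across tick counter wraparound.
bool app_sw_watchdog_expired(uint32_t now_ms) {
  return s_timer_armed && (int32_t)(now_ms - s_timer_deadline_ms) >= 0;
}
```

Feeding both from one place keeps the two timeouts in step, so the timer interrupt always fires first if the application stops feeding.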

Example Sequence Diagram

Normally the coredump saving will complete before the full watchdog timeout, and the Memfault fault handler will reset the chip. The following sequence diagram illustrates this:

Example Sequence Diagram with Hardware Watchdog Triggering Reset

If the coredump saving takes too long, the hardware watchdog will trigger a reset instead, causing a lost coredump. In this case, Memfault saves a Trace Event to indicate that the watchdog reset occurred and failed to save the coredump, which will generate an Issue:

Petting the Watchdog Prior to Coredump Save

Writing the coredump to the platform storage medium is the time-consuming part of the coredump capture.

One technique that may be useful is to apply one last reset of the watchdog timer (often called "petting" the watchdog) prior to the coredump save operation. The place to insert the watchdog timer reset is in this weakly-defined function:

//! Called prior to invoking any platform_storage_[read/write/erase] calls upon crash
//!
//! @note a weak no-op version of this API is defined in memfault_coredump.c because many platforms will
//! not need to implement this at all. Some potential use cases:
//! - re-configuring the platform storage driver to be polling instead of interrupt based
//! - giving the watchdog one last feed prior to the storage code getting called
//! @return true to continue saving the coredump, false to abort
extern bool memfault_platform_coredump_save_begin(void);
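A minimal override might look like the sketch below. `board_watchdog_feed()` is a hypothetical stand-in for the board's watchdog driver; here it only counts calls so the example compiles on its own.

```c
#include <stdbool.h>
#include <stdint.h>

// Stand-in for the board's watchdog driver (hypothetical name). A real
// version would write the refresh sequence to the watchdog peripheral.
static uint32_t s_feed_count;
static void board_watchdog_feed(void) { s_feed_count++; }

// Overrides the weak no-op version provided by the Memfault SDK
bool memfault_platform_coredump_save_begin(void) {
  // Restart the watchdog countdown so the storage erase and write
  // operations start with a full timeout period
  board_watchdog_feed();
  return true;  // continue saving the coredump
}
```

Returning true lets the save proceed; returning false would abort it, as described in the header comment above.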

Software Watchdog

For other watchdog types that don't trigger a hardware interrupt, the Memfault assert macro MEMFAULT_SOFTWARE_WATCHDOG() can be used to trigger a coredump tagged as Software Watchdog.

caution

If MEMFAULT_SOFTWARE_WATCHDOG() is called from a watchdog interrupt handler, the Memfault interrupt shim will not be used, and the backtrace will not be as accurate. Instead, it's preferable to wire up the MEMFAULT_EXC_HANDLER_WATCHDOG interrupt handler directly.

One common use case for this feature is a "task watchdog" in a multi-threaded system, which checks for tasks that aren't busy-looping, but are stuck waiting on some OS primitive (mutex or queue) when they should be proceeding.

Call the MEMFAULT_SOFTWARE_WATCHDOG() macro when the task watchdog trips. Note that to diagnose such an issue, Memfault needs to support thread awareness for the RTOS in use, and the necessary threads and kernel data structures must be captured as part of the coredump (see the RTOS Analysis documentation for details).
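To illustrate the idea, here is a rough sketch of a check-in style task watchdog. It is not the Memfault task watchdog subsystem; the `task_watchdog_*` names, the task count, and the timeout are all hypothetical. When the SDK headers are not included, a fallback definition of MEMFAULT_SOFTWARE_WATCHDOG() simply counts trips so the example can run on a host.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#ifndef MEMFAULT_SOFTWARE_WATCHDOG
// Stand-in for builds without the SDK. The real macro is a Memfault assert
// that captures a coredump tagged as Software Watchdog.
static uint32_t s_trip_count;
#define MEMFAULT_SOFTWARE_WATCHDOG() do { s_trip_count++; } while (0)
#endif

#define NUM_WATCHED_TASKS 3      // hypothetical
#define TASK_TIMEOUT_MS   5000u  // hypothetical

typedef struct {
  bool enabled;
  uint32_t last_checkin_ms;
} WatchedTask;

static WatchedTask s_tasks[NUM_WATCHED_TASKS];

// A task calls this when it begins work that must finish within the timeout
void task_watchdog_start(size_t id, uint32_t now_ms) {
  if (id >= NUM_WATCHED_TASKS) { return; }
  s_tasks[id].enabled = true;
  s_tasks[id].last_checkin_ms = now_ms;
}

// A task calls this each time it makes progress
void task_watchdog_checkin(size_t id, uint32_t now_ms) {
  if (id >= NUM_WATCHED_TASKS) { return; }
  s_tasks[id].last_checkin_ms = now_ms;
}

// A task calls this before it intentionally blocks for a long time
void task_watchdog_stop(size_t id) {
  if (id >= NUM_WATCHED_TASKS) { return; }
  s_tasks[id].enabled = false;
}

// Call periodically, for example from a timer callback
void task_watchdog_poll(uint32_t now_ms) {
  for (size_t i = 0; i < NUM_WATCHED_TASKS; i++) {
    // Unsigned subtraction stays correct across tick counter wraparound
    if (s_tasks[i].enabled &&
        (now_ms - s_tasks[i].last_checkin_ms) > TASK_TIMEOUT_MS) {
      MEMFAULT_SOFTWARE_WATCHDOG();
    }
  }
}
```

A task blocked on a mutex or queue stops checking in, so the poll function notices even though the CPU itself is not stuck in a loop.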

note

Memfault provides a lightweight task watchdog subsystem which can be used if the system doesn't have one available. See here for more information.

Sample Integrations

Memfault provides several example implementations of software watchdogs, which can be used as-is or as a reference for your own implementation: