Memfault's Linux SDK has reached version 1.0 with the introduction of support
for coredumps: when any process on the system crashes, memfaultd produces a
memory dump that is uploaded to Memfault for further processing, enabling
detailed debugging across the fleet.
Together with the already existing support for OTA, metrics, and reboot reasons,
Memfault now offers all its essential features on Linux devices!
The documentation of Memfault's Linux SDK was
extended even further to explain the integration steps as well as the growing
number of configuration options.
Memfault's Linux support reaches another milestone: Devices can now report
metrics and diagnostic data to measure the success of software updates (OTA) and
to proactively diagnose anomalies before users even experience their effect. The
Memfault Linux SDK 0.3.0 ships
with a configurable set of
plugins for collectd to obtain
standard KPIs at the operating system level (e.g. available storage or RAM, CPU
utilization, or network status and traffic). You can also use the SDK to
collect product-specific custom metrics via statsd.
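As a rough sketch of what reporting a custom metric can look like: StatsD clients send plain-text datagrams of the form `name:value|type` over UDP. The example below formats and sends a gauge using only the Python standard library. The metric name `mydevice.temperature`, the host, and port 8125 (the conventional StatsD default) are assumptions for illustration; check the statsd listener configuration on your device for the actual address.

```python
import socket


def format_gauge(name: str, value: float) -> bytes:
    """Encode a StatsD gauge datagram: b"<name>:<value>|g"."""
    return f"{name}:{value}|g".encode("ascii")


def send_gauge(name: str, value: float,
               host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one gauge sample over UDP (fire-and-forget, no acknowledgement)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_gauge(name, value), (host, port))


if __name__ == "__main__":
    # Hypothetical product-specific metric; host and port are assumed defaults.
    send_gauge("mydevice.temperature", 41.5)
```

Because UDP is connectionless, the sender does not block or fail when no listener is running, which keeps metric reporting out of the application's critical path.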
When sent to the cloud, all telemetry data is processed and distilled into
fleet-wide time-series metrics (e.g. "was there an uptick in avg. CPU usage
since the last version?"), device attributes (e.g. "which devices at site B
have run for more than 6 months without a reboot?"), and detailed per-device
insights via the Timeline UI (e.g. "are there any anomalies in the network
traffic that correlate with crashes reported for this device?").
Memfault's Device Timeline provides a view of each device's metrics, reboots,
and crashes to make debugging easier. With the recent performance improvements,
Device Timeline now renders considerably more time-series metrics simultaneously,
and expandable "Panels" group relevant metrics together for ease of use (for
example Value History, Foreground, and Longwakes).
Memfault extends its features on embedded Linux
toward basic fleet operations. You can now measure basic fleet-wide health
metrics by tracking reboots and their causes at scale. Similar to Memfault's
MCU and Android SDKs, there is now a dedicated Memfault Linux SDK 0.2.0 with
source code and examples. The SDK repository comes with Docker images including
QEMU support to simplify the first steps.
As part of the SDK, a new on-device agent memfaultd orchestrates the
configuration of related components such as SWUpdate for OTA. It will act as a
minimal yet central component in future releases for features such as metrics
and crash reporting.
Memfault's charts can now be normalized
to convert absolute values such as "number of incidents", sums, and counts to
corresponding relative values "per 1,000 devices". This helps in understanding
real trends when you are looking at values over time or when comparing values
between populations of different sizes.
This feature also works with custom metric charts for any custom metric. It is
particularly helpful for measuring the success of an ongoing OTA update by
comparing devices from a large production Cohort "Default" against those from a
smaller test Cohort "Beta". Chart normalization is also generally useful when
the population size changes over time (e.g. new devices being activated
continuously).
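The arithmetic behind normalization can be sketched in a few lines. The cohort sizes and incident counts below are made-up numbers for illustration only; the point is that the smaller Cohort can have the higher rate despite far fewer absolute incidents.

```python
def per_thousand(count: float, population: int) -> float:
    """Convert an absolute count into a rate per 1,000 devices."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count * 1000 / population


# Hypothetical figures: a large production Cohort and a small test Cohort.
default_rate = per_thousand(120, 40_000)  # 120 incidents across 40,000 devices
beta_rate = per_thousand(3, 500)          # 3 incidents across 500 devices

print(f"Default: {default_rate} per 1,000 devices")  # Default: 3.0 per 1,000 devices
print(f"Beta:    {beta_rate} per 1,000 devices")     # Beta:    6.0 per 1,000 devices
```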
Memfault improved its notification system and how notifications are sent for
Alerts. For each individual Alert, you can now
decide which team members, external systems, or groups thereof should receive an
email. All members of @team-maintenance may want to receive notifications
about devices with an abnormal battery discharge rate while a spike in
connectivity issues on the "Beta" Cohort may only be relevant in the
#beta-release Slack channel.
At Settings → Notifications, there are extensive options to customize the
@userhandle for any team member and to connect Memfault to dedicated Slack
channels or any other external system (e.g. PagerDuty, Opsgenie) by registering
external email addresses. Any combination of these User Handles and External
Targets can be added to a Notification Group and used to control who is
notified for each Alert.
Memfault's over-the-air update service is now
available on Embedded Linux with SWUpdate via the
hawkBit DDI. This makes all Memfault
OTA management and hosting features such as Cohorts, staged rollouts, full vs.
delta releases, and a scalable global CDN available to Linux devices that
utilize one of the most popular update agents. Memfault also added support for
forced (non-interactive) updates – invaluable for delivering security updates to
embedded IoT devices.
Memfault surfaces relevant insights about your fleet via charts. They receive
updates regularly to be more comprehensive and useful.
The dashboard chart "Active Devices" was redesigned to be a bar chart to
communicate its underlying data more clearly: "Devices that communicated with
Memfault in a given time period". Tooltips on several charts better highlight
the numerical values their lines and bars represent. The chart "Reboot Reasons"
also allows for drill-down on any day by clicking on the respective labels.
Memfault
organizes your data
around the concept of devices. An investigative
search for matching devices
is a key activity when analyzing data around device attributes and time-series
metrics. Memfault's device search also acts as a hub when clicking on details on
charts, OTA management, or as part of crash analysis. It allows you to precisely
describe a population of devices before performing follow-up tasks such as
assigning them to a specific Cohort or describing a new Device Set.
We made significant improvements to the device search! Most fields now accept
multiple values ("OR") and can be repeated where applicable ("AND"). You can
search for device attributes and time-series data at the same time and it is
even possible to search for historical data across different time ranges at
once.
When combined, you can now conveniently describe (and store as a
Device Set) populations such as
"Devices on v1.2 that were charged >90% earlier this week but rebooted due to
low power yesterday" – to validate presumed bug fixes via operational data, or
"Devices with attributes was_reworked: true, shipment_state: "back to us" and
assignee: "John"" – to use custom device attributes for inventory management.
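The OR/AND semantics of the search fields can be illustrated with a small, self-contained sketch. This models only the boolean logic, not Memfault's actual query interface; the field names and device records are hypothetical.

```python
from typing import Any, Iterable

Device = dict[str, Any]
# One filter = (field name, accepted values). A device satisfies the filter
# if its value for that field is ANY of the accepted values ("OR").
Filter = tuple[str, set]


def matches(device: Device, filters: Iterable[Filter]) -> bool:
    """A device matches only if it satisfies ALL filters ("AND")."""
    return all(device.get(field) in accepted for field, accepted in filters)


# Hypothetical device records.
devices = [
    {"serial": "A1", "version": "1.2", "cohort": "Beta"},
    {"serial": "B2", "version": "1.3", "cohort": "Default"},
    {"serial": "C3", "version": "1.1", "cohort": "Beta"},
]

# (version is 1.2 OR 1.3) AND (cohort is Beta)
filters = [("version", {"1.2", "1.3"}), ("cohort", {"Beta"})]
selected = [d["serial"] for d in devices if matches(d, filters)]
print(selected)  # ['A1']
```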
Memfault's AOSP SDK "Bort" received a major update, version 4.0, with many
improvements such as added support for Android 12 and multi-user compatibility.
We added support for Custom Metrics so that devices can report product-specific
numerical values, strings, and state transitions at regular intervals. Together
with a growing set of built-in Metrics, this leads to a powerful combination of
per-device debugging (e.g. correlation of CPU activity and battery voltage) and
fleet-wide insights (e.g. "how many devices exceed 80% of their storage?" or
"what is the average battery discharge rate?"). The reported values contribute
to the recently introduced
Time-Series Metrics and Device Attributes.
Another, more advanced feature enables devices without a direct Internet
connection to report crashes and metrics to Memfault. Data is packaged as .mar
(Memfault Archive) files and vendors may upload them at a later time or upon
request (e.g. downloaded periodically via USB or in a local network by an
auxiliary device). This allows vendors to use Memfault in scenarios with strict
security requirements.
Memfault's Dashboard provides an
overview of your fleet at a glance. We have updated the charts "Active Devices"
(more sources used as signal) and "Software Versions" (only active devices
considered) to better compare apples to apples.
The visible time range for "Software Versions" can now be changed from "2
months" all the way down to just "24 hours". Since the same chart is now being
used on the Cohort details page, it not only allows you to see long-term
trends but also acts as a timely signal for observing the effect of an ongoing
OTA software rollout.
The charts "Seen Devices" and "Received Events" have been removed.
Improved OTA with Version Matrix and Delta Releases
Memfault steps up
OTA observability and release management. Use the new
Version Matrix (Fleet → Cohorts
→ Cohort Details) to learn about the version distribution of your devices (rows)
and which version your devices will be updated to via OTA (columns).
Changes to your software rollout (including
percentages for staged rollouts)
are reflected in the matrix immediately. This is especially helpful when using
the newly introduced Delta Releases that
describe a path between specific versions. Every software version, OTA release,
and number in the matrix is clickable to get to more details if needed.
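Conceptually, the matrix is a count of devices per (current version, target version) pair. The sketch below builds such a table from a list of device records; the records are hypothetical, and the target version for each device is assumed to have been determined already by the rollout rules.

```python
from collections import Counter


def version_matrix(devices):
    """Count devices per (current version, target version) pair."""
    return Counter((d["current"], d["target"]) for d in devices)


# Hypothetical devices; "target" is the version each would receive via OTA.
devices = [
    {"current": "1.0", "target": "1.2"},
    {"current": "1.0", "target": "1.2"},
    {"current": "1.1", "target": "1.2"},
    {"current": "1.2", "target": "1.2"},
]

matrix = version_matrix(devices)
rows = sorted({current for current, _ in matrix})  # current versions
cols = sorted({target for _, target in matrix})    # target versions

for current in rows:
    # A Counter returns 0 for (current, target) pairs with no devices.
    print(current, [matrix[(current, target)] for target in cols])
```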
Even complex and unusual scenarios are visible at a glance: devices with no
compatible OTA payload, devices with a higher version than the Cohort's target
release, must-pass-through releases and their effect, as well as many other edge
cases are represented with data that is always live.
The Version Matrix gives you confidence both while preparing your software
rollout and during the ongoing rollout, and it still helps you understand the
version distribution of your fleet when no further updates are planned.
You can now observe the development of your device fleet even better via fully
customizable Device Sets (Fleet → Device Sets). Based on your project-specific
Device Attributes and Time-Series Metrics, each set describes devices that match
unique criteria such as "devices that exceeded 40 MB of daily network traffic"
or "devices in the UK".
You can create a Device Set from any existing device search and use it to
track key performance indicators over time.
The new chart type Issue Chart
complements the custom Metrics Chart (Dashboards→Metrics) and plots the number
of occurrences of a specific group of issues over time. When combined with the
recently introduced
Chart Comparison Mode, Issue Charts
allow for sophisticated tracking of ongoing software updates to answer the
relevant question "does my software update fix the bugs it claims to fix?"