memfaultd Built-in Metrics
Built-in System Metric Collection
With built-in system Metric collection enabled, memfaultd captures readings on
the state and health of your Device which are aggregated as part of Metric
Reports uploaded to Memfault.
When built-in system Metric collection is enabled, readings from collectd that
share a top-level namespace with one of the Metric groups listed below (for
example, cpu or memory) are dropped. This avoids double-counting, and avoids
mixing readings that collectd and memfaultd may calculate differently.
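The filtering rule above can be sketched as follows. This is a minimal illustration, not memfaultd's implementation. The helper names are hypothetical, and the set of built-in namespaces is an assumption taken from the first path component of the metric names documented on this page.

```python
from __future__ import annotations

# Hypothetical set of built-in top-level namespaces, assumed from the metric
# tables on this page (not read from memfaultd itself).
BUILTIN_NAMESPACES = {
    "cpu", "memory", "thermal", "interface", "interfaces",
    "wireless", "processes", "disk_space", "diskstats",
}


def top_level_namespace(key: str) -> str:
    """Return the first "/"-separated component of a metric key."""
    return key.split("/", 1)[0]


def filter_collectd_readings(readings: dict[str, float]) -> dict[str, float]:
    """Drop collectd readings whose namespace collides with a built-in group."""
    return {
        key: value
        for key, value in readings.items()
        if top_level_namespace(key) not in BUILTIN_NAMESPACES
    }


if __name__ == "__main__":
    collectd = {"cpu/cpu/percent/idle": 87.5, "myapp/queue_depth": 4.0}
    print(filter_collectd_readings(collectd))  # {'myapp/queue_depth': 4.0}
```

Readings in an application's own namespace pass through untouched; only namespaces that memfaultd also populates are dropped.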
Aggregation Types
As with StatsD Metric readings, memfaultd uses different aggregation types
internally when aggregating the Metrics it collects itself. The aggregation
types are:
- Histogram: An average of all readings captured within a Metric Report, weighted evenly. Some Histogram Metrics also report minimum and maximum values -- see Min/Max Metrics below.
- Gauge: Stores the most recent value for a Metric. When a new reading is received, the old value is discarded and replaced with the new reading's value.
- Counter: A simple monotonically increasing sum. New readings within the duration of a Metric Report are added to the sum of all previous readings captured within that Metric Report.
- String: Stores a string value as a report tag. Used to capture identifying information such as Device names or OUI values.
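To make the differences concrete, here is a minimal sketch of the four aggregation behaviours described above for a single Metric Report. The class names and methods are hypothetical and only model the documented semantics; this is not memfaultd's internal code.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Histogram:
    """Evenly weighted average of every reading in one Metric Report."""
    readings: list[float] = field(default_factory=list)

    def add(self, value: float) -> None:
        self.readings.append(value)

    def value(self) -> Optional[float]:
        # No readings in this report means there is nothing to average.
        if not self.readings:
            return None
        return sum(self.readings) / len(self.readings)


@dataclass
class Gauge:
    """Keeps only the most recent reading; older values are discarded."""
    latest: Optional[float] = None

    def add(self, value: float) -> None:
        self.latest = value

    def value(self) -> Optional[float]:
        return self.latest


@dataclass
class Counter:
    """Running sum of all readings added during one Metric Report."""
    total: float = 0.0

    def add(self, value: float) -> None:
        self.total += value

    def value(self) -> float:
        return self.total


@dataclass
class StringTag:
    """Holds a string value that is reported as a tag."""
    text: Optional[str] = None

    def add(self, value: str) -> None:
        self.text = value

    def value(self) -> Optional[str]:
        return self.text


if __name__ == "__main__":
    h, g, c = Histogram(), Gauge(), Counter()
    for reading in (10.0, 20.0, 60.0):
        h.add(reading)
        g.add(reading)
        c.add(reading)
    print(h.value(), g.value(), c.value())  # 30.0 60.0 90.0
```

Feeding the same three readings into each type shows why the choice matters: the Histogram reports 30.0, the Gauge only the last value (60.0), and the Counter the sum (90.0).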
Min/Max Metrics
For Histogram Metrics, memfaultd normally reports only the average of all
readings captured during a Metric Report interval. When min/max reporting is
enabled for a Metric, memfaultd will emit three values per interval instead:
- <metric_name> -- the average of all readings
- <metric_name>_min -- the minimum reading observed
- <metric_name>_max -- the maximum reading observed
This is useful for detecting spikes or drops that would otherwise be hidden by averaging -- for example, a brief CPU spike to 100% that averages out to 40% over the heartbeat interval.
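The CPU spike example can be reproduced with a short sketch. The function name `emit_histogram` is hypothetical, and the sketch assumes the readings are evenly spaced within the interval so that a plain mean matches the "weighted evenly" average described above.

```python
from __future__ import annotations


def emit_histogram(name: str, readings: list[float], min_max: bool) -> dict[str, float]:
    """Summarize one report interval of readings for a Histogram Metric.

    Always emits <name> (the average); with min_max=True it also emits
    <name>_min and <name>_max.
    """
    if not readings:
        return {}
    summary = {name: sum(readings) / len(readings)}
    if min_max:
        summary[f"{name}_min"] = min(readings)
        summary[f"{name}_max"] = max(readings)
    return summary


if __name__ == "__main__":
    # A brief spike to 100% inside an interval that otherwise sits at 20%.
    cpu = [20.0, 20.0, 100.0, 20.0]
    print(emit_histogram("cpu_usage_pct", cpu, min_max=False))
    # {'cpu_usage_pct': 40.0}
    print(emit_histogram("cpu_usage_pct", cpu, min_max=True))
    # {'cpu_usage_pct': 40.0, 'cpu_usage_pct_min': 20.0, 'cpu_usage_pct_max': 100.0}
```

The average alone (40.0) gives no hint of the spike; the `_max` value exposes it.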
Default min/max Metrics
The following built-in Histogram Metrics report min/max values automatically:
| Metric pattern | Description |
|---|---|
| cpu_usage_pct | Overall CPU usage |
| memory_pct | Overall memory usage |
| cpu_usage_<process>_pct | Per-process CPU usage |
| memory_<process>_pct | Per-process memory usage |
| interface/<interface>/bytes_per_second/rx | Network interface RX throughput |
| interface/<interface>/bytes_per_second/tx | Network interface TX throughput |
| thermal/* | All thermal zone metrics |
Adding custom min/max Metrics
You can configure additional Histogram Metrics to report min/max values using
the metrics.min_max_metrics configuration option. For example, to track
min/max for custom application Metrics:
```json
{
  "metrics": {
    "min_max_metrics": ["my_custom_metric", "my_service_latency"]
  }
}
```
This will cause my_custom_metric and my_service_latency to each emit _min
and _max values alongside their averages in every Metric Report. These custom
entries are merged with the default min/max metrics listed above.
Min/max reporting only applies to Histogram Metrics. Gauge and Counter Metrics are not affected by this setting.
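The merge of custom entries with the defaults can be pictured with the sketch below. It assumes, without confirmation from this page, that the `<process>` and `<interface>` placeholders and `thermal/*` behave like shell-style wildcards, modelled here with Python's `fnmatch`; memfaultd's real matching rules may differ, and the helper names are hypothetical.

```python
from __future__ import annotations

from fnmatch import fnmatchcase

# Defaults from the table above, rewritten as globs (assumption: placeholders
# and "*" behave like shell-style wildcards).
DEFAULT_MIN_MAX_PATTERNS = [
    "cpu_usage_pct",
    "memory_pct",
    "cpu_usage_*_pct",
    "memory_*_pct",
    "interface/*/bytes_per_second/rx",
    "interface/*/bytes_per_second/tx",
    "thermal/*",
]


def effective_patterns(custom: list[str]) -> list[str]:
    """Merge metrics.min_max_metrics entries with the defaults, dropping duplicates."""
    return list(dict.fromkeys(DEFAULT_MIN_MAX_PATTERNS + custom))


def reports_min_max(metric: str, patterns: list[str]) -> bool:
    """True if a Histogram Metric should also emit _min and _max values."""
    return any(fnmatchcase(metric, pattern) for pattern in patterns)


if __name__ == "__main__":
    patterns = effective_patterns(["my_custom_metric", "my_service_latency"])
    print(reports_min_max("my_service_latency", patterns))  # True
    print(reports_min_max("thermal/cpu-thermal/temp", patterns))  # True
    print(reports_min_max("memory/memory/free", patterns))  # False
```

Custom entries extend the default list rather than replacing it, which is why a thermal metric still reports min/max after the custom names are added.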
CPU
| Metric | Description | Aggregation Type |
|---|---|---|
| cpu/cpu/percent/idle | The percent of time this core spent in the idle state | Histogram |
| cpu/cpu/percent/system | The percent of time this core spent in the system state | Histogram |
| cpu/cpu/percent/user | The percent of time this core spent in the user state | Histogram |
| cpu/cpu/percent/iowait | The percent of time this core spent in the iowait state | Histogram |
| cpu/cpu/percent/irq | The percent of time this core spent in the irq state | Histogram |
| cpu/cpu/percent/softirq | The percent of time this core spent in the softirq state | Histogram |
| cpu/cpu/percent/nice | The percent of time this core spent in the nice state | Histogram |
Memory
| Metric | Description | Aggregation Type |
|---|---|---|
| memory/memory/free | The amount of free memory on the system in bytes | Histogram |
| memory/memory/used | The amount of used memory on the system in bytes | Histogram |
| memory/memory/cached | The amount of cached memory on the system in bytes | Histogram |
| memory/memory/buffered | The amount of buffered memory on the system in bytes | Histogram |
| memory/memory/slab_recl | The amount of reclaimable slab memory on the system in bytes | Histogram |
| memory/memory/slab_unrecl | The amount of unreclaimable slab memory on the system in bytes | Histogram |
Virtual Memory
| Metric | Description | Aggregation Type |
|---|---|---|
| memory/vm/swaps_in_per_second | Pages swapped in per second from /proc/vmstat | Histogram |
| memory/vm/swaps_out_per_second | Pages swapped out per second from /proc/vmstat | Histogram |
| memory/vm/pages_in_per_second | Pages paged in per second from /proc/vmstat | Histogram |
| memory/vm/pages_out_per_second | Pages paged out per second from /proc/vmstat | Histogram |
Temperature
| Metric | Description | Aggregation Type |
|---|---|---|
| thermal/<thermal zone type>/temp | Each thermal zone temperature reading in degrees Celsius | Histogram |
Network Interfaces
By default, memfaultd will detect interfaces whose name does not start with
lo, tun, dummy, veth, or usb (to avoid automatically tracking virtual
interfaces) and monitor them via the following metrics.
A specific set of interfaces can be specified via the
metrics.system_metric_collection.network_interfaces configuration.
| Metric | Description | Aggregation Type |
|---|---|---|
| interface/<interface>/bytes_per_second/rx | Bytes per second received on this interface | Histogram |
| interface/<interface>/bytes_per_second/tx | Bytes per second sent on this interface | Histogram |
| interface/<interface>/errors_per_second/rx | Errors per second for RX traffic on this interface | Histogram |
| interface/<interface>/errors_per_second/tx | Errors per second for TX traffic on this interface | Histogram |
| interface/<interface>/dropped_per_second/rx | Packets dropped per second for RX traffic on this interface | Histogram |
| interface/<interface>/dropped_per_second/tx | Packets dropped per second for TX traffic on this interface | Histogram |
| interface/<interface>/packets_per_second/rx | Packets received per second on this interface | Histogram |
| interface/<interface>/packets_per_second/tx | Packets sent per second on this interface | Histogram |
| interface/<interface>/total_bytes/rx | Total bytes received on this interface since the last reading | Counter |
| interface/<interface>/total_bytes/tx | Total bytes sent on this interface since the last reading | Counter |
| interfaces/<interface>/rssi | RSSI of the connection on this interface to a connected AP/router (wireless only) | Histogram |
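The default interface selection described above amounts to a prefix check. A minimal sketch follows; the helper name is hypothetical, and on Linux the candidate names could come from a listing such as `/sys/class/net` (an assumption about the input source, not a statement about how memfaultd enumerates interfaces).

```python
from __future__ import annotations

# Prefixes from the text above; interfaces whose names start with any of them
# are skipped when no explicit interface list is configured.
EXCLUDED_PREFIXES = ("lo", "tun", "dummy", "veth", "usb")


def default_interfaces(names: list[str]) -> list[str]:
    """Return the interfaces that would be monitored by default."""
    return [name for name in names if not name.startswith(EXCLUDED_PREFIXES)]


if __name__ == "__main__":
    candidates = ["lo", "eth0", "wlan0", "tun0", "veth1a2b", "usb0"]
    print(default_interfaces(candidates))  # ['eth0', 'wlan0']
```

Because the check is a plain prefix match, any interface whose name merely begins with one of these strings is also skipped; configuring metrics.system_metric_collection.network_interfaces is the way to monitor such an interface.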
Wireless Networks
| Metric | Description | Aggregation Type |
|---|---|---|
| wireless/oui/local_<device> | Local OUI (Organizationally Unique Identifier) for the specified Device | String |
| wireless/oui/ap_<device> | Access Point OUI (Organizationally Unique Identifier) associated with the specified local Device | String |
Process-level Metrics
memfaultd will capture the following metrics for processes specified in the
metrics.system_metric_collection.processes configuration.
<name> is the filename of the process's executable. Names longer than 16
characters are truncated to their first 16 characters.
| Metric | Description | Aggregation Type |
|---|---|---|
| processes/<name>/rss_bytes | Resident set size of the process in bytes | Histogram |
| processes/<name>/vm_bytes | Virtual memory size of the process in bytes | Histogram |
| processes/<name>/num_threads | Number of threads in the process | Histogram |
| processes/<name>/cpu/percent/user | Percent of CPU time the process spent in user mode | Histogram |
| processes/<name>/cpu/percent/system | Percent of CPU time the process spent in kernel mode | Histogram |
| processes/<name>/pagefaults/minor | Minor page faults (not requiring disk I/O) since the last reading | Histogram |
| processes/<name>/pagefaults/major | Major page faults (requiring disk I/O) since the last reading | Histogram |
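The naming rule for `<name>` can be illustrated with a short sketch. The helper names are hypothetical; the sketch only applies the rule stated above (executable filename, first 16 characters) and shows where the result lands in a metric key.

```python
from pathlib import PurePosixPath

MAX_NAME_CHARS = 16  # truncation length stated above


def process_metric_name(exe_path: str) -> str:
    """<name>: the executable's filename, cut to its first 16 characters."""
    return PurePosixPath(exe_path).name[:MAX_NAME_CHARS]


def process_metric_key(exe_path: str, suffix: str) -> str:
    """Build a key such as processes/<name>/rss_bytes (hypothetical helper)."""
    return f"processes/{process_metric_name(exe_path)}/{suffix}"


if __name__ == "__main__":
    print(process_metric_key("/usr/bin/my-very-long-daemon-name", "rss_bytes"))
    # processes/my-very-long-dae/rss_bytes
```

One consequence of truncation: two executables whose filenames share the same first 16 characters end up under the same `processes/<name>/` prefix.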
Disk Space Metrics
By default, memfaultd will detect disks whose ID starts with /dev (excluding
loop and ram devices) and track their disk space with these metrics.
Other disks or partitions can be specified via the
metrics.system_metric_collection.disk_space configuration.
| Metric | Description | Aggregation Type |
|---|---|---|
| disk_space/<disk>/free_bytes | Bytes on the corresponding disk or partition that are unused | Histogram |
| disk_space/<disk>/used_bytes | Bytes on the corresponding disk or partition that are in use | Histogram |
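A sketch of how the default disk selection and the two readings could be computed on Linux. It assumes that disks are discovered from text in `/proc/mounts` format and that the byte counts come from `os.statvfs`. Taking "free" as space available to unprivileged users and "used" as total minus all free blocks (as `df` does) is also an assumption, as are the helper names.

```python
from __future__ import annotations

import os


def default_disks(proc_mounts: str) -> dict[str, str]:
    """Map device -> mount point for /dev devices, skipping loop and ram devices.

    proc_mounts is text in /proc/mounts format: device, mount point, fstype, ...
    """
    disks: dict[str, str] = {}
    for line in proc_mounts.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        device, mount_point = fields[0], fields[1]
        base = device.rsplit("/", 1)[-1]
        if device.startswith("/dev") and not base.startswith(("loop", "ram")):
            disks.setdefault(device, mount_point)  # keep the first mount seen
    return disks


def disk_space(mount_point: str) -> dict[str, int]:
    """Free and used bytes for the filesystem mounted at mount_point.

    Assumption: free = blocks available to unprivileged users (f_bavail);
    used = total blocks minus all free blocks (f_blocks - f_bfree).
    """
    st = os.statvfs(mount_point)
    return {
        "free_bytes": st.f_bavail * st.f_frsize,
        "used_bytes": (st.f_blocks - st.f_bfree) * st.f_frsize,
    }


if __name__ == "__main__":
    sample = (
        "/dev/mmcblk0p2 / ext4 rw 0 0\n"
        "/dev/loop0 /snap/core squashfs ro 0 0\n"
        "tmpfs /tmp tmpfs rw 0 0\n"
    )
    print(default_disks(sample))  # {'/dev/mmcblk0p2': '/'}
    print(disk_space("/"))
```

Because blocks reserved for root are neither "available" nor "used" under these definitions, free_bytes and used_bytes need not add up to the filesystem size.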
Diskstats Metrics
By default, memfaultd will collect disk I/O statistics from /proc/diskstats
for all block devices (excluding loop and ram devices). A specific set of
devices can be specified via the metrics.system_metric_collection.diskstats
configuration.
| Metric | Description | Aggregation Type |
|---|---|---|
| diskstats/<device>/reads_per_second | Read operations per second on this device | Histogram |
| diskstats/<device>/writes_per_second | Write operations per second on this device | Histogram |
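Since `/proc/diskstats` exposes cumulative counters, per-second rates come from the difference between two samples. The sketch below assumes that "operations" means the "reads completed" and "writes completed" columns (fields 4 and 8 of each line). The function names are hypothetical.

```python
from __future__ import annotations


def parse_diskstats(text: str) -> dict[str, tuple[int, int]]:
    """Map device name -> (reads completed, writes completed).

    /proc/diskstats columns: major, minor, name, reads completed, reads merged,
    sectors read, ms reading, writes completed, ...
    """
    stats: dict[str, tuple[int, int]] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue
        name = fields[2]
        if name.startswith(("loop", "ram")):  # excluded by default, per the text
            continue
        stats[name] = (int(fields[3]), int(fields[7]))
    return stats


def diskstats_rates(
    before: dict[str, tuple[int, int]],
    after: dict[str, tuple[int, int]],
    elapsed_s: float,
) -> dict[str, float]:
    """Turn two cumulative samples taken elapsed_s seconds apart into rates."""
    rates: dict[str, float] = {}
    for name, (reads, writes) in after.items():
        if name not in before:
            continue  # device appeared between samples; no baseline yet
        prev_reads, prev_writes = before[name]
        rates[f"diskstats/{name}/reads_per_second"] = (reads - prev_reads) / elapsed_s
        rates[f"diskstats/{name}/writes_per_second"] = (writes - prev_writes) / elapsed_s
    return rates


if __name__ == "__main__":
    t0 = "   8       0 sda 1000 10 20000 500 400 5 8000 300 0 600 800\n"
    t1 = "   8       0 sda 1300 10 26000 600 460 5 9000 350 0 700 950\n"
    print(diskstats_rates(parse_diskstats(t0), parse_diskstats(t1), elapsed_s=10.0))
    # {'diskstats/sda/reads_per_second': 30.0, 'diskstats/sda/writes_per_second': 6.0}
```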
MMC Disk Metrics
For eMMC/MMC block devices (devices with names matching mmcblk*), memfaultd
collects additional disk health and identification Metrics.
| Metric | Description | Aggregation Type |
|---|---|---|
| diskstats/<disk>/lifetime_remaining_pct | Estimated lifetime remaining percentage (type A) | Gauge |
| diskstats/<disk>/lifetime_b_remaining_pct | Estimated lifetime remaining percentage (type B) | Gauge |
| diskstats/<disk>/bytes_written | Bytes written to this device since the last reading | Counter |
| diskstats/<disk>/total_size_bytes | Total size of the device in bytes | Gauge |
| diskstats/<disk>/name | Product name of the device | String |
| diskstats/<disk>/manufacturer_id | Manufacturer ID of the device | String |
| diskstats/<disk>/manufacture_date | Manufacture date of the device | String |
| diskstats/<disk>/revision | Firmware revision of the device | String |
| diskstats/<disk>/serial | Serial number of the device | String |
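The type A and type B lifetime values presumably correspond to the eMMC DEVICE_LIFE_TIME_EST_TYP_A/B estimates. On Linux kernels that expose them, these appear in `/sys/block/<disk>/device/life_time` as two hex values. Each value encodes a 10% band of life *used*, so converting it to a remaining percentage requires a choice. This sketch picks the pessimistic end of each band. memfaultd's own conversion may differ, and the helper names are hypothetical.

```python
from __future__ import annotations

from pathlib import Path
from typing import Optional


def remaining_pct(estimate: int) -> Optional[int]:
    """Convert an eMMC life-time estimate to a remaining percentage.

    0x01 means 0-10% of life used, ..., 0x0A means 90-100% used, and 0x0B
    means the estimate has been exceeded. The pessimistic end of each band
    is reported here (an assumption).
    """
    if 0x01 <= estimate <= 0x0A:
        return 100 - 10 * estimate
    if estimate == 0x0B:
        return 0
    return None  # 0x00 (not defined) or a reserved value


def parse_life_time(text: str) -> tuple[Optional[int], Optional[int]]:
    """Parse the "0xAA 0xBB" sysfs format into (type A, type B) remaining %."""
    type_a, type_b = (int(token, 16) for token in text.split())
    return remaining_pct(type_a), remaining_pct(type_b)


def read_mmc_lifetime(disk: str) -> tuple[Optional[int], Optional[int]]:
    """Read /sys/block/<disk>/device/life_time, e.g. disk="mmcblk0"."""
    return parse_life_time(Path(f"/sys/block/{disk}/device/life_time").read_text())


if __name__ == "__main__":
    print(parse_life_time("0x01 0x02\n"))  # (90, 80)
```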
Log Counter Metrics
memfaultd includes a number of built-in
Log Filtering rules that increment counter
metrics when a matching log line is received.
| Metric | Description | Aggregation Type |
|---|---|---|
| oomkill_<process_name> | Count of times process_name has been killed by the OOM killer | Counter |
| systemd_restarts_<service_name> | Count of times the service_name service has been restarted | Counter |
These metrics require the logging feature to be enabled and logs to be sent
to memfaultd.
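The idea of a log rule that increments a counter can be sketched with regular expressions. The patterns below are illustrative matches for typical kernel OOM-kill and systemd restart messages. They are assumptions, not memfaultd's built-in rules, and the function name is hypothetical.

```python
from __future__ import annotations

import re
from collections import Counter

# Illustrative patterns for common kernel and systemd log lines (assumptions).
OOM_KILL = re.compile(r"Out of memory: Killed process \d+ \((?P<name>[^)]+)\)")
SYSTEMD_RESTART = re.compile(r"(?P<name>\S+)\.service: Scheduled restart job")


def count_log_metrics(lines: list[str]) -> dict[str, int]:
    """Increment a counter metric for every log line that matches a rule."""
    counts = Counter()
    for line in lines:
        if match := OOM_KILL.search(line):
            counts[f"oomkill_{match['name']}"] += 1
        elif match := SYSTEMD_RESTART.search(line):
            counts[f"systemd_restarts_{match['name']}"] += 1
    return dict(counts)


if __name__ == "__main__":
    log = [
        "kernel: Out of memory: Killed process 4242 (myapp) total-vm:123456kB",
        "systemd[1]: myservice.service: Scheduled restart job, restart counter is at 1.",
        "systemd[1]: myservice.service: Scheduled restart job, restart counter is at 2.",
    ]
    print(count_log_metrics(log))
    # {'oomkill_myapp': 1, 'systemd_restarts_myservice': 2}
```

The process or service name captured from the log line becomes the suffix of the counter's name, which is how one rule produces a separate counter per process or service.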