Reliability on Torizon

Applicable for

TorizonCore (Linux)

Torizon 5.4.0

Introduction

Reliability is an important topic for embedded systems. Once you have deployed thousands of devices to the field, malfunctions or successful attacks can harm people and equipment and can incur costs for on-site maintenance.

Torizon strives to be a reliable system from its conception and at all levels. Be it on TorizonCore or our tools and features, like TorizonCore Builder and our OTA update system, we care about providing safe defaults and guidance to our customers.

In this article, we go through features you can use to further increase the reliability of your product:

Docker data integrity checker: recover from data corruption in extremely adverse situations.
Docker container health monitor: restart a container if a certain condition fails.

Applying Configuration to a Custom TorizonCore Image

Once you apply the features described in this article to a board, you must create a custom TorizonCore image with the exact same configuration to install on several boards during production programming.

You can do this with the TorizonCore Builder Tool - Customization for Production Programming and Torizon OTA, more specifically by Capturing Changes in the Configuration of a Board on TorizonCore.

Prerequisites

It is recommended that you:

Have a SoM with TorizonCore 5.4.0 or newer.
Have a brief understanding of the TorizonCore technical overview.
Have a brief understanding of the update system technical overview.
Adhere to the Torizon Best Practices Guide while doing development.
Subscribe to updates on our TorizonCore Issue Tracker.

Docker data integrity checker

Docker data might get corrupted on the device. This is rare, but it can happen in specific cases such as malfunctioning hardware or unintended power cuts during write operations to the storage device (NAND or eMMC). The risk is minimized in TorizonCore because most of the filesystem is mounted read-only and journaling is enabled on read-write mount points. Still, if corruption does occur, containers may fail to start.

To recover from such situations, TorizonCore provides a feature called the Docker integrity checker.

If the docker-compose systemd service is not able to start all containers successfully, the docker-integrity-checker systemd service will be triggered.

This service performs an integrity check on all installed Docker images that are defined in the /var/sota/storage/docker-compose/docker-compose.yml file, which is the file used by docker-compose.service.

If any of the Docker images are identified as corrupted, they will be deleted and pulled again from the container registry.

This feature is currently disabled by default in TorizonCore, and can be enabled by creating the /etc/docker/enable-integrity-checker file:

# touch /etc/docker/enable-integrity-checker
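
To confirm the setting on a device, a minimal sketch like the following can report whether the flag file is present. The helper name integrity_checker_enabled is an assumption made for illustration; only the file path comes from the text above.

```shell
#!/bin/sh
# Hypothetical helper (not part of TorizonCore): succeeds when the flag file
# that enables the Docker integrity checker exists.
# An alternative path can be passed as the first argument.
integrity_checker_enabled() {
    flag="${1:-/etc/docker/enable-integrity-checker}"
    [ -e "$flag" ]
}

if integrity_checker_enabled; then
    echo "Docker integrity checker: enabled"
else
    echo "Docker integrity checker: disabled"
fi
```

If the checker has been triggered, its output should be available in the journal of the docker-integrity-checker unit, for example with journalctl -u docker-integrity-checker.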

Warning: This feature can create additional network traffic, because a container image that is detected as corrupted is downloaded again.

Docker container health monitor

A container can appear to be up and running while not behaving as intended. To improve the reliability of the system, TorizonCore is able to monitor the health of running containers and restart them if needed.

To monitor a container in TorizonCore, one must:

Declare a user-defined check to determine the health state of a running container
Label the container with "autoheal=true"
Enable the docker-watchdog.service systemd service

Given the above conditions, TorizonCore will check the container for its health state every 5 minutes and restart it if the "unhealthy" state is detected.
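
Conceptually, the monitoring described above resembles the sketch below. It is not the TorizonCore implementation, only an illustration built from standard Docker CLI filters; the function name restart_unhealthy is hypothetical.

```shell
#!/bin/sh
# Illustration only: restart every running container that is labeled
# autoheal=true and currently reported as unhealthy by Docker.
restart_unhealthy() {
    docker ps --quiet \
        --filter "label=autoheal=true" \
        --filter "health=unhealthy" |
    while read -r id; do
        echo "restarting unhealthy container $id"
        docker restart "$id"
    done
}

# Usage: run once by hand to see which containers would be affected.
#   restart_unhealthy
```

Running the docker ps part on its own, without the loop, is a harmless way to list the containers that currently match both conditions.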

User-defined check

Docker containers can be configured with a check to determine whether or not running containers are in a "healthy" state.

Here is an example of defining a health check. In this case, it checks that the /tmp/.X11-unix/X0 socket exists:

healthcheck:
    test: ["CMD", "test", "-S", "/tmp/.X11-unix/X0"]
    interval: 5s
    timeout: 4s
    retries: 2
    start_period: 10s

If the socket does not exist, the check fails, and after 2 consecutive failures (the retries value) the container becomes "unhealthy". More information about Docker healthcheck is available in the Docker Compose file reference.
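
To see the state Docker currently reports for a container, the standard docker inspect command can be wrapped as in this minimal sketch. The function name container_health and the container name my-app are placeholders for illustration.

```shell
#!/bin/sh
# Print the health status Docker reports for one container:
# "starting", "healthy" or "unhealthy".
# Assumes the container was created with a health check; without one,
# docker inspect is expected to report an error instead of a status.
container_health() {
    docker inspect --format '{{.State.Health.Status}}' "$1"
}

# Usage (my-app is a placeholder container name):
#   container_health my-app
```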

Label

Every container that should be monitored must be labeled with "autoheal=true":

    labels:
      - autoheal=true
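
Putting the health check and the label together, a service definition might look like the sketch below, which writes an example Compose file to a scratch location so it can be reviewed before use. The service name my-app, the image example/my-app:latest, the variable EXAMPLE_COMPOSE_FILE, and the output file name are placeholders, not values from TorizonCore.

```shell
#!/bin/sh
# Write an example Compose file that combines the health check and the label.
# EXAMPLE_COMPOSE_FILE is a hypothetical variable used only by this sketch.
out="${EXAMPLE_COMPOSE_FILE:-./docker-compose.example.yml}"

cat > "$out" <<'EOF'
version: "2.4"
services:
  my-app:                          # placeholder service name
    image: example/my-app:latest   # placeholder image
    healthcheck:
      test: ["CMD", "test", "-S", "/tmp/.X11-unix/X0"]
      interval: 5s
      timeout: 4s
      retries: 2
      start_period: 10s
    labels:
      - autoheal=true
EOF

echo "wrote $out"
```

The result can be checked with docker-compose -f docker-compose.example.yml config, which prints the resolved configuration or reports syntax errors.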

Enabling docker-watchdog service

The docker-watchdog systemd service can be enabled and started by running:

sudo systemctl enable --now docker-watchdog.service

After enabling and starting this service, all containers configured with a health check and label as described above will be monitored and restarted if they become "unhealthy".
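
To verify that the service is both enabled at boot and currently running, a small sketch using the standard systemctl is-enabled and is-active queries can help. The function name watchdog_status is hypothetical.

```shell
#!/bin/sh
# Print whether the watchdog unit is enabled at boot and currently running.
watchdog_status() {
    unit="docker-watchdog.service"
    echo "enabled: $(systemctl is-enabled "$unit" 2>/dev/null)"
    echo "active: $(systemctl is-active "$unit" 2>/dev/null)"
}

# Usage:
#   watchdog_status
```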

Docker live restore

Docker live restore is a feature meant to keep containers alive during daemon downtime.

It is not enabled by default, and using it does not make sense in the default TorizonCore setup, because:

The Docker daemon systemd service is configured to reboot the board if the daemon crashes.
The Torizon update system requires a reboot for OS updates, and the Docker daemon is updated as part of an OS update.

If the default behavior of rebooting does not meet your use case, edit the docker.service systemd unit to disable the reboot and prevent the automatic restart, then enable Docker live restore. Other changes to the default behavior may be required, and they may also introduce unknown consequences.

Warning: use this feature at your own risk, as we do not test it extensively. Be aware that we count unsuccessful reboots to roll back failed updates. By changing the default behavior, you remove a mechanism that can detect a potentially bad update and trigger a rollback. Other unknown side effects may be present, so make sure to validate the changes for your own use case.
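
In upstream Docker, live restore is switched on with the "live-restore" key in the daemon configuration file, normally /etc/docker/daemon.json. The sketch below writes an example fragment to a scratch file and defines a check against the running daemon. The variable EXAMPLE_DAEMON_JSON, the output file name, and the function name live_restore_enabled are assumptions for illustration, and the required docker.service changes are not shown because they depend on your image.

```shell
#!/bin/sh
# Write an example daemon configuration to a scratch file for review.
# Merge the key into any existing /etc/docker/daemon.json rather than
# overwriting that file.
out="${EXAMPLE_DAEMON_JSON:-./daemon.json.example}"

cat > "$out" <<'EOF'
{
  "live-restore": true
}
EOF

# Succeeds when the running daemon reports live restore as enabled.
live_restore_enabled() {
    [ "$(docker info --format '{{.LiveRestoreEnabled}}')" = "true" ]
}

# Usage, after the daemon has picked up the new configuration:
#   live_restore_enabled && echo "live restore is on"
```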

To test the setup, make sure there is a running container, forcibly kill the Docker daemon, and start it again:

sudo killall --signal SIGKILL /usr/bin/dockerd
sudo systemctl start docker
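
One way to tell whether a container really kept running across the daemon restart is to compare its recorded start time before and after, as in this minimal sketch. The function names and the container name my-app are placeholders; the .State.StartedAt field is part of the standard docker inspect output.

```shell
#!/bin/sh
# Print the time at which Docker last started the given container.
container_started_at() {
    docker inspect --format '{{.State.StartedAt}}' "$1"
}

# Succeeds if the start time recorded earlier still matches the current one,
# meaning the container was not restarted in between.
survived_daemon_restart() {
    name="$1"
    before="$2"
    [ "$(container_started_at "$name")" = "$before" ]
}

# Usage (my-app is a placeholder container name):
#   before="$(container_started_at my-app)"
#   ... kill and start the daemon as shown above ...
#   survived_daemon_restart my-app "$before" && echo "container kept running"
```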

Tip: if you restart the service with sudo systemctl restart docker, the containers will be stopped and restarted as well, regardless of the Docker configuration.

To make changes reproducible, Capture Changes in the Configuration of a Board on TorizonCore.
