Torizon Updates Technical Overview

Select the version of your OS below. If you don't know which version you are using, run the command cat /etc/os-release or cat /etc/issue on the board.

Remember that you can always refer to the Torizon Documentation, where you will find many relevant articles that may help you during application development.

Torizon 5.0.0

Introduction

TorizonCore is built with OSTree and Aktualizr-Torizon. The former is a shared library and suite of command-line tools that combines a "git-like" model for committing and downloading bootable filesystem trees with a layer for deploying them and managing the bootloader configuration. The latter is a fork of Aktualizr, a "daemon-like" open-source implementation of the Uptane SOTA standard that secures updates end-to-end.

OSTree and Aktualizr-Torizon are complementary and together they form the foundation for OTA (over-the-air) and offline update capabilities on the device.

On the server side, Toradex is working on a cloud-based hosted option as well as an on-premises option to provide a complete OTA and offline solution that works with TorizonCore.

This article complies with the Typographic Conventions for Torizon Documentation.

OSTree

OSTree has its own article; please refer to OSTree for a brief overview and a demonstration of how to use it.

Our remote and offline updates implementation allows us to update the following components:

  • Kernel
  • Device tree and device tree overlays
  • initramfs
  • Root filesystem
    • /usr is updated
    • /var and /home are ignored, they keep their content over updates
    • /etc is updated via a 3-way merge (see the example below)
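
For instance, you can inspect which local modifications to /etc will be carried through the 3-way merge, using OSTree's built-in tooling:

    # Show local changes in /etc relative to the deployment defaults
    sudo ostree admin config-diff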

The following components cannot be updated at the moment, though we plan to support them in future releases of TorizonCore:

  • Bootloader artifacts
  • Arm Cortex-M firmware (on SoMs where a Cortex-M is available)

Uptane

Uptane is a de facto automotive SOTA standard, maintained by a non-profit consortium named Uptane Alliance under the IEEE/ISTO Federation. Its focus is to enable resilient, secure over-the-air software updates. It relies on multiple servers to provide security by validating data before a download starts, ensuring that even an offline attack that compromises a single server is not enough to compromise the security of the system. Uptane is an enhancement of TUF (The Update Framework), a security framework currently in very wide use to secure software and package updates on computers and smartphones. The motivations for extending TUF are described in detail in the Uptane Design page, and an accessible explanation of TUF can be found in its docs page Understand the Notary service architecture.

Aktualizr-Torizon

Aktualizr-Torizon is TorizonCore's client implementation of Uptane, forked from Aktualizr's default client. It is written in C++, and its responsibility is to communicate with an Uptane-compatible server: it checks whether new updates are available, installs them on the system, and reports status to the server, while guaranteeing the integrity and confidentiality of OTA and offline updates. Aktualizr-Torizon handles Docker image updates seamlessly by using Docker Compose YAML files.

How to Use Aktualizr-Torizon

Aktualizr - Modifying the Settings of Torizon OTA Client is a dedicated article that covers the practical aspects, including its usage.
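
As a quick sanity check on a device, you can verify that the update client is running and follow its logs. A minimal sketch, assuming the aktualizr service name used on standard TorizonCore images:

    # Check the status of the OTA/offline update client
    systemctl status aktualizr

    # Follow the client's logs
    sudo journalctl --follow --unit aktualizr*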

Update rollbacks

There are cases where the system may fail to boot, or where the boot process is considered unsuccessful due to a kernel panic or the failure of a critical user-space application. Developers can handle these issues during development, but they become much harder to deal with once the solution is deployed in the field and a bad update triggers them.

TorizonCore and Aktualizr-Torizon are fully capable of recovering from bad updates by:

  1. Identifying unsuccessful updates and rebooting the device when one occurs.
  2. Rolling back to the previous operating system version after 3 unsuccessful boots.

Identifying unsuccessful OS updates

In TorizonCore, the Linux kernel is configured to panic and reboot in case of freezes or crashes. This helps to recover from bad kernel updates.

At the user-space level, systemd hardware watchdog integration is enabled by default in TorizonCore. That means systemd will regularly ping the watchdog hardware, and if systemd (or the kernel) hangs, this ping will not happen anymore and the hardware will automatically reboot the device. This helps to recover from bad updates when the kernel or the initialization daemon (systemd) is not able to run.

Lastly, TorizonCore considers a boot successful if the boot-complete systemd target is reached. This is because the main operating system services required for proper operation, including the Docker daemon, are part of boot-complete.target. If boot-complete.target fails during an update, TorizonCore automatically reboots. This helps TorizonCore recover from bad updates when critical processes of the base operating system don't run as expected.
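
On a running device, you can verify these mechanisms from a shell. A minimal sketch, assuming a standard TorizonCore image:

    # Check whether the boot-complete target has been reached
    systemctl is-active boot-complete.target

    # Show the hardware watchdog timeout configured in systemd
    systemctl show --property=RuntimeWatchdogUSec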

A TorizonCore user may also define their own rules to validate an update, using the Greenboot framework. Greenboot (Generic Health Check Framework) is a Fedora project that helps manage the health of systemd services, and TorizonCore uses it as a framework to make update checks more flexible and manageable for the user. With Greenboot, you can define a shell script that performs additional checks on the system and forces a reboot if needed. For more information about how to use Greenboot, have a look at Update Checks and Rollbacks.
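
As an illustration, a minimal Greenboot health check could look like the sketch below. It assumes the standard Greenboot layout, where required checks live in /etc/greenboot/check/required.d/ and a non-zero exit code marks the boot as failed; my-app.service is a hypothetical service name:

    #!/bin/sh
    # /etc/greenboot/check/required.d/50-check-my-app.sh
    # Fail the health check (non-zero exit code) if the hypothetical
    # application service is not running.
    if ! systemctl is-active --quiet my-app.service; then
        echo "my-app.service is not running" >&2
        exit 1
    fi
    exit 0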

Identifying unsuccessful container updates

In addition to general OS updates, you can also separately update the containers (your application) on a TorizonCore device. These updates use the same update framework as OS updates but are otherwise different in some ways.

Most important among these differences are the conditions for a successful container update. Unlike OS updates, a container update does not require a reboot, which eliminates the possibility of checks at boot. Furthermore, the use cases for containers are far more varied, making it difficult to define update checks that account for all possible cases.

Therefore, we have opted for basic, general checks upon update. The checks performed are as follows:

  • Running docker-compose pull --no-parallel on the new docker-compose.yml, to pull the new container images.
  • Running docker-compose -p torizon down on the old docker-compose.yml, to stop and remove any container associated with the old docker-compose.yml.
  • Running docker-compose -p torizon up --detach --remove-orphans on the new docker-compose.yml, to bring up the new containers as defined by the parameters of the compose file.
  • If everything has been successful so far, the new docker-compose.yml overwrites and replaces the old docker-compose.yml file.
  • Finally, running docker system prune -a --force. This cleans up any unused containers, networks, and images from the device.

If any of the above commands fails, the entire update is considered a failure. Failure is defined by the exit code returned by the command: 0 is considered a success, while all other exit codes are failures.
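
The sequence above can be approximated by the following shell sketch. This is only an illustration of the logic, not the actual Aktualizr-Torizon implementation, and the file paths are hypothetical placeholders:

    #!/bin/sh
    # Illustrative sketch of the container update checks; with set -e,
    # any command returning a non-zero exit code aborts the update.
    set -e

    NEW=/path/to/new/docker-compose.yml   # hypothetical location
    OLD=/path/to/old/docker-compose.yml   # hypothetical location

    docker-compose --file "$NEW" pull --no-parallel
    docker-compose --file "$OLD" --project-name torizon down
    docker-compose --file "$NEW" --project-name torizon up --detach --remove-orphans
    mv "$NEW" "$OLD"                      # the new file replaces the old one
    docker system prune --all --force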

An important thing to note is that no further checks are made on the state of a container after it is started. This can lead to instances where a container starts successfully but exits soon after due to an error. By the above checks, this would still be considered a successful update.

This means it is important that you verify the status of your containers after they have started. Remote and offline updates are not the same as health monitoring, which would allow containers to be restarted automatically if they stop meeting user-defined health criteria.
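
For example, you can inspect the state of the containers on the device after an update; my-container below is a hypothetical container name:

    # List all containers, including those that have already exited
    docker ps --all

    # Query the state of a specific (hypothetical) container
    docker inspect --format '{{.State.Status}}' my-container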

Rolling back to the previous operating system version

As mentioned above, TorizonCore will automatically roll back after 3 unsuccessful boots. This automatic rollback feature relies on Aktualizr-Torizon's rollback support and U-Boot's bootcount feature.

TorizonCore uses Aktualizr-Torizon with rollback_mode set to uboot_masked. This enables Aktualizr-Torizon’s U-Boot bootcount integration:

  1. After an update, Aktualizr-Torizon enables boot counting by setting U-Boot's environment variables upgrade_available to 1 and bootlimit to 3.
  2. In case of a bad update, the system reboots and U-Boot increments the bootcount environment variable. After three failed boots (when bootcount is greater than bootlimit), the system rolls back to the previously installed OS version.
  3. In case of a good update, Aktualizr-Torizon starts normally and the U-Boot environment variables upgrade_available and bootcount are set back to 0.
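
You can observe these variables from Linux with the fw_printenv tool, assuming access to the U-Boot environment is configured, as it is on standard TorizonCore images:

    # Read the boot counting variables from the U-Boot environment
    sudo fw_printenv bootcount bootlimit upgrade_available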

TorizonCore's OTA and offline updates can roll back to the last installed update thanks to the OSTree-based root filesystem. OSTree also makes it possible to keep multiple deployments (kernel/initramfs/device tree and the rootfs) on a system and have them all bootable. The initial (factory) image has only a single deployment available, which is assumed to be a working deployment (no rollback can be done at this point). After the first update has been rolled out, there will be two deployments on the system at all times. If a new deployment fails, the system will automatically roll back to the previous deployment.

Note: When installing an update without Aktualizr-Torizon (e.g. using ostree admin directly), automatic rollback will not work. To roll back in a pure OSTree system, the steps need to be executed manually, as described in OSTree.
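
To see which deployments are currently present on a device, you can list them with OSTree:

    # List the deployments on the system; the entry marked with '*'
    # is the currently booted one
    sudo ostree admin status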

Synchronous updates (5.4.0)

Starting with TorizonCore 5.4.0, a new feature called "synchronous updates" was added. In the context of Torizon OTA and offline updates, a "synchronous update" refers to updating the OS and the application packages on a TorizonCore device at the same time. However, such an update is more than just updating two components at once: it is truly synchronous, in the sense that both the OS and the application must update successfully, or they fail together as if they were a single component.

The main motivation behind synchronous updates is cases where the OS and the application are intertwined. For example, imagine you have a new application that relies on a new driver, and this driver is only available in a new version of the OS. Given this dependency, it would be ideal to update both the OS and the application at the same time. Also, due to the design of synchronous updates, it is guaranteed that you will end up with a system where either both components failed to update or both updated successfully. This way the system prevents awkward scenarios where only one component is updated successfully, leaving you with a halfway-updated system.

However, the above example is not the only situation that warrants synchronous updates. In general, if you want to tie the failure and success states of an OS and an application update to one another, then synchronous updates are the solution. Otherwise, if it is acceptable for the OS and application updates to fail or succeed independently of one another, then two non-synchronous updates are sufficient.

Synchronous update procedure

Given the requirement to tie the success and failure states of both OS and application updates, the process of a synchronous update differs quite a bit from a standard non-synchronous update. As a user, in order to debug possible update failures, it's important to understand the general process of a synchronous update.

  • Update check: The process begins with an update check to see if any new updates are available for the system. This is exactly the same as in the non-synchronous case.
  • Download: Next, if a new update has been confirmed, the device begins fetching the images/firmware needed for the update.
    • Download failure: If the download of either component fails, the entire download is considered a failure and the update process stops here.
  • OS installation start: If the download phase was a success, the installation phase begins. First, the OS installation starts; as in the non-synchronous case, the update will only be finalized on system reboot.
  • Application installation start: After the OS installation comes the application installation. From here on, the process differs heavily from the non-synchronous case. Initially, all that is done is pulling the new container images specified by the newly downloaded docker-compose.yml.
    • Application pre-installation failure: If this pull fails, a flag is set to roll back the OS update on reboot.
  • Reboot and OS installation finish: At this point, the new OS update is pending on the next boot and the new set of container images has been downloaded. Both updates must now be finalized with a reboot.
    • OS installation failure: If, after the reboot, the OS update appears to have failed, a rollback to the previous OS version is performed. Then, the new docker-compose.yml is removed and the new container images are pruned from the system.
  • Application installation finish: After the reboot, if the OS update appears to have succeeded, the old docker-compose.yml is brought down and the new docker-compose.yml is brought up.
    • Application installation failure: If the new docker-compose.yml fails to come up successfully, it is removed and the new container images are pruned from the system. A flag is then set telling the system to roll back to the previous OS, and a reboot is triggered to perform the rollback. The old docker-compose.yml and containers are still on the system, so the rollback sets the system up to use them.
  • Update successful and cleanup: If the new docker-compose.yml has been brought up successfully, the old docker-compose.yml is removed and replaced with the new one. Finally, the system is pruned to clean up old containers and images left over by the now-previous docker-compose.yml.

Webinars

Beyond Development: Torizon

Meeting the Challenge of OTA for Embedded Linux Systems