The latest version of OS Agent sets haos.wipe=1 as kernel argument to
trigger a device wipe. Let systemd pickup this kernel command line
argument and start haos-wipe.service.
This rather complex architecture allows to add other triggers in the
future, e.g. a button read in the boot loader.
There are incident reports on the internet where poeple report that
fsck.(v)fat actually leads to problems rather file system fixes. Around
the time when Home Assistant OS added fsck.fat for the boot partition,
reports of empty boot partitions or file with weired filenames started
to appear. This could be caused by fsck.fat.
Disable fsck on the boot partition.
The supervisor container requires the "hassio-supervisor" AppArmor
profile. Make sure our AppArmor service hassos-apparmor is a dependency
of the hassos-supervisor.service.
* Use systemd-growfs instead of resize2fs (#1106)
Since systemd 236 systemd has a built-in file system growing mechanism.
The mechanism relies on the kernels online file system resize
capabilities instead of the external resize2fs utility. Online resizing
is supposedly much faster since the kernel takes care of things.
This also makes sure that external file systems get resized which
previously have not been taken care of.
* Drop HA OS specific file system resizing
Since we have systemd-growfs in place now we can drop our file system
resizing code.
* Make sure /dev/disk/by-label/hassos-data is present after resizing
Note: systemd will retry mnt-data.mount later, so at least in theory
this shouldn't really matter. However, the journal has a lot of churn
due to that reordering.
When we write the update to the boot partiton, there is nothing which
makes sure that data is written to disk. This leaves a rather large
window (probably around 30s) where a machine reset/poweroff can lead
to a corrupted boot partition. Use the sync mount option to minimize the
corruption window.
Note that sync is not ideal for flash drives normally. But since we
write very little and typically only on OS update to the boot partition,
this shouldn't be a problem.
* Avoid waiting for external drive unnecessarily
Even though the condition to start hassos-data.service is not met (the
file /mnt/overlay/data-move is not there by default), it seems that
systemd waits for the dependencies for hassos-data.service. Don't
Require or Wants any dependencies which might not be present by
default.
* Use systemd to wait for partition using partlabel device
* Use sfdisk which allows to wipe filesystem signatures
Even though we zap the partition table using sgdisk, the file system
superblock (which contains the file system label) does survive. This
can cause problems when trying to reuse a disk previously already
labeled using hassos-data: It might take precendence on next boot
over the existing data partition on the eMMC.
Make sure to clean all file system signatures using sfdisk.
* Make the datactl command more robust
Validate target disk (partition) size to avoid a copy attempt which will
fail. If e2image operation fails, make sure the leftover copy is not
regonized as data partition.
* Fix hassos-data service device unit dependencies
* Rewrite datactl command
Prepare the target partition as part of the datactl command. Rely on
partlabel for the target disk since we are always using GPT on the
target disk. Use systemd and partlabel mechanism to wait and find
the target data disk. Keep using the file system label to identify
the source disk.
Also use e2image instead of raw dd to move data. This should
speed up the processes significantly.
* Fix corner case when reusing same disk again
In case a container image is corrupted `docker inspect` might fail:
# docker inspect --format='{{.Id}}' "${SUPERVISOR_IMAGE}"
Error response from daemon: readlink /mnt/data/docker/overlay2: invalid argument
In that same state the `docker images` command still shows the images.
Since `docker inspect` returns an error SUPERVISOR_IMAGE_ID will be empty
and a simple `docker pull` will be attempted. That does not suffice to
recover from a corrupted container image.
Use `docker images` to get the image ids and make sure to delete all
image ids found by that command.
Also don't use RuntimeDirectory since it deletes the runtime directory
between the service start attempts which defeats the purpose.
* Simplify self healing capabilities of Supervisor service
Instead of relying on time based information on how long the container
has been running use a startup marker file to infer if the last startup
has been successful.
* Update buildroot-external/rootfs-overlay/usr/sbin/hassos-supervisor
Co-authored-by: Pascal Vizeli <pascal.vizeli@syshack.ch>
Co-authored-by: Pascal Vizeli <pascal.vizeli@syshack.ch>
* automatically fsck to repair partitions
* add fsck.fat so rpi boot partition can be repaired
* Use Wants= instead of Requires=
Co-authored-by: Pascal Vizeli <pascal.vizeli@syshack.ch>
* add dosfstools to all images
* run hassos-data and hassos-expand after fsck
Co-authored-by: Pascal Vizeli <pascal.vizeli@syshack.ch>
The Docker socket path is /run/docker.sock. Also only one path can be
used per property. This fixes the supervisor service, which currently
refuses to start due to missing Docker socket.
dhclient and systemd-journald will be running during shutdown and are
only killed in the final shutdown fase. Unmounting the directories
they use will fail. Use lazy unmouting to fix this.
On systems where ACPI support is present as inidcated by the presence of
/proc/acpi (e.g. on OVA compatible hypervisors), we want to properly
shut down the system when the power button is pressed (or the hypervisor
simulates this kind of event to the guest machine that executes hassos).
This changeset provides the following basic infrastructure for this
feature to work as expected:
* a systemd service to start acpid, if ACPI support can be assumed
* an acpid configuration directory
* a trivial shutdown script to invoke when a PWR event is registered