
feat: Persistent Active Boot Log Collection into a Persistent Directory #3381


Open
pavansokkenagaraj opened this issue Apr 30, 2025 · 2 comments
Labels
enhancement New feature or request triage Add this label to issues that should be triaged and prioritized in the next planning call

Comments

@pavansokkenagaraj

Is your feature request related to a problem? Please describe.

Yes. When an upgrade fails or a node reboots unexpectedly (e.g., power cycle during upgrade), the system may fall back to booting the passive image due to upgrade_failure flags or incomplete boot assessments.
Currently, there is no persistent logging of the active boot process, which makes postmortem diagnostics difficult, especially in dark-site environments with no console or GRUB access.
This results in time-consuming manual debugging, as we cannot easily determine the cause of the fallback or incomplete boot.

Describe the solution you'd like

Write active boot logs into a persistent directory

Describe alternatives you've considered

Additional context

This enhancement would support better RCA and faster triage for upgrade failures or boot rollbacks in real-world field deployments, particularly on edge systems.

Slack thread for context:
SpectroCloud Slack – discussion on boot log persistence

@pavansokkenagaraj pavansokkenagaraj added enhancement New feature or request triage Add this label to issues that should be triaged and prioritized in the next planning call labels Apr 30, 2025
@Itxaka
Member

Itxaka commented Apr 30, 2025

Note that this usually happens automatically as long as active is able to switch root into the final system, because at that point /var/log/ is mounted on persistent storage.

Here is an example of a machine that I set up a couple of days ago and just booted again; all logs are preserved:

Image

But there is a window between kernel -> initramfs -> immucore -> switch root in which those logs are volatile, as there is no persistence anywhere. The initramfs journal logs into memory until the switch root happens, at which point (I believe) journald writes the in-memory logs to disk. There is not much that we can do before that point to store logs.

As that failure (again, the first time we have seen this in more than 2 years with Kairos) could have happened at any point before the logs are stored to disk, there is not much that we can do.

One could build a custom initramfs that forwards logs in real time to a remote syslog, but that would require knowing the remote in advance in order to build for it, which I think is a very specific feature that does not fit Kairos' broader usage.

Also note that even if a Kairos consumer added this, you would hit the same issue if the network or journald are not up before the crash occurs, as no logs would be forwarded.

If you want to support boot rollback or identify the failure, it's very simple: if the system is booting on passive after an upgrade (you can stat /run/cos/passive_mode), then there has been a problem booting active. You can also check /proc/cmdline for the upgrade_failure key. A simple reboot might fix it, as it did in the only case reported, and the machine can easily be set to boot active again via an SSH command followed by a reboot.
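The two checks described above could be combined into a small helper along these lines. This is only a sketch: the function name `boot_state` and its two optional path arguments are hypothetical (added so it can be tried on a non-Kairos machine), and it assumes the `upgrade_failure` key appears on the cmdline either bare or as `upgrade_failure=<value>`.

```shell
#!/bin/sh
# Sketch: report whether this boot looks like a fallback after a failed upgrade.
# Defaults are the paths mentioned in this thread; both arguments are
# hypothetical overrides, not part of any Kairos tooling.
boot_state() {
    sentinel="${1:-/run/cos/passive_mode}"
    cmdline_file="${2:-/proc/cmdline}"

    state="active-or-other"
    if [ -e "$sentinel" ]; then
        state="passive"
    fi

    # Split the cmdline into one argument per line and look for the key,
    # with or without a value (assumption: either form may appear).
    if tr ' ' '\n' < "$cmdline_file" | grep -Eq '^upgrade_failure(=.*)?$'; then
        echo "$state upgrade_failure"
    else
        echo "$state"
    fi
}
```

Calling `boot_state` with no arguments on the node prints `passive`, optionally followed by `upgrade_failure`, when the fallback conditions are present, which could be surfaced by whatever monitoring is already in place.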

@Itxaka
Member

Itxaka commented Apr 30, 2025

It would be possible to forward all logs via journald itself to other places, but note that this is not something we can do by default: it either requires a known URL or, when forwarding to console/kmsg, is not recommended because it is very slow.

If active keeps rebooting or crashing, users can add any of the following to the cmdline:

systemd.journald.forward_to_syslog
systemd.journald.forward_to_kmsg
systemd.journald.forward_to_console
systemd.journald.forward_to_wall

to enable them and see more information. Alternatively, they can configure something like https://www.freedesktop.org/software/systemd/man/latest/systemd-journal-upload.service.html to trigger as a dependency of the rescue/reboot, so that log messages are uploaded before a crash reboot is triggered. Again, all of these approaches have their own caveats and are best left to the user.
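After editing the boot entry, it can be useful to confirm which of these options actually reached the kernel. A minimal sketch, assuming the options are passed either bare or with a boolean value such as `=1`; the function name `journald_forwards` and its optional path argument are hypothetical:

```shell
#!/bin/sh
# Sketch: list the journald forwarding options present on a kernel cmdline.
# The optional argument is a hypothetical override for testing; by default
# the running kernel's /proc/cmdline is read.
journald_forwards() {
    cmdline_file="${1:-/proc/cmdline}"
    # One argument per line, then keep only the four forward_to_* options.
    # "|| true" keeps the exit status at 0 when none are set.
    tr ' ' '\n' < "$cmdline_file" \
        | grep -E '^systemd\.journald\.forward_to_(syslog|kmsg|console|wall)(=.*)?$' \
        || true
}
```

An empty result means none of the forwarding options are set for the current boot.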
