This morning I woke up, rebooted a living room PC and got thrown into a 2-hour troubleshooting session over a problem that I still don’t understand the reason for. I’m writing this in the hope of understanding the whys, and of learning how to avoid similar pitfalls.

I recently installed a living room PC running Fedora 44 with Plasma Big Screen, and its purpose is to be a Steam Link machine, a Jellyfin server and maybe, down the line, a game server for some co-op games (Zomboid, Valheim…). For about a week, everything was perfect.

Until this morning.

When I turned on my TV, my system was showing some errors in qBittorrent, and I decided to reboot just in case. That was when my system completely locked me out: it threw me into emergency mode, and I had no access to root, so nothing could be done. All I could do was watch an endless loop of my system trying to do something impossible, occasionally pressing Enter to restart the loop. That is my first gripe: why throw someone into emergency mode if it’s just going to lock them out?

I tried restarting a few times, unplugging things, reseating RAM and the like, just in case. When nothing worked, it seemed I’d have to do some research, in the hope of not having to wipe it clean and start anew.

So off I went, searching the web for my problem and trying to find a solution. After reading some very long forum posts, it turned out I needed more information about what had actually caused this, but it was likely something about fstab. And here is my second gripe: why did the system not show me the error before dropping into emergency mode? I got zero error messages because the default boot setting is rhgb quiet… Is this just a Fedora thing, or is every Linux distro the same?

(I’m going to add here that I’m in the process of switching all my PCs to Linux, and this was a first test. I’m also going to switch my family’s PCs, so I need to shine my Linux shoes and put on some big boy IT pants for the future. That’s why I’m writing this post: to learn from your experiences.)

So off I went to do some stuff with GRUB to find the error. I decided to test ChatGPT and see if it could guide me (I’m a noob, if it wasn’t obvious yet), and it took more than an hour of troubleshooting with GRUB and bash to finally see that the problem was with a drive whose UUID did not match my system drive (a silver lining, I guess). But here’s the thing: as soon as the reboot loop started, I had a suspicion that one of the old spinning HDDs might be causing it (I need to replace those, but they’re fine and working for now, and in this economy…). So I had unplugged all of them during my hardware troubleshooting step, and up to this point only my NVMe disk (which is brand new) was connected. I was completely blindsided that even with the drives disconnected, my system still wouldn’t boot: it expects the drives to be there, and if they aren’t, it won’t boot, even though everything else is working fine! This is my third gripe. Is this a default setting? Something about Fedora? Why is it done this way? It just doesn’t seem logical to me to lock me out of the whole system because a non-essential part is not working or present.
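
(Side note for anyone who lands here with the same problem: from what I’ve pieced together since, this doesn’t seem to be Fedora-specific. On systemd-based distros, an entry in /etc/fstab is by default required for boot, and the nofail option is what marks it as optional. Here’s a rough sketch for spotting the entries that lack it. The function name is made up, and the example line in the comments uses a placeholder UUID and mount point, not my real ones.)

```shell
# List mount points in an fstab file that lack the "nofail" option,
# i.e. the mounts the boot waits for and fails on if the device is missing.
# $1 is the fstab path (defaults to /etc/fstab).
list_required_mounts() {
    awk '
        $1 ~ /^#/ || NF < 4 { next }              # skip comments and short lines
        ("," $4 ",") !~ /,nofail,/ { print $2 }   # no nofail among the options
    ' "${1:-/etc/fstab}"
}

# A non-essential data drive could then be marked optional like this
# (placeholder UUID and mount point; the timeout shortens the default wait):
# UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx /mnt/media ext4 defaults,nofail,x-systemd.device-timeout=10s 0 2
```

Running list_required_mounts with no argument reads /etc/fstab. The root filesystem and /boot will show up in the list too, and those should stay required.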

Anyway, after unplugging and re-plugging the drives, I finally discovered that the culprit was not a drive but a PCIe SATA expansion card that had timed out. The one smaller drive I had been using with that card was the one affected, but after I plugged it straight into the motherboard (the slots are precious, okay? I was saving them for bigger drives in the future), it worked just fine. My system booted normally.

That was 2 hours-ish that could have been just 5 minutes if the system had actually told me it was having problems with connecting to a drive. Also, chatgpt did help, but boy, it didn’t have a good troubleshooting order at all. It was just shooting in every direction and hoping something would stick. But I don’t think trying to find my fix in forum posts would have been any better.

  • FauxLiving@lemmy.world
    10 hours ago

    “That was 2 hours-ish that could have been just 5 minutes if the system had actually told me it was having problems with connecting to a drive.”

    It did, it almost certainly wrote the error into the system log. That’s generally the first thing you should check if you’re having such low-level problems, like a failure to boot.

    Going forward some good advice is: Get a USB stick, install Ventoy and put a system rescue image on it: https://www.system-rescue.org/

    This way, if you’re ever having issues booting you can boot into a live environment and read the logs.

    If you mount your system drive on /sysroot, you can read the system log with

    journalctl --root=/sysroot
    

    Add a -b flag to see just the log from the latest boot (and -b -1 from the previous, -b -2 from the one before, etc)
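
    As a rough sketch of how that could look once the drive is mounted: the show_boot_errors name and the SYSROOT variable are made up for this example, and -p err limits the output to messages of error priority and worse, which should cut down the noise.

    ```shell
    # Assumption: the installed system is already mounted at /sysroot.
    SYSROOT="${SYSROOT:-/sysroot}"

    # Print error-level (and worse) messages from one boot of the installed system.
    # $1 is the boot offset: 0 = latest recorded boot, -1 = the one before, etc.
    show_boot_errors() {
        journalctl --root="$SYSROOT" -b "${1:-0}" -p err --no-pager
    }
    ```

    Call show_boot_errors for the latest boot, or show_boot_errors -1 for the one before it.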

    • alphabethunter@lemmy.world (OP)
      3 hours ago

      Thanks! I’ll do that today. I did read a lot of the journal (that was how I eventually figured out the problem), but getting to the journal was already a process. What I don’t understand is why super critical errors are hidden away in a log instead of being shoved in my face once they happen.

      • nyan@sh.itjust.works
        26 minutes ago

        Because even a headless server with no email capability can write to a log, as long as it can mount its root drive.

        That being said, if your system is hiding stuff behind some kind of splash screen at boot time, turn it off. I suspect your error would have been right there on screen in plain white-on-black text if it had happened on one of my systems (granted, I use OpenRC and not systemd, but I expect the latter also provides a running commentary on what it’s doing at boot until the graphics stack loads).
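
        On Fedora, the splash screen and the suppressed messages come from the rhgb and quiet kernel arguments, and grubby is the usual tool for changing them. A minimal sketch, assuming grubby is installed and you have sudo rights (the function names are made up for the example):

        ```shell
        # Assumption: Fedora (or similar) with grubby available and sudo rights.
        # Removes the graphical splash (rhgb) and message suppression (quiet)
        # from the boot entries of all installed kernels.
        disable_boot_splash() {
            sudo grubby --update-kernel=ALL --remove-args="rhgb quiet"
        }

        # Shows the default boot entry, including its current kernel arguments.
        show_default_boot_entry() {
            sudo grubby --info=DEFAULT
        }
        ```

        Run show_default_boot_entry before and after disable_boot_splash to check that the two arguments are gone, then reboot to see the boot messages scroll by.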