I have an NVMe base for my Raspberry Pi 5 and it's been running for a little over a year without any issues. A couple of months ago I switched to AlmaLinux and noticed that I could SSH into it right after it booted, but around 30-40 minutes after boot I could still ping it from another host on my network yet could no longer SSH in. When I restarted and connected directly at the console, everything was fine until that ~40 minute mark, when I started getting a ton of I/O errors. I couldn't even run basic commands like "ls" or "cd". It took me a couple of months of off-and-on fiddling to track down the problem.
Initially I thought it was the NVMe base or the drive, but I swapped both out and still had the issue. Then I made the connection that the I/O errors usually started around the same time after bootup. Watching the console, among all the messages, I noticed that the NVMe went into power saving mode not long before the I/O errors started to occur.
I think I solved the issue by preventing the NVMe from going into power saving mode. All it took was adding a single kernel parameter to the Raspberry Pi's boot configuration, and I haven't experienced any I/O errors since.
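On AlmaLinux, edit /boot/efi/cmdline.txt and append pcie_aspm=off to the end of the existing line, separated by a space, then restart. Everything in cmdline.txt has to stay on one line, so don't put the parameter on a new line of its own. On Raspberry Pi OS it's the same parameter, but the file is located at /boot/firmware/cmdline.txt.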
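If you'd rather script the edit than do it by hand, here is a rough Python sketch of the same change. The function names (with_kernel_param, patch_cmdline_file) and the .bak backup are my own choices for illustration, not part of any Raspberry Pi tooling, and it assumes the arguments in cmdline.txt are separated by plain whitespace.

```python
from pathlib import Path

PARAM = "pcie_aspm=off"


def with_kernel_param(cmdline: str, param: str = PARAM) -> str:
    """Return the command line with param appended once, on a single line."""
    # Splitting on whitespace also folds any stray newlines back into one line.
    tokens = cmdline.split()
    if param not in tokens:
        tokens.append(param)
    return " ".join(tokens)


def patch_cmdline_file(path: str, param: str = PARAM) -> bool:
    """Append param to the file at path. Returns True if the file was changed."""
    file = Path(path)
    original = file.read_text()
    if param in original.split():
        return False  # already present, nothing to do
    # Hypothetical safety net: keep a copy of the old file next to the new one.
    file.with_name(file.name + ".bak").write_text(original)
    file.write_text(with_kernel_param(original, param) + "\n")
    return True
```

Run as root, something like patch_cmdline_file("/boot/efi/cmdline.txt") on AlmaLinux or patch_cmdline_file("/boot/firmware/cmdline.txt") on Raspberry Pi OS should do it. After a restart, pcie_aspm=off should show up in the output of cat /proc/cmdline.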
I think the power saving kicks in at different times depending on the OS. On Rocky Linux I started having I/O errors almost immediately on startup and didn't get the opportunity to edit the file.
What's odd is that when I was running EndeavourOS I didn't experience this issue until I upgraded the system, which suggests a software issue rather than a hardware one. It's taken me a while to track this down, so hopefully I'll never see it pop up again.
Thanks for reading. Feel free to send comments, questions, or recommendations to hey@chuck.is.