Nvidia RAID users: Fedora 10 might eat your data

Posted by Roel Gloudemans on 7 December 2008

We interrupt this boring string of blog posts for some real news.

Out of habit and curiosity I always install the latest Fedora release on my desktop as soon as it is proven that the Nvidia display driver works (hey, if you work for large corporations, Quake 3 or OpenArena is a necessity ;-) ). This is exactly what I did when Fedora 10 came out. It installed OK and then ran fine ... until I decided to change something in the GRUB config.

I made a typo, so the system didn't boot any more. No worries, the rescue CD is here. As soon as I had repaired the error I got some disk warnings on the spiffy new Fedora start screen. Initially I ignored these. In the session that followed I got bus errors from Java. My Wine apps wouldn't even start any more, so I rebooted and pressed <esc> during the boot so I could see the start-up messages scrolling by. I had skipped this step during the first boot, with disastrous results.

During boot the system complained about a duplicate physical volume (PV) ID and told me it was using /dev/sdb and not /dev/sda. Hang on; I have the RAID controller on my Asus A8N-E (Nforce4 chipset) configured. Shouldn't it be using /dev/dm-[something] or even something from /dev/mapper? It turns out the RAID set was not initialized, and because it is one of those fake RAIDs, the individual disks are reported to the system as well.

So here is what happened: the RAID set was never initialized after installation (during installation it went fine) and only the second disk from the set was used. This works fine as long as that disk doesn't run into problems. The rescue CD, however, initializes the RAID controller properly. When that is the case, the first disk from the set is leading. So when I repaired GRUB using the rescue CD, I actually corrupted the disk Fedora 10 was using.

This bug has already been reported to the Fedora Bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=474697.

The scary part is what will happen when this bug is fixed. If no special attention is called to it, then once the patch is installed and the system is rebooted, the RAID set will be initialized and the first disk will lead again. All Nvidia RAID users will then be confronted with the system as it was right after the initial install, with no more access to their data.

The only way out then is to boot from the rescue disk. Do not let it mount the filesystems on disk (this is what caused my problem), but do initialize the network card. If you are using LVM you can then discover and activate the volume group without initializing the RAID array ("lvm vgscan" and "lvm vgchange -a y [volume group name]"). If you are not using LVM, you can skip that step. You can then mount the volume or partition to see if you still have access to your data, but I suspect you will have to do a disk check first (use fsck -f -a, because there will be a lot of errors). After I repaired the disk, most of my stuff was under a subdirectory in the lost+found of that disk. I didn't lose anything (except my pride) and luckily it happened only days after installation, so I still had a recent backup.
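The recovery steps above can be sketched as a small script. This is only a sketch under assumptions: the volume group name VolGroup00, the logical volume name LogVol00 and the mount point /mnt/data are placeholders for whatever your system uses, and the run and rescue_steps helpers are made up for this example. By default it only prints the commands (dry run), so you can check them before running anything for real. Mounting read-only is an extra precaution of this sketch, not something the steps above require.

```shell
#!/bin/sh
# Sketch of the rescue sequence; by default it only PRINTS the commands.
# VolGroup00, LogVol00 and /mnt/data are assumed placeholder names.
VG="${VG:-VolGroup00}"
LV="${LV:-LogVol00}"
MNT="${MNT:-/mnt/data}"

# Hypothetical helper: echo the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rescue_steps() {
    run lvm vgscan                      # discover volume groups
    run lvm vgchange -a y "$VG"         # activate the volume group
    run fsck -f -a "/dev/$VG/$LV"       # forced check, automatic repair
    run mkdir -p "$MNT"
    run mount -o ro "/dev/$VG/$LV" "$MNT"   # read-only: extra caution
}

rescue_steps
```

If you are not using LVM, leave out the two lvm lines and point fsck and mount at the partition instead. To actually execute the commands from the rescue shell, set DRY_RUN=0 together with your own VG and LV names.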

What should you do when you have an Nvidia RAID controller and haven't noticed anything yet?

  • Hit <esc> during boot and see if you are hit by this problem. If you see something about a duplicate PV, you have a problem.
  • Continue the boot and back up your data to another system or DVD.
  • Reboot your system and disable the RAID controller in the BIOS setup.
  • Reboot and re-install Fedora using software RAID. And no, you will not lose any performance; the hardware RAID was fake anyway. You were already using your main CPU to mirror the data.
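If you missed the boot messages, the first check in the list can also be done from a running system. The sketch below assumes the warning contains the words "duplicate PV", as it did on my screen; the function name has_duplicate_pv is made up for this example. It simply looks for that phrase in whatever text you feed it.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the text on stdin mentions a duplicate PV.
# Assumes the warning contains the phrase "duplicate PV".
has_duplicate_pv() {
    grep -qi 'duplicate PV'
}

# Example use as root (left commented out so that nothing runs by accident):
#   lvm pvscan 2>&1 | has_duplicate_pv && echo "Duplicate PV found, back up now"
```

The 2>&1 is there because LVM prints its warnings on standard error. No output from the example line does not prove you are safe, so still watch the boot messages once.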