Software RAID, Hardware RAID, and Woes Oh My!

By Paulus, 7 November, 2007

I woke up on Halloween to the sound of my music playing. Then, all of a sudden, it stopped. I got up not thinking much of it. The first thing I do when I get up in the morning is check my email. When Thunderbird wouldn't open and programs stopped responding, I got suspicious. I dropped to VT12 to see what was going on. This is what was being reported:

Oct 31 08:15:07 black Buffer I/O error on device md4, logical block 279433447
Oct 31 08:15:07 black Buffer I/O error on device md4, logical block 279433447
Oct 31 08:15:07 black Buffer I/O error on device md4, logical block 279433447
Oct 31 08:15:07 black Buffer I/O error on device md4, logical block 279433447
Oct 31 08:15:07 black Buffer I/O error on device md4, logical block 279433447
Oct 31 08:15:07 black Buffer I/O error on device md4, logical block 279433447
Oct 31 08:15:07 black Buffer I/O error on device md4, logical block 279427948
Oct 31 08:15:07 black Buffer I/O error on device md4, logical block 279427948
Oct 31 08:15:07 black Buffer I/O error on device md4, logical block 279427948
Oct 31 08:15:07 black Buffer I/O error on device md4, logical block 279427948

My first reaction was "Oh crap, I have a bad hard drive." Going through the log file I found where the problems started:

Oct 31 05:00:42 black ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x1950000 action 0x2 frozen
Oct 31 05:00:42 black ata6.00: tag 0 cmd 0xea Emask 0x14 stat 0x40 err 0x0 (ATA bus error)
Oct 31 05:00:42 black ata5.00: exception Emask 0x10 SAct 0x0 SErr 0x1850000 action 0x2 frozen
Oct 31 05:00:42 black ata5.00: tag 0 cmd 0xea Emask 0x14 stat 0x40 err 0x0 (ATA bus error)
Oct 31 05:00:42 black ata6: soft resetting port
Oct 31 05:00:42 black ata5: soft resetting port
Oct 31 05:00:42 black ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 31 05:00:42 black ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 31 05:00:42 black ata5.00: configured for UDMA/133
Oct 31 05:00:42 black ata5: EH complete
Oct 31 05:00:42 black ata6.00: configured for UDMA/133
Oct 31 05:00:42 black ata6: EH complete
Oct 31 05:00:42 black SCSI device sdf: 976773168 512-byte hdwr sectors (500108 MB)
Oct 31 05:00:42 black sdf: Write Protect is off
Oct 31 05:00:42 black sdf: Mode Sense: 00 3a 00 00
Oct 31 05:00:42 black SCSI device sdf: drive cache: write back
Oct 31 05:00:42 black SCSI device sde: 976773168 512-byte hdwr sectors (500108 MB)

After about 10 minutes, two of the hard drives could no longer recover and disappeared from the system. I rebooted the system and got the drives to come back. The first thing I did after logging back in was make sure my drives were OK:

# for s in "c" "d" "e" "f"; do smartctl -d ata --test=long /dev/sd$s; done
# for s in "c" "d" "e" "f"; do smartctl -d ata --all /dev/sd$s | grep "SMART Error Log" -A 20; done

No errors, so maybe the system had just hiccuped. I really didn't think much about it; I figured the system had fixed itself when I rebooted. I was wrong. This is what I was seeing:

Oct 31 16:31:08 black md: Autodetecting RAID arrays.
Oct 31 16:31:08 black md: autorun ...
Oct 31 16:31:08 black md: considering sdf1 ...
Oct 31 16:31:08 black md: adding sdf1 ...
Oct 31 16:31:08 black md: adding sde1 ...
Oct 31 16:31:08 black md: adding sdd1 ...
Oct 31 16:31:08 black md: adding sdc1 ...
Oct 31 16:31:08 black md: sdb3 has different UUID to sdf1
Oct 31 16:31:08 black md: sdb2 has different UUID to sdf1
Oct 31 16:31:08 black md: sdb1 has different UUID to sdf1
Oct 31 16:31:08 black md: sda3 has different UUID to sdf1
Oct 31 16:31:08 black md: sda2 has different UUID to sdf1
Oct 31 16:31:08 black md: sda1 has different UUID to sdf1
Oct 31 16:31:08 black md: created md4
Oct 31 16:31:08 black md: bind
Oct 31 16:31:08 black md: bind
Oct 31 16:31:08 black md: bind
Oct 31 16:31:08 black md: bind
Oct 31 16:31:08 black md: running:
Oct 31 16:31:08 black md: kicking non-fresh sdf1 from array!
Oct 31 16:31:08 black md: unbind
Oct 31 16:31:08 black md: export_rdev(sdf1)
Oct 31 16:31:08 black md: kicking non-fresh sde1 from array!
Oct 31 16:31:08 black md: unbind
Oct 31 16:31:08 black md: export_rdev(sde1)
Oct 31 16:31:08 black raid5: device sdd1 operational as raid disk 1
Oct 31 16:31:08 black raid5: device sdc1 operational as raid disk 0
Oct 31 16:31:08 black raid5: not enough operational devices for md4 (2/4 failed)
Oct 31 16:31:08 black RAID5 conf printout:
Oct 31 16:31:08 black --- rd:4 wd:2
Oct 31 16:31:08 black disk 0, o:1, dev:sdc1
Oct 31 16:31:08 black disk 1, o:1, dev:sdd1
Oct 31 16:31:08 black raid5: failed to run raid set md4
Oct 31 16:31:08 black md: pers->run() failed ...
Oct 31 16:31:08 black md: do_md_run() returned -5
Oct 31 16:31:08 black md: md4 stopped.
Oct 31 16:31:08 black md: unbind
Oct 31 16:31:08 black md: export_rdev(sdd1)
Oct 31 16:31:08 black md: unbind

Apparently mdadm was not being friendly. I tried to reassemble the RAID, but was unable to. Finally, while reading the man pages, I tried this command:

# mdadm -Av -f /dev/md4 --uuid=`mdadm -E /dev/sdc1 | grep UUID | cut -f2,3,4,5 -d":" | cut -f2 -d" "` /dev/sd[c-f]1

The -A tells mdadm that you want to do an assemble. The -f forces the assemble regardless of whether the drives are marked dirty. The --uuid tells mdadm which array to assemble: when an array is created it is given a UUID, which is written into the superblock of every member drive, so here I'm pulling the UUID out of one of the drives and parsing it. The last part, after the --uuid flag, tells mdadm which drives and partitions to use.
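If the forced assemble works but the array comes up degraded, the drives that were kicked can usually be added back and resynced. This is only a rough sketch, using the same device names as above; check the drives first and adjust for your own setup:

# cat /proc/mdstat
# mdadm --detail /dev/md4
# mdadm /dev/md4 --add /dev/sde1
# mdadm /dev/md4 --add /dev/sdf1

The first two commands just confirm that md4 is running and show which members are active. The two --add commands put the kicked drives back into the array, and md rebuilds them in the background; the progress shows up in /proc/mdstat.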

What is the Difference Between Software and Hardware RAID?

I have gotten into many arguments about this one. My first true RAID card was a 3ware Escalade 6410, bought after I returned a cheap Promise card. When I wanted to return the card I was asked why, and I told the Tech Support Specialist at Promise. They kept telling me that it was in fact hardware RAID. I told them that if it were true hardware RAID, the card would do the RAIDing. Again, they insisted that the chip does it. Perhaps it does some, but the vast majority of the RAIDing is done in the kernel driver. With the RAIDing done in the driver, my CPU performance would take a hit. In the end they let me return the card, but still insisted that the RAIDing was done on the chip.

In May of this year, I attended an AMD & Microsoft Tech Tour. Asus was there, and I was looking into purchasing a new motherboard for a faster system. Again I got into this argument, this time with one of the Asus reps. His reasoning was that since the board has a RAID chip on it, the RAID operations must be done in hardware. I threw all the facts at him, but he did not budge. Whether it was ignorance or him trying to sell me a board, I will never know.

The difference between software and hardware RAID is so simple that I can't believe so many people think a chip on the motherboard labelled RAID must mean hardware RAID. Software RAID is when the RAID operations are done in a driver, stealing CPU cycles to perform the work. The member drives are still individually visible to the system (in the /dev directory on *nix operating systems; in Windows you'll never see them). With a true hardware RAID card, all of the operations are done on a processor built into the card, separate from the host CPU. The connected drives cannot be individually seen, because the RAID card is saying "Hey, I'm one big massive drive!"
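You can see this difference from a shell. With Linux software RAID, both the member drives and the array device show up; behind a true hardware card, the operating system only sees whatever the card exports. As a rough illustration on the software RAID described above:

# cat /proc/partitions
# mdadm --detail /dev/md4

/proc/partitions lists sdc1 through sdf1 as well as md4, and mdadm --detail shows every member. Behind the 3ware card, the kernel only sees the single drive the card exports; to talk to an individual member you have to go through the controller, for example with smartctl -d 3ware,0 -a /dev/twa0 or the vendor's own management tools.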

Regardless of whether you have hardware or software RAID, you will need some sort of driver; that is true of any device in a system. There are times when buying an expensive RAID card just isn't worth it. Other times it's a must.

Software RAID Pros

  1. You are not locked into a vendor. If a controller card fails, you are not at the mercy of the vendor: simply buy another card and run the utilities to reassemble the RAID.
  2. Faster, as the RAID operations run on the host CPU; the faster the CPU, the faster the operations complete.
  3. Greater flexibility. With software RAID you can add drives and change the configuration more easily (see the sketch after this list).
  4. Very cheap. The cost of a simple controller ranges from $15 to about $150, depending on how many channels you want.
  5. The feature set is not locked in stone. If a new RAID level is developed, it's as easy as compiling the driver, adding it to the running kernel, and reconfiguring an existing array without any data loss.
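As an example of that flexibility, this is roughly what growing an md RAID 5 onto one more drive looks like (the device names are hypothetical, a reshape like this takes hours, and you want a backup before trying it):

# mdadm /dev/md4 --add /dev/sdg1
# mdadm --grow /dev/md4 --raid-devices=5

Once the reshape finishes, the extra space shows up on md4; the filesystem on top of it still has to be grown separately with resize2fs, xfs_growfs, or whatever matches your filesystem.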

Software RAID Cons

  1. If there is a glitch in the driver, data may be lost.
  2. Overall performance. The more complex your RAID is, the more cycles are stolen from the host CPU.
  3. May be affected by viruses or other harmful software.
  4. Vulnerable to system crashes. If the system crashes at the wrong moment, you could take out the entire RAID (a monitoring sketch follows this list).
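None of these risks can be engineered away completely, but for the driver-glitch and crash scenarios it at least helps to know right away when md has kicked a drive, instead of running degraded without noticing. A minimal monitoring sketch (the mail address is a placeholder):

# mdadm --monitor --scan --daemonise --mail=root@localhost

Many distributions ship an init script that starts this monitor for you; it only helps if the machine can actually deliver the mail.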

Hardware RAID Pros

  1. The calculations are done on the card and not on the host CPU.
  2. Disks can be managed across different operating systems.
  3. A battery backup can be added to improve write performance.
  4. If there is a power loss, the data can be held in the card's memory until power is restored.
  5. Not vulnerable to viruses or other harmful software.
  6. Uses dedicated controller memory. Depending on the card, it may be possible to upgrade the memory.

Hardware RAID Cons

  1. May not be as fast as software depending on the current hardware configuration.
  2. Expensive.
  3. The RAID engine is proprietary, so if the card fails you have to replace it with the same card, or at least a card from the same vendor.
  4. Bad firmware. I have to add this just to be fair. However, it's more likely that you're going to have bad software before you have bad firmware.

Conclusion

Hardware versus software RAID can be a religious battle; which one you choose depends on what you need and how important your data is. Since my first 3ware card I've always sworn by them. I still have my Escalade 6410 card, and it still happily does the job it did when I first bought it six years ago. I have never had a problem using hardware RAID as opposed to software. At work I've set up RAIDs for customers, and to be absolutely honest, it's a crapshoot: sometimes it works great, other times it doesn't do the job it's supposed to.

If you use RAID 0 or 1 (striping or mirroring, respectively), then a simple Silicon Image, Promise, Highpoint, or any other RAID card you can buy for under $100 will do the job just fine. There is little performance hit at those levels. However, when doing RAID 5, 5EE, 6, or 50 there is a lot more overhead because of the parity calculations involved in spreading the data among the drives. The more drives you have, the more of a performance hit you will see. Since I run RAID 5 for my data, I chose an AMCC 3ware 9650-8LPML card. Currently I only have 4 drives, but when the price of 500GB drives falls I will be adding 4 more. Since switching, I have noticed that my CPU is no longer running at 100% when I'm transferring massive amounts of data.

If you do choose to go software with your RAID 5, prepare for a performance hit when copying files. If the OS crashes, be prepared for possible data loss, and be prepared to reassemble the RAID, which may take some time if it doesn't behave like it should. In a business environment you should really only use hardware RAID.

So which is better? Depends on the application.