Hard disk capacities are rapidly increasing, and so are RAID rebuild times. The problem is particularly apparent with SATA drive technology, but as 300 GB Fibre Channel drives become more prevalent, even Fibre Channel-based arrays are suffering from long rebuild times. Many storage administrators are using fewer and fewer drives in an array group. Although using larger capacity drives may make economic sense, fewer drives equate to longer rebuild times. In a recent test we conducted, a RAID 5 array with five 500 GB SATA drives took approximately 24 hours to rebuild. With nine 500 GB drives and almost the exact same data set, it took less than eight hours. Below are some tips to help you control RAID rebuild times.
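To put the test figures in perspective, the short sketch below converts them into an effective rebuild rate and projects that rate onto a larger drive. It assumes decimal units (1 GB = 1,000 MB) and assumes the rate stays constant as capacity grows, which real arrays may not do. The function names are illustrative only, and the eight-hour figure is used as an upper bound because the test finished in less than that.

```python
def effective_rebuild_rate_mb_s(drive_gb: float, rebuild_hours: float) -> float:
    """Average rate at which one replacement drive was rebuilt, in MB/s."""
    return drive_gb * 1000 / (rebuild_hours * 3600)


def projected_rebuild_hours(drive_gb: float, rate_mb_s: float) -> float:
    """Hours to rebuild a drive of the given size at a constant rate (an assumption)."""
    return drive_gb * 1000 / rate_mb_s / 3600


# Figures from the test described above: 500 GB drives, 24 hours versus 8 hours.
five_drive_rate = effective_rebuild_rate_mb_s(500, 24)
nine_drive_rate = effective_rebuild_rate_mb_s(500, 8)

print(f"Five-drive array: about {five_drive_rate:.1f} MB/s")
print(f"Nine-drive array: at least {nine_drive_rate:.1f} MB/s")
print(f"2 TB drive at the five-drive rate: {projected_rebuild_hours(2000, five_drive_rate):.0f} hours")
```

Even the faster configuration rebuilds at a small fraction of what a drive can sustain sequentially, which suggests the rebuild is competing with other work rather than being limited by raw disk speed.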
Don't deploy RAID blindly.
Understand exactly what the rebuild times are and plan for the best implementation. When you evaluate a system, load it with data, simulate a drive failure and measure the rebuild time. Find out how each supplier implements the technology, from how drive failures are handled to the amount of processing power available to assist with the RAID recalculation.
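When you run a simulated failure, you don't have to wait for the rebuild to finish to get a first estimate. The sketch below extrapolates the total rebuild time from progress readings you record by hand. It assumes the array reports a percentage complete and that progress is roughly linear, which may not hold under a changing production load. The function name and sample values are hypothetical.

```python
def estimate_total_rebuild_hours(samples):
    """Extrapolate total rebuild time from (elapsed_hours, fraction_complete) readings.

    Uses only the first and last readings and assumes linear progress.
    """
    if len(samples) < 2:
        raise ValueError("need at least two progress readings")
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    if t1 <= t0 or p1 <= p0:
        raise ValueError("readings must show time passing and progress being made")
    rate_per_hour = (p1 - p0) / (t1 - t0)
    return t1 + (1.0 - p1) / rate_per_hour


# Hypothetical readings: 5% complete after one hour, 10% after two.
readings = [(0.0, 0.0), (1.0, 0.05), (2.0, 0.10)]
print(f"Estimated total rebuild time: {estimate_total_rebuild_hours(readings):.1f} hours")
```

Treat the result as a planning figure only, and repeat the measurement under a realistic workload before you rely on it.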
Virtualize your data storage systems.
Virtualized storage systems use a much higher number of drives in the volume group. As a result, a single drive failure can be recovered very quickly because the rebuild I/O load is distributed across many other drives. Rebuilds can be as much as six times faster. These virtualized systems can often begin a volume rebuild sooner because they deal with data at a much finer level of granularity than a traditional array. Virtualized systems also exact a much lower performance impact on the overall system as the rebuild occurs, typically less than 1.5% additional load.
If you can't work with a virtualized system, then consider mirroring. Although mirroring -- or RAID 1 -- requires more capacity than other RAID configurations, capacity is cheap. Your data, or at least the cost to recreate it, isn't. Instead of running all of your volumes as RAID 5 or RAID 6, explore putting your critical data on a mirrored volume. Overall performance is faster on a mirrored volume and there's no RAID recalculation needed to regain protection. If you don't have the space, this may be an excellent time to explore data archiving to clear off that static (persistent) data so you can improve protection of the primary data store and increase your efficiency in data backup and application performance.
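To see what mirroring costs in capacity, the sketch below compares usable space for the same set of drives under each scheme. It uses the standard overheads: half the drives for mirroring, one drive's worth for RAID 5 parity and two for RAID 6. It assumes identical drives, treats "raid1" as a set of mirrored pairs, and ignores hot spares and formatting overhead. The function name is illustrative.

```python
def usable_capacity_gb(level: str, drives: int, drive_gb: int) -> int:
    """Usable capacity for a group of identical drives, before formatting overhead."""
    if level == "raid1":
        if drives < 2 or drives % 2:
            raise ValueError("mirroring needs an even number of drives (at least 2)")
        return drives // 2 * drive_gb  # every drive has a mirror partner
    if level == "raid5":
        if drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return (drives - 1) * drive_gb  # one drive's worth of parity
    if level == "raid6":
        if drives < 4:
            raise ValueError("RAID 6 needs at least 4 drives")
        return (drives - 2) * drive_gb  # two drives' worth of parity
    raise ValueError(f"unsupported RAID level: {level}")


# Example: eight 500 GB drives under each scheme.
for level in ("raid1", "raid5", "raid6"):
    print(level, usable_capacity_gb(level, 8, 500), "GB usable")
```

With eight drives, the gap between mirroring and RAID 6 is two drives' worth of capacity, which is the figure to weigh against the cost of recreating the data.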
Use RAID 6 with caution.
Dual-parity RAID, or RAID 6, can be a solution, but it requires care. RAID 6 will improve rebuild times over the short term, but as drive capacities reach 2 TB or greater, the problem will return. Even with today's technology, an eight-hour rebuild window may not be acceptable. An additional consideration is that a RAID rebuild is stressful on the remaining drives in the array, so the chance of a second failure during the rebuild window increases. If that occurs, you compound both the risk of further failures and the risk to the integrity of the data you're trying to protect.
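A rough way to see why the length of the rebuild window matters is to model drive failures as independent events with a constant failure rate. The sketch below does that. The MTBF value is purely an assumed figure for illustration, not a number from our testing. The model also ignores the extra stress a rebuild places on the surviving drives, as well as unrecoverable read errors, so real-world risk is likely higher than it suggests.

```python
import math


def second_failure_probability(remaining_drives: int,
                               rebuild_hours: float,
                               mtbf_hours: float) -> float:
    """Chance that at least one surviving drive fails during the rebuild window.

    Assumes independent drives with a constant (exponential) failure rate.
    """
    exposure = remaining_drives * rebuild_hours / mtbf_hours
    return -math.expm1(-exposure)  # equals 1 - exp(-exposure), computed accurately


ASSUMED_MTBF_HOURS = 1_000_000  # hypothetical figure, for illustration only

for hours in (8, 24, 96):
    p = second_failure_probability(4, hours, ASSUMED_MTBF_HOURS)
    print(f"{hours:>3}-hour rebuild, 4 surviving drives: {p:.6%}")
```

Under these assumptions the risk grows almost in direct proportion to the rebuild window, so tripling the rebuild time roughly triples the exposure.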
Consider continuous data protection.
Continuous data protection (CDP) or near-CDP allows for an active copy of your critical data to be available on a separate physical array. While near-CDP isn't instant protection (it captures changes every 30 minutes, for example), it provides greater flexibility because it's not as reliant as CDP on the speed of the disk. With near-CDP, the second copy could be made to a SATA array, which would save costs while ensuring access to data in an active state. Data in an active state can be accessed directly without having to go through a recovery process. Look for a near-CDP solution that integrates into the backup process to keep process management costs down. RAID rebuild times aren't the only problem; recovery times are, too. Using an active target with CDP allows for in-place recovery. The result is that no data movement is required to bring a server or application back online.
Have a reliable backup.
It may sound old-fashioned, but sometimes everything goes wrong and the old standby has to work. Make sure that if you have to resort to your data backups, this recovery method will work. Backup to disk can improve not only the speed of the recovery but also the reliability of the backup. If all your other safety nets fail, you want this last one to work.
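One simple way to check that a restore actually works is to restore a sample file to a scratch location and compare its checksum with the original. The sketch below does this with SHA-256 from the Python standard library. The function names are illustrative and not part of any backup product, and the check only proves that the sampled files match.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches_source(source_path, restored_path):
    """True if the restored file is byte-for-byte identical to the source file."""
    return sha256_of(source_path) == sha256_of(restored_path)
```

Run the comparison against files that haven't changed since the backup was taken; otherwise a mismatch may reflect legitimate edits rather than a bad restore.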
Rebuilding RAID 5 arrays is going to continue to take longer as the capacities of the individual drives continue to grow. Some of the alternate strategies described here may prove to be faster and easier methods to get back online.
About the author: George Crump is Founder of Storage Switzerland, an analyst firm focused on the virtualization and storage marketplaces. An industry veteran of over 25 years, he has held engineering and sales positions at various IT industry manufacturers and integrators. Prior to Storage Switzerland, George was CTO at one of the nation's largest integrators.