All hardware eventually fails. This is one of the painful side effects of entropy in our universe. For most of the types of hardware used in modern infrastructure, the loss of a single component usually incurs some amount of downtime. Other than the time taken to swap out something like a bad CPU or stick of RAM, sysadmins or users rarely see many long-term ill effects. But unless an admin takes particular care with storage, data loss from disk failures can have immediate and lasting consequences.
Take a user’s desktop as an example: If they store their data locally on a single drive, then when the drive inevitably fails, their data will be lost. The same is true no matter the quality, brand, or type of drive. Of course, there are data recovery outfits that would be happy to take hard-earned cash in exchange for the possibility of resurrecting bits from dead drives. Unfortunately, the cost quickly becomes exorbitant, and even those specialists fall short at some point.
Administrators have a number of options at their disposal to fend off looming disaster: RAID, backups, clusters of networked storage, etc. Often these options are used in conjunction to provide layers of data protection and multiple opportunities to stop an issue before it is too late. Building redundant arrays of disks and abstracting the storage away from single drives is the simplest and best way to remove these single points of failure. The goal is to avoid late nights and long weekends restoring from backups (that, hopefully, someone has been making), or paying the extreme fees that recovery companies charge.
What is RAID?
Redundant Arrays of Inexpensive Disks (RAID) is one of the most widely used and effective storage technologies a sysadmin will come across. Being comfortable with its most common implementations is vital. RAID can be offered as a software solution through an operating system utility like mdadm in Linux, a hardware RAID controller like the MegaRAID line of cards, or even chipsets that give pseudo-RAID capabilities. Hardware RAID controllers like those in the MegaRAID line should not be confused with host bus adapters (HBAs), though. HBAs are designed for simple and direct access to disks: they provide connectivity without the intelligence of a RAID controller and are consequently much less expensive.
At a high level, the concept of RAID is grouping a collection of drives into an array to write data across them. Depending on the configuration, the data will be written in different ways, with different amounts of parity information to help rebuild the data in case of a drive failure. While it’s possible to use different types, speeds, sizes, or connections for drives in an array, it’s best to make them match as much as possible. Differently-sized drives almost always end up carved down to the lowest common denominator, and drives of different speeds have to wait on the slowest.
Many admins do prefer to buy drives from different manufacturers, though, to avoid bad batches of drives causing concurrent failures across members of arrays.
Because RAID configurations are named in levels, the numbering scheme implies a linear scale of progression from one configuration to another, even though many of the levels are unrelated to each other. Each RAID level has pros and cons, and some levels are more useful than others. In the real world, the most common levels are 0, 1, 5, 6, 10, 50, and 60. RAID levels 2, 3, 4, and a few others also exist but are proprietary, obsolete or rarely ever used. That may sound like a lot, but when broken down this information becomes more easily digested.
Most RAID levels fit a certain use case. RAID 0 is built with no internal redundancy: each disk contributes its full capacity to the array as usable storage. Because data is split up and written across all disks in parallel, reads and writes on an array configured this way can be very fast, with throughput scaling roughly linearly with the number of disks in the array.
Although you can technically make a single disk into a RAID 0 array, in practice you’d build one from at least a pair of disks. The major downside of RAID 0 is that if any single disk fails, the whole array fails and the data is lost. This configuration is not suitable for production use unless the data also lives on another easily accessible system. RAID 0 can be a perfectly reasonable setup for an end-user workstation that needs high performance, though, as long as that workstation is not the only home for the data being worked on.
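As a rough illustration of striping, the sketch below splits a byte string into fixed-size chunks and deals them out to disks in round-robin order. The function names, the in-memory lists standing in for disks, and the 4-byte chunk size are assumptions made for this example; real implementations work on block devices with much larger chunks.

```python
def stripe(data: bytes, num_disks: int, chunk_size: int = 4) -> list[list[bytes]]:
    """Deal fixed-size chunks of data out to num_disks in round-robin order."""
    disks: list[list[bytes]] = [[] for _ in range(num_disks)]
    for offset in range(0, len(data), chunk_size):
        chunk_index = offset // chunk_size
        disks[chunk_index % num_disks].append(data[offset:offset + chunk_size])
    return disks


def unstripe(disks: list[list[bytes]]) -> bytes:
    """Reassemble the data by reading one chunk from each disk in turn."""
    chunks = []
    for row in range(max(len(disk) for disk in disks)):
        for disk in disks:
            if row < len(disk):
                chunks.append(disk[row])
    return b"".join(chunks)


data = b"ABCDEFGHIJKLMNOPQRSTUVWX"
disks = stripe(data, num_disks=3)
for number, disk in enumerate(disks):
    print(number, disk)

# Every disk is needed to read the data back in full.
assert unstripe(disks) == data
```

Each simulated disk holds only every third chunk, which is why losing any one of them leaves gaps throughout the data rather than a cleanly missing tail.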
RAID 1 was designed with a totally different goal from RAID 0. Instead of striping data across a set of drives for speed without any protection, RAID 1 gives an administrator the ability to mirror data across two or more drives for resiliency. This RAID level keeps a local copy (or copies) of the data to protect against a single drive failing. After a failed drive is replaced, the array uses the data on the healthy drives to rebuild the replacement.
Usually, RAID 1 mirrors consist of a pair of drives, but they can contain three or more, depending on how many copies of each block the admin needs to have online. What’s important to point out is that this is not a backup. The mirrored data is a live copy of the drive in a system and does not provide the safeguards of a regular backup system. These mirrors are 1:1 clones, so the drives need to be the same size, or space will be forfeited to accommodate the smallest drive in the set.
Regardless of the number of disks added to a RAID 1 array, the total capacity stays the same: the size of a single disk in the array (the smallest, if they’re not identical). What increases with each additional disk is the number of copies of the data. Each disk is another clone, providing further protection from individual drive failure.
There are limits on how many disks can be added to an array, based on the software being used, and/or the controller they are attached to.
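The mirror capacity rule described above can be written down in a couple of lines. This is only a sketch of the arithmetic; the function name and the use of whole terabytes are assumptions for illustration.

```python
def raid1_usable_capacity(disk_sizes_tb: list[float]) -> float:
    """Usable space of a mirror: the size of its smallest member."""
    if len(disk_sizes_tb) < 2:
        raise ValueError("a mirror needs at least two disks")
    return min(disk_sizes_tb)


print(raid1_usable_capacity([4, 4]))     # 4: a standard two-way mirror
print(raid1_usable_capacity([4, 4, 4]))  # 4: a third copy adds no space
print(raid1_usable_capacity([2, 4]))     # 2: the larger disk forfeits 2TB
```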
RAID 2 through 4
RAID levels 2, 3, and 4 are obsolete, proprietary, or very rare. It is unlikely that many sysadmins will run across systems running any of these three configurations, and under normal circumstances, these can effectively be ignored. If you find yourself working on a system running any of these, your best bet is to read the vendor’s documentation to find out how best to manage it.
Where RAID 0 stripes data across a collection of drives without protection, and RAID 1 gains redundancy at the cost of capacity, RAID 5 offers a middle ground: it writes data across multiple drives while still providing a level of redundancy to the array. RAID 5 does this by distributing parity information across every drive, so that it can rebuild the contents of any single failed drive.
When using RAID 5, a new requirement comes into play: the array must include at least three disks. The capacity is then equal to the sum total of the disks, minus the size of one of them. For example, a RAID 5 with seven 2TB disks ends up being 12TB (7 x 2TB is 14TB, and subtracting one 2TB drive leaves 12TB).
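The same arithmetic as a small function, assuming equally sized disks. The function name is hypothetical and the sizes are in whole terabytes for simplicity.

```python
def raid5_usable_capacity(num_disks: int, disk_size_tb: float) -> float:
    """Total size of all disks minus one disk's worth of parity."""
    if num_disks < 3:
        raise ValueError("RAID 5 needs at least three disks")
    return (num_disks - 1) * disk_size_tb


print(raid5_usable_capacity(7, 2))  # 12, matching the seven-disk example
print(raid5_usable_capacity(3, 2))  # 4, the smallest possible array
```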
When one of those disks fails, an administrator can swap it out and have the system rebuild the replacement with data from the rest of the array, using the previously mentioned parity information. There are two main downsides to this configuration. First, there’s a hit to write performance, because there is overhead in writing all of those extra bits of parity information alongside the real data. Second, during a rebuild, the array is vulnerable to total loss if one of the healthy drives also fails. Depending on the machine’s workload, a rebuild can create a sudden spike in the activity of those drives and end up pushing one of the healthy drives into failure, too. This concern is why many admins now opt for RAID 6.
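To make the rebuild step concrete, here is a minimal sketch using XOR parity over a single stripe. The block contents and the helper name are made up for the example, and it ignores how a real array rotates parity between drives from one stripe to the next.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks plus one parity block (a four-disk array).
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_blocks)

# Simulate losing the disk that held the second data block.
survivors = [data_blocks[0], data_blocks[2], parity]

# XOR of everything that survived yields the missing block.
rebuilt = xor_blocks(survivors)
assert rebuilt == data_blocks[1]
```

The rebuild has to read the matching block from every surviving disk, which is where the extra load on the healthy drives comes from. If a second block in the stripe were missing, the XOR would no longer have enough information to recover either one.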
A natural evolution of RAID 5, RAID 6 takes the same basic concept and extends the "single drive" of parity information to a pair of drives. No individual drive is dedicated entirely to parity. Instead, the parity is spread across the array, and in total RAID 6 uses two drives’ worth of space to hold the parity bits.
Using an additional disk’s worth of space consequently means the minimum number of disks for a RAID 6 array goes up to four. This seemingly simple change makes a big difference when it comes to rebuilding a failed drive while still running the array. You can feel comfortable that one additional failure during the rebuild won’t mean a total loss of the data living locally on the machine.
Beyond levels 0, 1, 5, and 6, we find ourselves with the idea of nesting levels of RAID together to create novel configurations that offer new options for storage. The most prevalent and beneficial are 10, 50, and 60, which combine RAID 1, 5, and 6, respectively, with RAID 0.
A combination of 1 and 0 may sound like it should have just been RAID 5 all over again, but the best way to think of these nested levels is in two dimensions. For RAID 10 we take multiple RAID 1 arrays and stripe across them as if those arrays were disks, creating a RAID 0 array out of them. Because of this, RAID 10 requires at least four disks: two for a mirror, and a pair of those mirrors. What we get is an array with the speed of a RAID 0 that also benefits from the internal redundancy of a RAID 1. A RAID 10 array would only fail when one of the internal RAID 1 arrays fails entirely.
In each RAID 1 pair (or multiple-mirror if an admin so chooses), recovery is possible when disks need to be replaced, so a whole set of RAID 1 members would need to fail for the RAID 10 itself to succumb to data loss. Rebuilds are also different. In RAID 5, data must be read from all drives in the array to calculate new bits from the parity that’s been previously written. RAID 10, since it’s using RAID 1, reads from the clone(s) of the failed drive to rebuild it.
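The failure rule for RAID 10 can be expressed as a small check: the array survives as long as every mirror set still has at least one healthy member. The device names and the function below are hypothetical and serve only to illustrate the logic.

```python
def raid10_survives(mirrors: list[list[str]], failed: set[str]) -> bool:
    """True if every mirror set still has at least one healthy disk."""
    return all(
        any(disk not in failed for disk in mirror)
        for mirror in mirrors
    )


# Four disks: two mirrored pairs, striped together.
mirrors = [["sda", "sdb"], ["sdc", "sdd"]]

print(raid10_survives(mirrors, {"sda"}))         # True: sdb still holds the data
print(raid10_survives(mirrors, {"sda", "sdc"}))  # True: one loss in each pair
print(raid10_survives(mirrors, {"sda", "sdb"}))  # False: a whole mirror is gone
```

Note that the same number of failed disks can be survivable or fatal depending on which disks they are.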
Like RAID 10, RAID 50 gives us the option to create a fast array from redundant ones. We end up with a RAID 0 encompassing a number of RAID 5 arrays, similar to how RAID 10 was a set of RAID 1 arrays. This is where we start seeing a lot of disks enter the picture for even the simplest of setups. Since a basic RAID 5 requires three disks, a RAID 50 would require at least a total of six, since it’s at minimum a pair of RAID 5 arrays.
Again, similar to RAID 10, this option is the best of two worlds. RAID 50 gives us extra speed from the addition of more disks added in parallel, while still giving us the internal parity information from the RAID 5 configuration. A RAID 50 can withstand multiple drives failing, as long as they’re not within the same nested RAID 5 array.
At this point, RAID 60 should come as no surprise: it extends RAID 6 in the same way that RAID 50 extends RAID 5. The biggest added benefit is the speed gain, combined with the increased redundancy provided by the two sets of parity information in each nested array and the narrow scope of failure for each of those arrays. RAID 60 arrays start at eight drives, since each RAID 6 needs at least four disks and a RAID 60 needs at least two RAID 6 arrays.
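The disk counts for RAID 50 and RAID 60 follow from the rules for their nested arrays, and usable capacity can be worked out the same way. The sketch below assumes equally sized disks and equally sized nested arrays; the function and parameter names are invented for this example.

```python
def nested_parity_capacity(
    groups: int, disks_per_group: int, disk_size_tb: float, parity_disks: int
) -> float:
    """Usable space of a stripe over parity arrays.

    parity_disks is 1 for RAID 50 (nested RAID 5) and 2 for RAID 60
    (nested RAID 6).
    """
    min_per_group = parity_disks + 2  # 3 for RAID 5, 4 for RAID 6
    if groups < 2:
        raise ValueError("need at least two nested arrays to stripe across")
    if disks_per_group < min_per_group:
        raise ValueError(f"each nested array needs at least {min_per_group} disks")
    return groups * (disks_per_group - parity_disks) * disk_size_tb


# Smallest RAID 50: two RAID 5 arrays of three 2TB disks (six disks total).
print(nested_parity_capacity(2, 3, 2, parity_disks=1))  # 8

# Smallest RAID 60: two RAID 6 arrays of four 2TB disks (eight disks total).
print(nested_parity_capacity(2, 4, 2, parity_disks=2))  # 8
```

In this example, the minimal RAID 60 spends two more disks than the minimal RAID 50 for the same usable space, which is the price of tolerating two failures per nested array instead of one.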
RAID vs. backups
One of the most commonly espoused sayings in the realm of system administration seems to be: "RAID is not a backup." For new admins or those who don’t spend much time thinking about storage, this fact may not be immediately obvious. It may even seem antagonistic or flat out wrong.
The issue comes from the fact that the redundancy in RAID configurations is built with the same goal in mind as backups: fighting against data loss. The reason it’s so important to talk about the difference is not to nitpick, but to remind ourselves that these tools exist to provide us with layers of protection, and by lumping them together we do ourselves a disservice.
RAID exists to provide an immediate, live copy of data, a crutch that helps a running machine pick itself back up after it stumbles. Backups, on the other hand, offer an opportunity to test our ability to restore a machine to a working state or to recover data without needing the machine to be running. Backups also give us benefits that RAID does not, including the ability to push copies to multiple places on multiple types of media, and to save multiple versions.
RAID and backups fill different roles, but both are important, and neither should be neglected.