RAID
(Redundant Arrays of
Inexpensive Disks)
In 1987, Patterson, Gibson, and Katz at the University of California, Berkeley published a paper entitled "A Case for Redundant Arrays of Inexpensive Disks (RAID)". It described various types of disk arrays, referred to by the acronym RAID. The basic idea of RAID is to combine multiple small, inexpensive disk drives into an array of disk drives that yields performance exceeding that of a Single Large Expensive Drive (SLED). Additionally, the array of drives appears to the computer as a single logical storage unit or drive. With a SLED, reliability is a big concern because the data on the entire disk may be lost at once. As the number of disks in a system increases, the probability of a failure also increases. Suppose a (reliable) disk fails once every 100,000 hrs. The mean time to failure of an array of N disks is the mean time to failure of one disk divided by N; for N = 100, that is 100,000 hrs / 100 = 1,000 hrs, or about 41.66 days. RAID can improve availability and throughput (although raw reliability, i.e. whether anything is broken, actually suffers because of the larger number of disks). Data is stored on several disks instead of a single disk. RAID is a technology that enables greater levels of performance, reliability, and/or larger volumes when dealing with data. Disks are small (physically) and cheap, so it is easy to put lots of disks (tens to hundreds) in one box for increased storage, performance, and availability. Data plus some redundant information is striped across the disks in some way. RAID is the standard way of organizing disks and classifying the reliability of multi-disk systems. The general methods are data duplication, parity, and error-correcting codes (ECC).
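To make the failure arithmetic above concrete, here is a minimal Python sketch; the 100-disk array size and the 100,000-hour per-disk figure are just the illustrative numbers from the text:

# Rough mean time to failure (MTTF) of an array with no redundancy:
# the array fails when any one disk fails, so MTTF shrinks by a factor of N.
DISK_MTTF_HOURS = 100_000   # one (reliable) disk fails every 100,000 hrs
N_DISKS = 100               # illustrative array size from the text

array_mttf_hours = DISK_MTTF_HOURS / N_DISKS
print(array_mttf_hours)        # 1000.0 hours
print(array_mttf_hours / 24)   # ~41.66 days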
The RAID idea: use redundancy to improve performance and reliability, treating a redundant array of cheap disks as one storage unit.
Fast: read and write the disks in the array simultaneously.
Reliable: use parity to detect and correct errors.
RAID can have different redundancy levels, achieving different performance and reliability. There are seven standard RAID levels (0-6). The basic idea is to connect multiple disks together to provide large storage capacity, faster access to data, and redundancy. The different RAID levels offer differing degrees of redundancy, error checking, capacity, and cost.
Where can I use RAID?
• LANs/WANs
• SANs
• Clustering environments
• Mission-critical installations
• News centers
• Internet News Servers
• Enterprise Servers
• Performance Desktop Systems
• PC Workstations
• Workgroup/File Servers
• E-Mail Servers
• Intranet/Web Servers
• Application Servers
RAID
Applications
• Transfer rate-intensive applications (typically RAID 0 environments):
  - Striping is ideal for transfer rate-intensive environments.
  - A transfer rate-intensive environment consists of applications that require a large amount of data to be processed in a fixed amount of time.
  - Video playback and video editing are typical transfer rate-intensive environments, as are photo processing, manipulation, and rendering.
• Request rate-intensive applications (typically RAID 5 environments):
  - RAID is used in highly multi-tasking, request rate-intensive environments.
  - A request rate-intensive environment consists of databases and file/web servers: a high number of small random requests.
  - A RAID array can be configured to process each request within a stripe, allowing multiple requests to be processed in parallel.
Improvement
of Reliability via Redundancy
It has two parts:
1. Mirroring
2. Data Striping
Mirroring works as follows:
• Duplicate every disk.
• A logical disk consists of two physical disks.
• Every write is carried out on both disks.
• If one of the disks fails, data is read from the other.
• Data is permanently lost only if the second disk fails before the first failed disk is replaced.
Example of mirroring in RAID: suppose the mean time to repair is 10 hrs. The mean time to data loss of a mirrored disk system is then 100,000^2 / (2 * 10) hrs, or roughly 57,000 years.
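A quick check of that figure as a minimal Python sketch, using the same assumptions as above (100,000-hour disk MTTF, 10-hour mean time to repair):

# Mean time to data loss (MTTDL) for a mirrored pair:
# data is lost only if the second disk dies while the first is being repaired.
MTTF_HOURS = 100_000
MTTR_HOURS = 10

mttdl_hours = MTTF_HOURS ** 2 / (2 * MTTR_HOURS)
print(mttdl_hours)                # 500,000,000 hours
print(mttdl_hours / (24 * 365))   # ~57,000 years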
Parallel Disk Systems
• We cannot improve disk performance significantly with a single drive, but many applications require high-performance storage systems.
Solution:
- Parallel disk systems, which offer higher reliability and higher data-transfer rates.
DATA
STRIPING
Data striping is fundamental to RAID:
• It is a method of concatenating multiple drives into one logical storage unit.
• Bit-level striping: split the bits of each byte across multiple disks (see the sketch after this list).
• E.g., in an array of eight disks, write bit i of each byte to disk i.
• Sectors are then effectively eight times the normal size, with eight times the access rate.
• Similarly, striping blocks of a file across disks gives block-level striping.
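A minimal Python sketch of the bit-level striping idea, shown at the level of individual bytes; a real controller works on whole sectors, so this is an illustration only:

# Bit-level striping across 8 disks: bit i of each byte goes to disk i.
def stripe_bits(data: bytes):
    disks = [[] for _ in range(8)]          # one bit stream per disk
    for byte in data:
        for i in range(8):
            disks[i].append((byte >> i) & 1)
    return disks

def unstripe_bits(disks):
    out = []
    for bits in zip(*disks):                # one bit from each disk per byte
        out.append(sum(bit << i for i, bit in enumerate(bits)))
    return bytes(out)

data = b"RAID"
assert unstripe_bits(stripe_bits(data)) == data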
RAID
LEVELS
The RAID level defines how the disks are organized and how reliability and performance are improved.
• Data are distributed across the array of disk drives.
• Redundant disk capacity is used to store parity information, which guarantees data recoverability in case of a disk failure.
• The levels correspond to different schemes that provide redundancy at lower cost by using striping and "parity" bits.
Level 0
This level offers no redundancy: no extra data is kept, and its performance is the best of any level. Throughput is increased by striping data across several disks; data is split among two or more disks. It provides good performance, but the lack of data redundancy means there is no failover support with this configuration. With two disks, for example, the odd-numbered blocks are written to disk 0 and the even-numbered blocks to disk 1, so that A1, A2, A3, A4 would be the order of blocks read sequentially from the beginning. It is used in read-only NFS systems and gaming systems.
RAID Level 0: Striping
The first
RAID level is actually not a RAID level at all, in that there is no redundancy.
However, RAID level 0, or striping as it is better known, serves as an
excellent upper-bound on performance and capacity and thus is worth
understanding. The simplest form of striping will stripe blocks across the
disks of the system as follows (assume here a 4-disk array):
Disk 0   Disk 1   Disk 2   Disk 3
 0        1        2        3
 4        5        6        7
 8        9        10       11
 12       13       14       15
Table 38.1: RAID-0: Simple Striping
From
Table 38.1, you get the basic idea: spread the blocks of the array across the
disks in a round-robin fashion. This approach is designed to extract the most
parallelism from the array when requests are made for contiguous chunks of the
array (as in a large, sequential read, for example). We call the blocks in the
same row a stripe; thus, blocks 0, 1, 2, and 3 are in the same stripe above. In
the example, we have made the simplifying assumption that only 1 block (each of
say size 4KB) is placed on each disk before moving on to the next. However,
this arrangement need not be the case. For example, we could arrange the blocks
across disks as in Table 38.2:
Disk 0   Disk 1   Disk 2   Disk 3
 0        2        4        6      chunk size:
 1        3        5        7      2 blocks
 8        10       12       14
 9        11       13       15
Table 38.2: Striping with a Bigger Chunk Size
In this
example, we place two 4KB blocks on each disk before moving on to the next
disk. Thus, the chunk size of this RAID array is 8KB, and a stripe thus
consists of 4 chunks or 32KB of data.
ASIDE: THE RAID MAPPING PROBLEM
Before
studying the capacity, reliability, and performance characteristics of the
RAID, we first present an aside on what we call the mapping problem. This
problem arises in all RAID arrays; simply put, given a logical block to read or
write, how does the RAID know exactly which physical disk and offset to access?
For these simple RAID levels, we do not need much sophistication in order to
correctly map logical blocks onto their physical locations. Take the first
striping example above (chunk size = 1 block = 4KB). In this case, given a
logical block address A, the RAID can easily compute the desired disk and offset
with two simple equations:
Disk   = A % number_of_disks
Offset = A / number_of_disks
Note that these are all integer operations (e.g., 4 / 3 = 1, not 1.33333...). Let's see how these equations work for a simple example. Imagine in the first RAID above that a request arrives for block 14. Given that there are 4 disks, this would mean that the disk we are interested in is (14 % 4 = 2): disk 2. The exact block is calculated as (14 / 4 = 3): block 3. Thus, block 14 should be found on the fourth block (block 3, starting at 0) of the third disk (disk 2, starting at 0), which is exactly where it is. You can think about how these equations would be modified to support different chunk sizes. Try it! It's not too hard.
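Here is a small Python sketch of that mapping, extended to an arbitrary chunk size as the exercise suggests; with a chunk size of 1 it reduces to the two equations above (the function name is just for illustration):

def raid0_map(logical_block, num_disks, chunk_size=1):
    """Map a logical block address to (disk, physical offset) in a striped array."""
    stripe = logical_block // (num_disks * chunk_size)    # which full stripe
    within = logical_block % (num_disks * chunk_size)     # position within that stripe
    disk   = within // chunk_size                         # which disk's chunk
    offset = stripe * chunk_size + within % chunk_size    # block number on that disk
    return disk, offset

# chunk size 1: block 14 on a 4-disk array -> disk 2, offset 3 (as in the text)
assert raid0_map(14, 4) == (2, 3)
# chunk size 2 (Table 38.2): block 14 -> disk 3, offset 2
assert raid0_map(14, 4, chunk_size=2) == (3, 2)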
Chunk
Sizes
Chunk
size mostly affects performance of the array. For example, a small chunk size
implies that many files will get striped across many disks, thus increasing the
parallelism of reads and writes to a single file; however, the positioning time
to access blocks across multiple disks increases, because the positioning time
for the entire request is determined by the maximum of the positioning times of the
requests across all drives. A big chunk size, on the other hand, reduces such
intra-file parallelism, and thus relies on multiple concurrent requests to
achieve high throughput. However, large chunk sizes reduce positioning time;
if, for example, a single file fits within a chunk and thus is placed on a
single disk, the positioning time incurred while accessing it will just be the
positioning time of a single disk. Thus, determining the “best” chunk size is
hard to do, as it requires a great deal of knowledge about the workload
presented to the disk system [CL95]. For the rest of this discussion, we will
assume that the array uses a chunk size of a single block (4KB). Most arrays
use larger chunk sizes (e.g., 64 KB), but for the issues we discuss below, the
exact chunk size does not matter; thus we use a single block for the sake of
simplicity.
Back To
RAID-0 Analysis
Let
us now evaluate the capacity, reliability, and performance of striping. From
the perspective of capacity, it is perfect: given N disks, striping delivers N
disks worth of useful capacity. From the standpoint of reliability, striping is
also perfect, but in the bad way: any disk failure will lead to data loss.
Finally, performance is excellent: all disks are utilized, often in parallel,
to service user I/O requests.
Evaluating
RAID Performance
In
analyzing RAID performance, one can consider two different performance metrics.
The first is single-request latency. Understanding the latency of a single I/O
request to a RAID is useful as it reveals how much parallelism can exist during
a single logical I/O operation. The second is steady-state throughput of the
RAID, i.e., the total bandwidth of many concurrent requests. Because RAIDs are
often used in high-performance environments, the steady-state bandwidth is
critical, and thus will be the main focus of our analyses. To understand
throughput in more detail, we need to put forth some workloads of interest. We
will assume, for this discussion, that there are two types of workloads:
sequential and random. With a sequential workload, we assume that requests to
the array come in large contiguous chunks; for example, a request (or series of
requests) that accesses 1 MB of data, starting at block (B) and ending at block
(B + 1 MB), would be deemed sequential. Sequential workloads are common in many
environments (think of searching through a large file for a keyword), and thus
are considered important. For random workloads, we assume that each request is
rather small, and that each request is to a different random location on disk.
For example, a random stream of requests may first access 4KB at logical
address 10, then at logical address 550,000, then at 20,100, and so forth. Some
important workloads, such as transactional workloads on a database management system
(DBMS), exhibit this type of access pattern, and thus it is considered an
important workload. Of course, real workloads are not so simple, and often have
a mix of sequential and random-seeming components as well as behaviors in
between the two. For simplicity, we just consider these two possibilities. As
you can tell, sequential and random workloads will result in widely different
performance characteristics from a disk. With sequential access, a disk
operates in its most efficient mode, spending little time seeking and waiting
for rotation and most of its time transferring data. With random access, just
the opposite is true: most time is spent seeking and waiting for rotation and
relatively little time is spent transferring data. To capture this difference
in our analysis, we will assume that a disk can transfer data at S MB/s under a
sequential workload, and R MB/s when under a random workload. In general, S is
much greater than R. To make sure we understand this difference, let's do a simple exercise. Specifically, let's calculate S and R given the following disk characteristics. Assume a sequential transfer of size 10 MB on average, and a random transfer of 10 KB on average. Also, assume these disk characteristics:
Average
seek time 7 ms. Average rotational delay
3 ms. Transfer rate of disk: 50 MB/s. To compute S, we need to first figure out
how time is spent in a typical 10 MB transfer. First, we spend 7 ms seeking,
and then 3 ms rotating. Finally, transfer begins; 10 MB @ 50 MB/s leads to
1/5th of a second, or 200 ms, spent in transfer. Thus, for each 10 MB request,
we spend 210 ms completing the request. To compute S, we just need to divide:
S = Amount of Data / Time to access = 10 MB / 210 ms = 47.62 MB/s
As we can see, because of the large time spent transferring data, S is very near the peak bandwidth of the disk (the seek and rotational costs have been amortized). We can compute R similarly. Seek and rotation are the same; we then compute the time spent in transfer, which is 10 KB @ 50 MB/s, or 0.195 ms.
R = Amount of Data / Time to access = 10 KB / 10.195 ms = 0.981 MB/s
As we can
see, R is less than 1 MB/s, and S/R is almost 50.
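The same calculation as a small Python sketch, using the disk parameters assumed above (the random-transfer result comes out near the 0.981 MB/s quoted, with small differences due to rounding):

# Back-of-the-envelope S and R for the disk assumed above.
SEEK_MS   = 7.0
ROTATE_MS = 3.0
PEAK_MBPS = 50.0                           # raw transfer rate

def effective_bandwidth(transfer_mb):
    transfer_ms = transfer_mb / PEAK_MBPS * 1000.0
    total_ms    = SEEK_MS + ROTATE_MS + transfer_ms
    return transfer_mb / (total_ms / 1000.0)

S = effective_bandwidth(10.0)    # 10 MB sequential request -> ~47.6 MB/s
R = effective_bandwidth(0.01)    # 10 KB random request     -> ~0.98 MB/s
print(round(S, 2), round(R, 3), round(S / R))   # S/R is roughly 50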
Back To
RAID-0 Analysis, Again
Let’s now
evaluate the performance of striping. As we said above, it is generally good.
From a latency perspective, for example, the latency of a single-block request
should be just about identical to that of a single disk; after all, RAID-0 will
simply redirect that request to one of its disks. From the perspective of
steady-state throughput, we’d expect to get the full bandwidth of the system.
Thus, throughput equals N (the number of disks) multiplied by S (the sequential
bandwidth of a single disk). For a large number of random I/Os, we can again
use all of the disks, and thus obtain N · R MB/s. As we will see below, these
values are both the simplest to calculate and will serve as an upper bound in
comparison with other RAID levels.
RAID Level 1
Key characteristics:
• A complete file is stored on a single disk.
• A second disk contains an exact copy of the file.
• Provides complete redundancy of data.
• Read performance can be improved: file data can be read in parallel.
• Write performance suffers: the data must be written out twice.
• Most expensive RAID implementation: it requires twice as much storage space.
RAID Level
1: Mirroring
Our first
RAID level beyond striping is known as RAID level 1, or mirroring. With a mirrored system, we simply make more than
one copy of each block in the system; each
copy should be placed on a separate disk,
of course. By doing so, we can tolerate disk failures. In a typical mirrored
system, we will assume that for each logical block,
the RAID keeps two physical copies of it. Here is an example:
Disk 0 Disk 1 Disk 2 Disk 3
0 0 1 1
2 2 3 3
4 4 5 5
6 6 7 7
Table
38.3: Simple RAID-1: Mirroring
In the
example, disk 0 and disk 1 have identical contents, and disk 2 and disk 3 do as
well; the data is striped across these mirror pairs. In fact, you may have
noticed that there are a number of different ways to place block copies across
the disks. The arrangement above is a common one and is sometimes called
RAID-10 (or RAID 1+0) because it uses mirrored pairs (RAID-1) and then stripes
(RAID-0) on top of them; another common arrangement is RAID-01 (or RAID 0+1),
which contains two large striping (RAID-0) arrays, and then mirrors (RAID-1) on
top of them. For now, we will just talk about mirroring assuming the above
layout. When reading a block from a mirrored array, the RAID has a choice: it can
read either copy. For example, if a read to logical block 5 is issued to the
RAID, it is free to read it from either disk 2 or disk 3. When writing a block,
though, no such choice exists: the RAID must update both copies of the data, in
order to preserve reliability. Do note, though, that these writes can take
place in parallel; for example, a write to logical block 5 could proceed to
disks 2 and 3 at the same time.
RAID-1 Analysis
Let us
assess RAID-1. From a capacity standpoint, RAID-1 is expensive; with the
mirroring level = 2, we only obtain half of our peak useful capacity. Thus,
with N disks, the useful capacity of mirroring is N/2. From a reliability
standpoint, RAID-1 does well. It can tolerate the failure of any one disk. You
may also notice RAID-1 can actually do better than this, with a little luck.
Imagine, in the figure above, that disk 0 and disk 2 both failed. In such a
situation, there is no data loss! More generally, a mirrored system (with
mirroring level of 2) can tolerate 1 disk failure for certain, and up to N/2
failures depending on which disks fail. In practice, we generally don’t like to
leave things like this to chance; thus most people consider mirroring to be
good for handling a single failure. Finally, we analyze performance. From the
perspective of the latency of a single read request, we can see it is the same
as the latency on a single disk; all the RAID-1 does is direct the read to one
of its copies. A write is a little different: it requires two physical writes
to complete before it is done. These two writes happen in parallel, and thus
the time will be roughly equivalent to the time of a single write; however,
because the logical write must wait for both physical writes to complete, it
suffers the worst-case seek and rotational delay of the two requests, and thus
(on average) will be slightly higher than a write to a single disk.
ASIDE: THE
RAID CONSISTENT-UPDATE PROBLEM
Before
analyzing RAID-1, let us first discuss a problem that arises in any multi-disk RAID system, known as the consistent-update
problem [DAA05]. The problem occurs on a
write to any RAID that has to update multiple disks during a single logical
operation. In this case, let us assume we are
considering a mirrored disk array. Imagine the write is issued to the RAID, and
then the RAID decides that it must be
written to two disks, disk 0 and disk 1. The RAID then issues the write to disk 0, but just before the RAID can issue the
request to disk 1, a power loss (or system crash)
occurs. In this unfortunate case, let us assume
that the request to disk 0 completed (but clearly the request to disk 1 did not, as it was never issued). The result of this
untimely power loss is that the two copies of the block are now inconsistent; the copy on disk 0 is the new version,
and the copy on disk 1 is the old. What we would
like to happen is for the state of both disks
to change atomically, i.e., either both should end up as the new version or neither. The general way to solve this problem is
to use a write-ahead log of some kind
to first record what the RAID is about to do (i.e., update two disks with a certain piece of data) before doing it. By taking this
approach, we can ensure that in the presence of a
crash, the right thing will happen; by running
a recovery procedure that replays all pending transactions to the RAID, we can ensure that no two mirrored copies (in the
RAID-1 case) are out of sync. One last note: because logging to disk on every
write is prohibitively expensive, most RAID hardware
includes a small amount of non-volatile RAM
(e.g., battery-backed) where it performs this type of logging. Thus, consistent update is provided without the high cost of
logging to disk. To analyze steady-state throughput, let us start with the
sequential workload. When writing out to disk
sequentially, each logical write must result
in two physical writes; for example, when we write logical block 0 (in the figure above), the RAID internally would write it
to both disk 0 and disk 1. Thus, we can conclude
that the maximum bandwidth obtained during sequential writing to a mirrored array is (N/2) · S, or half the peak bandwidth.
Unfortunately, we obtain the exact same performance during a sequential read. One might think that a sequential read could do better,
because it only needs to read one copy of the data, not both. However, let’s use
an example to illustrate why this doesn’t help much. Imagine we need to read blocks 0, 1, 2, 3, 4, 5, 6, and 7. Let’s say we
issue the read of 0 to disk 0, the read of 1 to disk 2,
the read of 2 to disk 1, and the read of 3 to
disk 3. We continue by issuing reads to 4, 5, 6, and 7 to disks 0, 2, 1 and 3,
respectively. One might naively think that because we are utilizing all disks, we are achieving the full bandwidth of the array.
To see that this is not the case, however, consider the requests a single disk
receives (say disk 0). First, it gets a request for block 0; then, it gets a request for block 4 (skipping block 2). In fact, each disk
receives a request for every other block. While it is
rotating over the skipped block, it is not
delivering useful bandwidth to the client. Thus, each disk will only deliver half its peak bandwidth. And thus, the sequential read will only obtain a bandwidth of (N/2) · S MB/s. Random reads are the best case for a mirrored RAID. In this case, we can distribute the reads across all the disks, and thus obtain the full possible bandwidth. Thus, for random reads, RAID-1 delivers N · R MB/s. Finally, random writes perform as you might expect: (N/2) · R MB/s. Each logical write must turn into two
physical writes, and thus while all the disks
will be in use, the client will only perceive this as half the available bandwidth. Even though a write to logical block X turns into
two parallel writes to two different physical
disks, the bandwidth of many small requests only
achieves half of what we saw with striping. As we will soon see, getting half the available bandwidth is actually pretty
good!
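A tiny Python sketch of the sequential-read argument, assuming the round-robin issue order described above and the mirror layout of Table 38.3:

# Issue sequential reads 0..7 round-robin across the four disks of Table 38.3.
# Mirror pairs: disks 0/1 hold blocks 0,2,4,6; disks 2/3 hold blocks 1,3,5,7.
issue_order = [0, 2, 1, 3]            # disk chosen for blocks 0,1,2,3, then 4,5,6,7, ...
per_disk = {d: [] for d in range(4)}
for block in range(8):
    disk = issue_order[block % 4]
    per_disk[disk].append(block)

print(per_disk)
# {0: [0, 4], 1: [2, 6], 2: [1, 5], 3: [3, 7]}
# Disk 0 reads blocks 0 and 4 but must rotate past its copy of block 2 in between,
# so each disk delivers only about half its peak sequential bandwidth.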
RAID Level
2
Key characteristics:
• Uses a Hamming (or any other) error-correcting code (ECC).
• Intended for use with drives that do not have built-in error detection.
• The central idea: if one of the disks fails, the remaining bits of the byte and the associated ECC bits can be used to reconstruct the data.
• Not very popular.
• Bit-level striping with Hamming (ECC) codes for error correction.
• All 7 disk arms are synchronized and move in unison.
• Complicated controller.
• Single access at a time.
• Tolerates only one error, but with no performance degradation.
Memory-Style
ECC (RAID Level 2)
• Some disks in the array are used to hold ECC.
• Hamming codes are used as the ECC.
• Correcting a one-bit error in a 4-bit code word requires 3 redundant bits (a small sketch follows).
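The claim above (3 check bits protect a 4-bit code word) is the classic Hamming(7,4) code. The following minimal Python sketch illustrates the idea only, not a RAID-2 controller; it encodes 4 data bits into 7 and corrects any single flipped bit:

def hamming74_encode(d):
    """d: 4 data bits [d1, d2, d3, d4] -> 7-bit codeword p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    """c: 7-bit codeword, possibly with one flipped bit -> corrected data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # checks positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # checks positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # checks positions 4,5,6,7
    pos = s1 + 2 * s2 + 4 * s3       # 1-based position of the error, 0 if none
    if pos:
        c[pos - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]  # extract d1, d2, d3, d4

code = hamming74_encode([1, 0, 1, 1])
code[5] ^= 1                         # simulate one failed bit (or disk)
assert hamming74_correct(code) == [1, 0, 1, 1]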
RAID Level 3
Improves upon RAID 2; known as bit-interleaved parity.
• Disk controllers can detect whether a sector has been read correctly.
• Storage overhead is reduced: only 1 parity disk.
• At the expense of computing and writing parity.
• Needs dedicated parity hardware.
RAID 3: Bit-Interleaved Parity
• Reads and writes go to all disks in a group, with one extra disk to hold the check information in case there is a failure. Parity is simply the sum of the data on all the disks modulo 2. Lost data can be reconstructed by examining the parity. Every access goes to all disks.
• One disk in the array stores parity for the other disks.
• This is enough to correct the error, because the disk controller tells us which disk failed.
+ More efficient than Levels 1 and 2.
- The parity disk doesn't add bandwidth.
RAID Level 4
Stripes data at the block level across several drives, with parity stored on one drive (block-interleaved parity).
• Allows
recovery from the failure of any of the disks
• Performance
is very good for reads
• Writes require that the parity data be updated each time; this slows small random writes, but large writes are fairly fast.
RAID Level 4: Saving Space with Parity
We now present a different method of adding
redundancy to a disk array known as parity. Parity-based approaches attempt to
use less capacity and thus overcome the huge space penalty paid by mirrored
systems. They do so at a cost, however: performance. In a five-disk RAID-4
system, we might observe the following layout:
Disk 0   Disk 1   Disk 2   Disk 3   Disk 4
 0        1        2        3        P0
 4        5        6        7        P1
 8        9        10       11       P2
 12       13       14       15       P3
As you can see, for each stripe of data, we
have added a single parity block that
stores the redundant information for that stripe of blocks. For example, parity block P1 has redundant
information that it calculated from blocks
4, 5, 6, and 7. To compute parity, we need to use a mathematical function that
enables us to withstand the loss of any one block
from our stripe. It turns out the
simple function XOR does the trick quite nicely. For a given set of bits, the XOR of all of those bits returns
a 0 if there are an even number of 1’s in the
bits, and a 1 if there are an odd number of 1’s. For example:
C0   C1   C2   C3   P
0    0    1    1    XOR(0,0,1,1) = 0
0    1    0    0    XOR(0,1,0,0) = 1
In the first row (0,0,1,1), there are two 1’s (C2, C3),
and thus XOR of all of those values will be 0 (P); similarly, in the second row
there is only one 1 (C1), and thus the XOR must be 1 (P). You can remember this
in a very simple way: that the number of 1’s in any row must be an even (not odd) number; that is the invariant
that the RAID must maintain in order for parity to be correct. From the example
above, you might also be able to guess how parity information can be used to
recover from a failure. Imagine the column labeled C2 is lost. To figure out
what values must have been in the column, we simply have to read in all the
other values in that row (including the XOR’d parity bit) and reconstruct the right answer. Specifically, assume the
first row’s value in
column C2 is lost (it is a 1); by reading the other values in that row (0
from C0, 0 from C1, 1 from C3, and 0 from the parity column P), we get the values
0, 0, 1, and 0. Because we know that XOR keeps an even number of 1’s in each
row, we know what the missing data must be: a 1. And that is how reconstruction
works in a XOR-based parity scheme! Note also how we compute the reconstructed
value: we just XOR the data bits and the parity bits together, in the same way
that we calculated the parity in the first place. Now you might be wondering:
we are talking about XORing all of these bits, and yet above we know that the
RAID places 4KB (or larger) blocks on each disk; how do we apply XOR to a bunch
of blocks to compute the parity? It turns out this is easy as well. Simply
perform a bitwise XOR across each bit of the data blocks; put the result of
each bitwise XOR into the corresponding bit slot in the parity block. For
example, if we had blocks of size 4 bits (yes, this is still quite a bit
smaller than a 4KB block, but you get the picture), they might look something
like this:
Block0   Block1   Block2   Block3   Parity
00       10       11       10       11
10       01       00       01       10
As you can see from the figure, the parity is computed for each bit of each
block and the result placed in the parity block.
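A small Python sketch of the bytewise XOR just described: it computes the parity block for a stripe and then rebuilds a lost block from the survivors (the block contents are arbitrary examples):

from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"\x00\x10", b"\x22\x01", b"\x33\x00", b"\x10\x01"]   # 4 data blocks
parity = xor_blocks(data)                                     # the parity block

# Pretend the disk holding data[2] failed: rebuild it from the others plus parity.
survivors = data[:2] + data[3:] + [parity]
assert xor_blocks(survivors) == data[2]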
RAID-4 Analysis
Let us now analyze RAID-4. From a capacity
standpoint, RAID-4 uses 1 disk for
parity information for every group of disks it is protecting. Thus, our useful capacity for a RAID group is
(N-1). Reliability is also quite easy to
understand: RAID-4 tolerates 1 disk failure and no more. If more than one disk is lost, there is simply no
way to reconstruct the lost data. Finally, there is performance. This time,
let us start by analyzing steady state throughput. Sequential read performance can utilize all of the disks except for the parity disk, and thus
deliver a peak effective bandwidth of (N − 1) · S MB/s (an easy case). To understand the performance of sequential writes, we must first
understand how they are
done. When writing a big chunk of data to disk, RAID-4 can perform a simple
optimization known as a full-stripe write. For example, imagine the case where
the blocks 0, 1, 2, and 3 have been sent to the RAID as part of a write request (Table 38.4).
Disk 0   Disk 1   Disk 2   Disk 3   Disk 4
 0        1        2        3        P0
 4        5        6        7        P1
 8        9        10       11       P2
 12       13       14       15       P3
Table 38.4: Full-stripe writes in RAID-4
In this case, the RAID can simply calculate
the new value of P0 (by performing an XOR across the blocks 0, 1, 2, and 3) and
then write all of the blocks (including the parity block) to the five disks
above in parallel. Thus, full-stripe writes
are the most efficient way for RAID-4 to write to disk. Once we understand the
full-stripe write, calculating the performance of sequential writes on RAID-4 is
easy; the effective bandwidth is also (N − 1) · S MB/s. Even though the parity disk is constantly in use during the operation, the client does not gain a performance advantage from it. Now let us analyze the performance of
random reads. As you can also see from the figure above, a set of 1-block
random reads will be spread across the data disks of the system but not the
parity disk. Thus, the effective performance is: (N − 1) · R MB/s. Random
writes, which we have saved for last, present the most interesting case for
RAID-4. Imagine we wish to overwrite block 1 in the example above. We could
just go ahead and overwrite it, but that would leave us with a problem: the
parity block P0 would no longer accurately reflect the correct parity value for
the stripe. Thus, in this example, P0 must also be updated. But how can we
update it both correctly and efficiently? It turns out there are two methods.
The first, known as additive parity, requires us to do the following. To compute
the value of the new parity block, read in all of the other data blocks in the
stripe in parallel (in the example, blocks 0, 2, and 3) and XOR those with the
new block (1). The result is your new parity block. To complete the write, you
can then write the new data and new parity to their respective disks, also in parallel.
The problem with this technique is that it scales with the number of disks, and
thus in larger RAIDs requires a high number of reads to compute parity. Thus,
the subtractive parity method. For example, imagine this string of bits (4 data
bits, one parity):
C0   C1   C2   C3   P
0    0    1    1    XOR(0,0,1,1) = 0
Let’s imagine that we wish to overwrite bit C2 with a new value which we
will call C2 (new). The subtractive method works in three steps. First, we read
in the old data at C2 (C2 (old) = 1) and the old parity (P (old) = 0). Then, we
compare the old data and the new data; if they are the same (e.g., C2 (new) = C2
(old)), then we know the parity bit will also remain the same (i.e., P (new) = P
(old)). If, however, they are different, then we must flip the old parity bit
to the opposite of its current state, that is, if (P(old) == 1), P(new) will be
set to 0; if (P(old) == 0), P(new) will be set to 1. We can express this whole
mess neatly with XOR as it turns out (if you understand XOR, this will now make
sense to you): P(new) = (C(old) XOR C(new)) XOR P(old). Because we are
dealing with blocks, not bits, we perform this calculation over all the bits in
the block (e.g., 4096 bytes in each block multiplied by 8 bits per byte). Thus,
in most cases, the new block will be different than the old block and thus the
new parity block will too. You should now be able to figure out when we would
use the additive parity calculation and when we would use the subtractive method.
Think about how many disks would need to be in the system so that the additive method
performs fewer I/Os than the subtractive method; what is the cross-over point?
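A minimal Python sketch of the subtractive update applied bytewise over whole blocks, as the text describes; the comments note the I/O counts of both methods (the block contents are toy values):

def parity_update_subtractive(old_data, new_data, old_parity):
    """P(new) = (C(old) XOR C(new)) XOR P(old), applied to every byte of the block.
    Cost per small write: 2 reads (old data, old parity) + 2 writes (new data, new parity).
    The additive method instead reads the N-2 other data blocks and XORs them with the
    new data, so it performs fewer reads only when the array is quite small."""
    return bytes(od ^ nd ^ op for od, nd, op in zip(old_data, new_data, old_parity))

old_data, new_data = b"\x0b\x01", b"\x0c\x01"
other_blocks = [b"\x05\x05", b"\x00\x0f"]
old_parity = bytes(a ^ b ^ c for a, b, c in zip(old_data, *other_blocks))
new_parity = parity_update_subtractive(old_data, new_data, old_parity)
# Same answer as recomputing parity from scratch (the additive way):
assert new_parity == bytes(a ^ b ^ c for a, b, c in zip(new_data, *other_blocks))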
For this performance analysis, let us assume we are using the subtractive method.
Thus, for each write, the RAID has to perform 4 physical I/Os (two reads and
two writes). Now imagine there are lots of writes submitted to the RAID; how
many can RAID-4 perform in parallel? To understand, let us again look at the
RAID-4 layout (Figure 38.5).
Disk 0   Disk 1   Disk 2   Disk 3   Disk 4
 0        1        2        3        P0
 *4       5        6        7        +P1
 8        9        10       11       P2
 12       *13      14       15       +P3
Table 38.5: Example: Writes To 4, 13, and Respective Parity Blocks
Now imagine there were 2 small writes
submitted to the RAID-4 at about the same time, to blocks 4 and 13 (marked with * in the diagram). The data for those requests is on disks 0 and 1, and thus the read and write to the data could happen in
parallel, which is good. The problem that arises is with the parity disk; both
the requests have to read the related parity blocks for 4 and 13, parity blocks
1 and 3 (marked with +). Hopefully, the issue is now clear: the parity disk is
a bottleneck under this type of workload; we sometimes thus call this the small-write
problem for parity based RAIDs. Thus, even though the data disks could be
accessed in parallel, the parity disk prevents any parallelism from
materializing; all writes to the system will be serialized because of the
parity disk. Because the parity disk has to perform two I/Os (one read, one
write) per logical I/O, we can compute the performance of small random writes
in RAID-4 by computing the parity disk’s performance on those two I/Os, and thus we achieve (R/2) MB/s. RAID-4
throughput under random small writes is terrible; it does not improve as you
add disks to the system. We
conclude by analyzing I/O latency in RAID-4. As you now know, a single read
(assuming no failure) is just mapped to a single disk, and thus its latency is equivalent to the
latency of a single disk request. The latency of a single write requires two
reads and then two writes; the reads can happen in parallel, as can the
writes, and thus total latency is about twice that of a single disk (with some
differences because we have to wait for both reads to complete and thus get
the worst-case positioning time, but then the updates don’t incur seek
cost and thus may be a better-than-average positioning cost).
RAID Level
5: Rotating Parity
To address the small-write problem (at
least, partially), Patterson, Gibson, and Katz introduced RAID-5. RAID-5 works
almost identically to RAID-4,
except that it rotates the parity block across drives (Figure 38.6).
Disk 0   Disk 1   Disk 2   Disk 3   Disk 4
 0        1        2        3        P0
 5        6        7        P1       4
 10       11       P2       8        9
 15       P3       12       13       14
 P4       16       17       18       19
Table 38.6: RAID-5 With Rotated Parity
As you can see, the parity block for each stripe is now rotated across the
disks, in order to remove the parity-disk bottleneck for RAID-4.
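One way to express the rotated layout of Table 38.6 in code; this Python sketch assumes the "left-symmetric" placement shown in that table, which is only one of several possible RAID-5 layouts:

def raid5_map(logical_block, num_disks):
    """Map a logical block to (data_disk, parity_disk, row) in the Table 38.6 layout."""
    stripe      = logical_block // (num_disks - 1)         # one parity block per stripe
    index       = logical_block % (num_disks - 1)          # position among the data blocks
    parity_disk = (num_disks - 1) - (stripe % num_disks)   # parity rotates right to left
    data_disk   = (parity_disk + 1 + index) % num_disks    # data wraps around after parity
    return data_disk, parity_disk, stripe

# Matches Table 38.6 (5 disks): block 4 sits on disk 4, its parity (P1) on disk 3.
assert raid5_map(4, 5) == (4, 3, 1)
# Block 10 sits on disk 0, its parity (P2) on disk 2 (the example used in the analysis below).
assert raid5_map(10, 5) == (0, 2, 2)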
RAID-5 Analysis
Much of the analysis for RAID-5 is
identical to RAID-4. For example, the effective capacity and failure tolerance
of the two levels are identical. So are sequential read and write performance.
The latency of a single request (whether a read or a write) is also the same as
RAID-4. Random read performance is a little better, because we can utilize all of the disks. Finally, random write performance improves noticeably over RAID-4, as it allows for parallelism across requests. Imagine a write to block 1 and a write to block 10; this will turn into requests to disk 1 and disk 4 (for block 1 and its parity) and requests to disk 0 and disk 2 (for block 10 and its parity). Thus, they can proceed in parallel. In fact, we can generally assume that given a large number of random requests, we will be able to keep all the disks about evenly busy. If that is the case, then our total bandwidth for small writes will be (N/4) · R MB/s. The factor of four loss is due to the fact that each RAID-5 write still generates 4 total I/O operations, which is simply the cost of using parity-based RAID.
                     RAID-0    RAID-1           RAID-4        RAID-5
Capacity             N         N/2              N - 1         N - 1
Reliability          0         1 (for sure)     1             1
                               N/2 (if lucky)
Throughput
  Sequential Read    N · S     (N/2) · S        (N - 1) · S   (N - 1) · S
  Sequential Write   N · S     (N/2) · S        (N - 1) · S   (N - 1) · S
  Random Read        N · R     N · R            (N - 1) · R   N · R
  Random Write       N · R     (N/2) · R        (1/2) · R     (N/4) · R
Latency
  Read               D         D                D             D
  Write              D         D                2D            2D
Table 38.7: RAID Capacity, Reliability, and Performance
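As a small companion to the table, this Python sketch evaluates the throughput rows for a concrete array; N, S, and R are free parameters, and the values below simply reuse the numbers worked out earlier:

N, S, R = 4, 47.62, 0.98     # disks, sequential MB/s per disk, random MB/s per disk

throughput = {
    #            seq read      seq write     rand read     rand write
    "RAID-0": (  N * S,        N * S,        N * R,        N * R       ),
    "RAID-1": (  (N / 2) * S,  (N / 2) * S,  N * R,        (N / 2) * R ),
    "RAID-4": (  (N - 1) * S,  (N - 1) * S,  (N - 1) * R,  R / 2       ),
    "RAID-5": (  (N - 1) * S,  (N - 1) * S,  N * R,        (N / 4) * R ),
}
for level, (sr, sw, rr, rw) in throughput.items():
    print(f"{level}: seq read {sr:.1f}, seq write {sw:.1f}, "
          f"rand read {rr:.2f}, rand write {rw:.2f} MB/s")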
Because RAID-5 is basically identical to RAID-4 except in the few cases where it is better, it has almost completely replaced RAID-4 in the marketplace. The only place where it has not is in systems that know they will never perform anything other than large writes, thus avoiding the small-write problem altogether [HLM94]; in those cases, RAID-4 is sometimes used, as it is slightly simpler to build.
RAID Level 6
For a RAID 6 array with n disks:
• Data is divided into n-2 strips.
• Two pieces of parity information are computed over the n-2 strips (using a Reed-Solomon code).
• The parity information is spread across the disks of the array.
RAID Advantages
• RAID provides an environment of highly reliable, fault-tolerant disk drive subsystems.
• RAID adds life to disk drives, as the controller manages the I/O load equally amongst all the drives in the array, reducing the risk of a single point of failure.
• A successful RAID environment includes reliable, robust disk drives.
• Each RAID environment will include multiple disk drives.
• Mylex PCI RAID controllers can attach as many as 45 drives per RAID controller, and up to 16 controllers per system, for as many as 720 disk drives per system configuration.
• Mylex external RAID controllers can attach as many as 90 drives per storage subsystem and can reside as part of a SAN (Storage Area Network).
• Capacities range from terabytes to petabytes!
REFERENCE: PPT and book chapter "Redundant Arrays of Inexpensive Disks (RAIDs)".
Stripes data at a block level across several
• Allows recovery from the failure of any of the disks
• Performance is very good for reads
• Writes require that parity data be updated each time. Slows
small
butlarge writes are fairly fast
Stripes data a RAID
(Redundant Arrays of
Inexpensive Disks)
In 1987, Patterson, Gibson and Katz at
the University of California Berkeley, published a paper entitled “A Case for
Redundant Array of Inexpensive Disks (RAID)”.Described the various types of Disk
Arrays, referred to as the acronym RAID. The basic idea of RAID was to combine
multiple, small inexpensive disks drive into an array of disk drives which
yields performance exceeding that of a Single, Large Expensive
Drive(SLED).Additionally this array of drives appear to the computer as a
single logical storage unit or drive. In a SLED Reliabity becomes a big problem as the data
in an entire disk may be lost. As the number of disks per component increases, the
probability of failure also increases .Suppose a (reliable) disk fails every
100,000 hrs. Reliabity of a disk in an array of N disks = Reliability of 1
disk/ N 100000hrs / 100 = 1000 hrs. = 41.66 days. RAID can improve availability
and throughput (although actually reliability – whether anything is broken –
suffers because of the larger number of disks).Data is stored on several disks
instead of a single disk. It’s a technology that enables
greater levels of performance, reliability and/or large volumes when dealing
with data. Disks are small (physically) and cheap, so it’s easy to put
lots of disks (10s to 100s) in one box for increased storage, performance, and availability.
Data plus some redundant information is striped across the disks in some way. Standard way of organizing disks and
classifying the reliability of multi-disk systems. General methods: data duplication, parity, and
error-correcting codes (ECC).
RAID idea: use redundancy to improve
performance and reliability. Redundant array of cheap disks as one storage
unit. Fast: simultaneous read and write disks in the array
Reliable: use parity to detect and correct
errors
RAID can have different redundancy levels, achieving
different performance and reliability. Seven different RAID levels (0-6). Basic
idea is to connect multiple disks together to provide large storage capacity
faster access to reading data redundant data. Many different levels of RAID
systems differing levels of redundancy, error checking, capacity, and cost.
Where can I use RAID?
v LANs/WANs
v SANs
v Clustering
environments
v Mission
critical installations
v News centers
v Internet News
Servers
v Enterprise
Servers
v Performance
Desktop
v Systems
v PC Workstations
v Workgroup/File
Servers
v E-Mail Servers
v Intranet/Web
Servers
v Application Servers
RAID
Applications
Ø Transfer Rate
Intensive Applications: (Typically RAID 0 environments) RAID
Ø Striping is ideal for transfer rate-intensive
environments
Ø A transfer
rate-intensive environment consists of :
Ø Applications that require a large amount of data to be
processed in a fixed amount of time.
Ø Video playback and video editing are typical transfer rate
intensive environments
Ø Photo processing, manipulation and rendering
Ø Request Rate
Intensive Applications: (Typically RAID 5 environments)
Ø RAID is used in highly multi-tasking, request
rate-intensive environments
Ø A request
rate-intensive environment consists of:
Ø Data bases, file/web servers: -- high number of random
smaller requests.
Ø A RAID drive
can be configured to process each request within a stripe, allowing multiple
requests to be processed in parallel.
Improvement
of Reliability via Redundancy
It has
two parts
1. Mirroring
2. Data Striping.
Mirroring:
Ø It performs following tasks:
Ø Duplicate every disk
Ø Logical disk consists of two physical disks.
Ø Every write is carried out on both disks.
Ø If one of the disk fails, data read from the other
Ø Data permanently lost only if the second disk fails
before the first failed disk is replaced.
Example of mirroring in raid
Suppose mean time to repair is 10 hrs. The
mean time to data loss of a mirrored disk system is
100,000 ^ 2 / (2 * 10) hrs. ~ 57,000
years.
Parallel Disk Systems
• We cannot improve the disk
performance significantly as a single drive
- But many applications require high
performance storage systems?
Solutions:
- Parallel Disk Systems
- Higher Reliability and Higher
data-transfer rate.
DATA
STRIPING
Ø It performs following tasks:
Ø Fundamental to RAID
v A method of concatenating multiple drives into one
logical storage unit.
Ø Splitting the bits of each byte across multiple disks:
bit – level striping
Ø E.g. an array of eight disks, write bit i of each byte
to disk I
Ø Sectors are eight times the normal size
Ø Eight times the access rate
Ø Similarly for blocks of file, block-level striping
AI RAID
LEVELS
The RAPID level defines how the disk
are organized, reliability and performance will be improved.
v Data are distributed across the array of disk drives
v Redundant disk capacity is used to store parity
information, which
v Guarantees data recoverability in case of a disk
failure
v
Levels decided
according to schemes to provide redundancy at lower cost by using
striping and “parity” bits.
Levels 0
This level offers no redundancy – no extra data is kept.
The performance is the best of any level.
Throughput is increased by striping data across several disks. It splits
data among two or more disks. It provides good performance. It has lack of data redundancy means there is no fail over support
with this configuration. In the diagram to the right, the odd blocks are written
to disk 0 and the even blocks to disk 1 such that A1, A2, A3, A4, would be the
order of blocks read if read sequentially from the beginning. Used I read only NFS
systems and gaming systems.
RAID Level 0: Striping
The first
RAID level is actually not a RAID level at all, in that there is no redundancy.
However, RAID level 0, or striping as it is better known, serves as an
excellent upper-bound on performance and capacity and thus is worth
understanding. The simplest form of striping will stripe blocks across the
disks of the system as follows (assume here a 4-disk array):
Disk
0 Disk 1 Disk 2 Disk 3
0 1 2 3
4 5
6 7
8 9 10 11
12 13 14 15
Table
38.1: RAID-0: Simple Striping
From
Table 38.1, you get the basic idea: spread the blocks of the array across the
disks in a round-robin fashion. This approach is designed to extract the most
parallelism from the array when requests are made for contiguous chunks of the
array (as in a large, sequential read, for example).We call the blocks in the
same row a stripe; thus, blocks 0, 1, 2, and 3 are in the same stripe above. In
the example, we have made the simplifying assumption that only 1 block (each of
say size 4KB) is placed on each disk before moving on to the next. However,
this arrangement need not be the case. For example, we could arrange the blocks
across disks as in Table 38.2:
Disk 0 Disk 1 Disk 2 Disk 3
0 2 4 6 chunk size:
1 3 5 7 2 blocks
8 10 12 14
9 11 13 15
Table
38.2: Striping with a Bigger Chunk Size
In this
example, we place two 4KB blocks on each disk before moving on to the next
disk. Thus, the chunk size of this RAID array is 8KB, and a stripe thus
consists of 4 chunks or 32KB of data.
ASIDE: THE RAID MAPPING PROBLEM
Before
studying the capacity, reliability, and performance characteristics of the
RAID, we first present an aside on what we call the mapping problem. This
problem arises in all RAID arrays; simply put, given a logical block to read or
write, how does the RAID know exactly which physical disk and offset to access?
For these simple RAID levels, we do not need much sophistication in order to
correctly map logical blocks onto their physical locations. Take the first
striping example above (chunk size = 1 block = 4KB). In this case, given a
logical block address A, the RAID can easily compute the desired disk and offset
with two simple equations:
Disk = A %
number_of_disks
Offset = A
/ number_of_disks
Note that
these are all integer operations (e.g., 4 / 3 = 1 not 1.33333...).Let’s see how
these equations work for a simple example. Imagine in the first RAID above that
a request arrives for block 14. Given that there are 4 disks, this would mean
that the disk we are interested in is (14 % 4 = 2): disk 2. The exact block is
calculated as (14 / 4 = 3): block 3. Thus, block 14 should be found on the
fourth block (block 3, starting at 0) of the third disk (disk 2, starting at 0),
which is exactly where it is.You can think about how these equations would be
modified to support different chunk sizes. Try it! It’s not too hard.
Chunk
Sizes
Chunk
size mostly affects performance of the array. For example, a small chunk size
implies that many files will get striped across many disks, thus increasing the
parallelism of reads and writes to a single file; however, the positioning time
to access blocks across multiple disks increases, because the positioning time
for the entire request is determined by themaximu of the positioning times of the
requests across all drives. A big chunk size, on the other hand, reduces such
intra-file parallelism, and thus relies on multiple concurrent requests to
achieve high throughput. However, large chunk sizes reduce positioning time;
if, for example, a single file fits within a chunk and thus is placed on a
single disk, the positioning time incurred while accessing it will just be the
positioning time of a single disk. Thus, determining the “best” chunk size is
hard to do, as it requires a great deal of knowledge about the workload
presented to the disk system [CL95]. For the rest of this discussion, we will
assume that the array uses a chunk size of a single block (4KB). Most arrays
use larger chunk sizes (e.g., 64 KB), but for the issues we discuss below, the
exact chunk size does not matter; thus we use a single block for the sake of
simplicity.
Back To
RAID-0 Analysis
Let
us now evaluate the capacity, reliability, and performance of striping. From
the perspective of capacity, it is perfect: given N disks, striping delivers N
disks worth of useful capacity. From the standpoint of reliability, striping is
also perfect, but in the bad way: any disk failure will lead to data loss.
Finally, performance is excellent: all disks are utilized, often in parallel,
to service user I/O requests.
Evaluating
RAID Performance
In
analyzing RAID performance, one can consider two different performance metrics.
The first is single-request latency. Understanding the latency of a single I/O
request to a RAID is useful as it reveals how much parallelism can exist during
a single logical I/O operation. The second is steady-state throughput of the
RAID, i.e., the total bandwidth of many concurrent requests. Because RAIDs are
often used in high-performance environments, the steady-state bandwidth is
critical, and thus will be the main focus of our analyses. To understand
throughput in more detail, we need to put forth some workloads of interest. We
will assume, for this discussion, that there are two types of workloads:
sequential and random. With a sequential workload, we assume that requests to
the array come in large contiguous chunks; for example, a request (or series of
requests) that accesses 1 MB of data, starting at block (B) and ending at block
(B + 1 MB), would be deemed sequential. Sequential workloads are common in many
environments (think of searching through a large file for a keyword), and thus
are considered important. For random workloads, we assume that each request is
rather small, and that each request is to a different random location on disk.
For example, a random stream of requests may first access 4KB at logical
address 10, then at logical address 550,000, then at 20,100, and so forth. Some
important workloads, such as transactional workloads on a database management system
(DBMS), exhibit this type of access pattern, and thus it is considered an
important workload. Of course, real workloads are not so simple, and often have
a mix of sequential and random-seeming components as well as behaviors in
between the two. For simplicity, we just consider these two possibilities. As
you can tell, sequential and random workloads will result in widely different
performance characteristics from a disk. With sequential access, a disk
operates in its most efficient mode, spending little time seeking and waiting
for rotation and most of its time transferring data. With random access, just
the opposite is true: most time is spent seeking and waiting for rotation and
relatively little time is spent transferring data. To capture this difference
in our analysis, we will assume that a disk can transfer data at S MB/s under a
sequential workload, and R MB/s when under a random workload. In general, S is
much greater than Rotor make sure we understand this difference, let’s do a
simple exercise. Specifically, let’s calculate S and R given the following disk
characteristics. Assume a sequential transfer of size 10 MB on average, and a random
transfer of 10 KB on average. Also, assume the following disk
Characteristics:
Average
seek time 7 ms. Average rotational delay
3 ms.Transfer rate of disk 50 MB/s. To compute S, we need to first figure out
how time is spent in a typical 10 MB transfer. First, we spend 7 ms seeking,
and then 3 ms rotating. Finally, transfer begins; 10 MB @ 50 MB/s leads to
1/5th of a second, or 200 ms, spent in transfer. Thus, for each 10 MB request,
we spend 210 ms completing the request. To compute S, we just need to divide:
S =
Amount of Data
Time
to access = 10 MB
210
ms = 47.62 MB/s
As we
can see, because of the large time spent transferring data, S is very near the
peak bandwidth of the disk (the seek and rotational costs have been amortized).We
can compute R similarly. Seek and rotation are the same; we then compute the
time spent in transfer, which is 10 KB @ 50 MB/s, or 0.195ms.
R =
Amount of Data
Time to
access = 10 KB
10.195 ms
= 0.981 MB/s
As we can
see, R is less than 1 MB/s, and S/R is almost 50.
Back To
RAID-0 Analysis, Again
Let’s now
evaluate the performance of striping. As we said above, it is generally good.
From a latency perspective, for example, the latency of a single-block request
should be just about identical to that of a single disk; after all, RAID-0 will
simply redirect that request to one of its disks. From the perspective of
steady-state throughput, we’d expect to get the full bandwidth of the system.
Thus, throughput equals N (the number of disks) multiplied by S (the sequential
bandwidth of a single disk). For a large number of random I/Os, we can again
use all of the disks, and thus obtain N · R MB/s. As we will see below, these
values are both the simplest to calculate and will serve as an upper bound in
comparison with other RAID levels.
RAID Level 1
Ø It performs following tasks:
Ø A complete file is
stored on a single disk.
Ø A second disk contains an exact copy of the file
Ø Provides complete redundancy of data
Ø Read performance can be improved
Ø file data can be read in parallel
Ø Write performance suffers must write
the data out twice
Ø Most expensive RAID implementation
Ø requires twice as much storage space
RAID Level
1: Mirroring
Our first
RAID level beyond striping is known as RAID level 1, or mirroring. With a mirrored system, we simply make more than
one copy of each block in the system; each
copy should be placed on a separate disk,
of course. By doing so, we can tolerate disk failures. In a typical mirrored
system, we will assume that for each logical block,
the RAID keeps two physical copies of it. Here is an example:
Disk 0 Disk 1 Disk 2 Disk 3
0 0 1 1
2 2 3 3
4 4 5 5
6 6 7 7
Table
38.3: Simple RAID-1: Mirroring
In the
example, disk 0 and disk 1 have identical contents, and disk 2 and disk 3 do as
well; the data is striped across these mirror pairs. In fact, you may have
noticed that there are a number of different ways to place block copies across
the disks. The arrangement above is a common one and is sometimes called
RAID-10 or (RAID 1+0) because it uses mirrored pairs (RAID-1) and then stripes
(RAID-0) on top of them; another common arrangement is RAID-01 (or RAID 0+1),
which contains two large striping (RAID-0) arrays, and then mirrors (RAID-1) on
top of them. For now, we will just talk about mirroring assuming the above
layout. When reading a block from a mirrored array, the RAID has a choice: it can
read either copy. For example, if a read to logical block 5 is issued to the
RAID, it is free to read it from either disk 2 or disk 3. When writing a block,
though, no such choice exists: the RAID must update both copies of the data, in
order to preserve reliability. Do note, though, that these writes can take
place in parallel; for example, a write to logical block 5 could proceed to
disks 2 and 3 at the same time.
RAID-1 Analysis
Let us
assess RAID-1. From a capacity standpoint, RAID-1 is expensive; with the
mirroring level = 2, we only obtain half of our peak useful capacity. Thus,
with N disks, the useful capacity of mirroring is N/2. From a reliability
standpoint, RAID-1 does well. It can tolerate the failure of any one disk. You
may also notice RAID-1 can actually do better than this, with a little luck.
Imagine, in the figure above, that disk 0 and disk 2 both failed. In such a
situation, there is no data loss! More generally, a mirrored system (with
mirroring level of 2) can tolerate 1 disk failure for certain, and up to N/2
failures depending on which disks fail. In practice, we generally don’t like to
leave things like this to chance; thus most people consider mirroring to be
good for handling a single failure. Finally, we analyze performance. From the
perspective of the latency of a single read request, we can see it is the same
as the latency on a single disk; all the RAID-1 does is direct the read to one
of its copies. A write is a little different: it requires two physical writes
to complete before it is done. These two writes happen in parallel, and thus
the time will be roughly equivalent to the time of a single write; however,
because the logical write must wait for both physical writes to complete, it
suffers the worst-case seek and rotational delay of the two requests, and thus
(on average) will be slightly higher than a write to a single disk.
ASIDE: THE
RAID CONSISTENT-UPDATE PROBLEM
Before
analyzing RAID-1, let us first discuss a problem that arises in any multi-disk RAID system, known as the consistent-update
problem [DAA05]. The problem occurs on a
write to any RAID that has to update multiple disks during a single logical
operation. In this case, let us assume we are
considering a mirrored disk array. Imagine the write is issued to the RAID, and
then the RAID decides that it must be
written to two disks, disk 0 and disk 1. The RAID then issues the write to disk 0, but just before the RAID can issue the
request to disk 1, a power loss (or system crash)
occurs. In this unfortunate case, let us assume
that the request to disk 0 completed (but clearly the request to disk 1 did not, as it was never issued).The result of this
untimely power loss is that the two copies of the block are now inconsistent; the copy on disk 0 is the new version,
and the copy on disk 1 is the old. What we would
like to happen is for the state of both disks
to change atomically, i.e., either both should end up as the new version or neither. The general way to solve this problem is
to use a write-ahead log of some kind
to first record what the RAID is about to do (i.e., update two disks with a certain piece of data) before doing it. By taking this
approach, we can ensure that in the presence of a
crash, the right thing will happen; by running
a recovery procedure that replays all pending transactions to the RAID, we can ensure that no two mirrored copies (in the
RAID-1 case) are out of sync. One last note: because logging to disk on every
write is prohibitively expensive, most RAID hardware
includes a small amount of non-volatile RAM
(e.g., battery-backed) where it performs this type of logging. Thus, consistent update is provided without the high cost of logging to disk.
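The following Python sketch illustrates the write-ahead-logging idea from the aside. The names (nvram_log, mirrored_write, recover) are invented for illustration; real RAID hardware does this in battery-backed NVRAM and issues the physical writes in parallel.

nvram_log = []                       # stands in for battery-backed NVRAM

def mirrored_write(block_addr, data, disks):
    entry = {"addr": block_addr, "data": data, "done": False}
    nvram_log.append(entry)          # 1. record the intent first
    for d in disks:                  # 2. then issue both physical writes
        d[block_addr] = data
    entry["done"] = True             # 3. mark the transaction complete

def recover(disks):
    # After a crash, replay every transaction not marked complete so
    # both mirrors end up holding the same (new) contents.
    for entry in nvram_log:
        if not entry["done"]:
            for d in disks:
                d[entry["addr"]] = entry["data"]
            entry["done"] = True

disk0, disk1 = {}, {}
mirrored_write(5, b"new data", [disk0, disk1])
recover([disk0, disk1])              # a no-op here; repairs state after a crash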
To analyze steady-state throughput, let us start with the sequential workload. When writing out to disk sequentially, each logical write must result in two physical writes; for example, when we write logical block 0 (in the table above), the RAID internally would write it to both disk 0 and disk 1. Thus, we can conclude that the maximum bandwidth obtained during sequential writing to a mirrored array is (N/2) · S MB/s, or half the peak bandwidth.
Unfortunately, we obtain the exact same performance during a sequential read. One might think that a sequential read could do better, because it only needs to read one copy of the data, not both. However, let's use an example to illustrate why this doesn't help much. Imagine we need to read blocks 0, 1, 2, 3, 4, 5, 6, and 7. Let's say we issue the read of 0 to disk 0, the read of 1 to disk 2, the read of 2 to disk 1, and the read of 3 to disk 3. We continue by issuing reads to 4, 5, 6, and 7 to disks 0, 2, 1, and 3, respectively. One might naively think that because we are utilizing all disks, we are achieving the full bandwidth of the array. To see that this is not the case, however, consider the requests a single disk receives (say disk 0). First, it gets a request for block 0; then, it gets a request for block 4 (skipping block 2). In fact, each disk receives a request for every other block. While it is rotating over the skipped block, it is not delivering useful bandwidth to the client. Thus, each disk will only deliver half its peak bandwidth, and the sequential read will only obtain a bandwidth of (N/2) · S MB/s.
Random reads are the best case for a mirrored RAID. In this case, we can distribute the reads across all the disks and thus obtain the full possible bandwidth: for random reads, RAID-1 delivers N · R MB/s. Finally, random writes perform as you might expect: (N/2) · R MB/s. Each logical write must turn into two physical writes, and thus while all the disks will be in use, the client will only perceive this as half the available bandwidth. Even though a write to logical block X turns into two parallel writes to two different physical disks, the bandwidth of many small requests only achieves half of what we saw with striping. As we will soon see, getting half the available bandwidth is actually pretty good!
RAID Level 2
It performs the following tasks:
• Uses Hamming (or any other) error-correcting code (ECC)
• Intended for use with drives that do not have built-in error detection
• Central idea: if one of the disks fails, the remaining bits of the byte and the associated ECC bits can be used to reconstruct the data
• Not very popular
• Bit-level striping with Hamming (ECC) codes for error correction
• All 7 disk arms are synchronized and move in unison
• Complicated controller
• Single access at a time
• Tolerates only one error, but with no performance degradation
Memory-Style
ECC (RAID Level 2)
Ø Some disks in the array are used to hold ECC
Ø Uses Hamming codes as the ECC
Ø Correcting a one-bit error in a 4-bit code word requires 3 redundant bits (see the sketch below)
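As a concrete illustration of that last bullet, here is a small Python sketch of the standard Hamming(7,4) construction (helper names are invented for illustration): 3 parity bits protect 4 data bits, and recomputing the parities after the fact locates any single flipped bit so it can be corrected.

def hamming74_encode(d):            # d = [d1, d2, d3, d4]
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4               # covers codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4               # covers positions 2,3,6,7
    p4 = d2 ^ d3 ^ d4               # covers positions 4,5,6,7
    return [p1, p2, d1, p4, d2, d3, d4]   # codeword positions 1..7

def hamming74_correct(c):
    c = list(c)
    # Recompute each parity over the positions it covers; the binary
    # number (s4 s2 s1) is the 1-based position of a single flipped bit.
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:                    # non-zero => flip the bit it points at
        c[syndrome - 1] ^= 1
    return c

codeword = hamming74_encode([1, 0, 1, 1])
corrupted = list(codeword)
corrupted[5] ^= 1                   # flip one bit (position 6)
assert hamming74_correct(corrupted) == codeword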
RAID Level 3
Improves upon RAID 2; known as Bit-Interleaved Parity
• Disk controllers can detect whether a sector has been read correctly
• Storage overhead is reduced: only 1 parity disk
• Expense of computing and writing parity
• Needs dedicated parity hardware
RAID 3: Bit-Interleaved Parity
l Reads and writes go to all disks in a
group, with one extra disk to hold the check information in case there is a
failure. Parity is simply the sum of the data in all the disks modulo 2. Lost data can be reconstructed by examining
the parity. Every access goes to all disks.
l One disk in the array stores parity
for the other disks
¡ Enough to correct the error when the disk controller identifies which disk has failed.
+ More efficient than Levels 1 and 2
- Parity disk doesn’t add bandwidth
RAID Level 4
Stripes data at a block level across several drives, with parity stored on one drive (block-interleaved parity)
• Allows
recovery from the failure of any of the disks
• Performance
is very good for reads
• Writes
require that parity data be updated each time. Slows small random writes but
large writes are fairly fast
RAID Level 4: Saving Space with Parity
We now present a different method of adding
redundancy to a disk array known as parity. Parity-based approaches attempt to
use less capacity and thus overcome the huge space penalty paid by mirrored
systems. They do so at a cost, however: performance. In a five-disk RAID-4
system, we might observe the following layout:
Disk 0   Disk 1   Disk 2   Disk 3   Disk 4
0        1        2        3        P0
4        5        6        7        P1
8        9        10       11       P2
12       13       14       15       P3
As you can see, for each stripe of data, we
have added a single parity block that
stores the redundant information for that stripe of blocks. For example, parity block P1 has redundant
information that it calculated from blocks
4, 5, 6, and 7. To compute parity, we need to use a mathematical function that
enables us to withstand the loss of any one block
from our stripe. It turns out the
simple function XOR does the trick quite nicely. For a given set of bits, the XOR of all of those bits returns
a 0 if there are an even number of 1’s in the
bits, and a 1 if there are an odd number of 1’s. For example:
C0  C1  C2  C3    P
0   0   1   1     XOR(0,0,1,1) = 0
0   1   0   0     XOR(0,1,0,0) = 1
In the first row (0,0,1,1), there are two 1’s (C2, C3),
and thus XOR of all of those values will be 0 (P); similarly, in the second row
there is only one 1 (C1), and thus the XOR must be 1 (P). You can remember this
in a very simple way: that the number of 1’s in any row must be an even (not odd) number; that is the invariant
that the RAID must maintain in order for parity to be correct. From the example
above, you might also be able to guess how parity information can be used to
recover from a failure. Imagine the column labeled C2 is lost. To figure out
what values must have been in the column, we simply have to read in all the
other values in that row (including the XOR’d parity bit) and reconstruct the right answer. Specifically, assume the
first row’s value in
column C2 is lost (it is a 1); by reading the other values in that row (0
from C0, 0 from C1, 1 from C3, and 0 from the parity column P), we get the values
0, 0, 1, and 0. Because we know that XOR keeps an even number of 1’s in each
row, we know what the missing data must be: a 1. And that is how reconstruction
works in an XOR-based parity scheme! Note also how we compute the reconstructed
value: we just XOR the data bits and the parity bits together, in the same way
that we calculated the parity in the first place. Now you might be wondering:
we are talking about XORing all of these bits, and yet above we know that the
RAID places 4KB (or larger) blocks on each disk; how do we apply XOR to a bunch
of blocks to compute the parity? It turns out this is easy as well. Simply
perform a bitwise XOR across each bit of the data blocks; put the result of
each bitwise XOR into the corresponding bit slot in the parity block. For
example, if we had blocks of size 4 bits (yes, this is still quite a bit
smaller than a 4KB block, but you get the picture), they might look something
like this:
Block0  Block1  Block2  Block3  Parity
00      10      11      10      11
10      01      00      01      10
As you can see from the figure, the parity is computed for each bit of each
block and the result placed in the parity block.
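The same calculation is easy to express in code. Below is a minimal Python sketch (illustrative only; the block values match the 2-bit rows above, and the helper name is invented) that computes a parity block with bitwise XOR and then rebuilds a lost block from the surviving blocks plus parity.

def xor_blocks(blocks):
    """Bitwise XOR across same-sized blocks (bytes objects)."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

data = [b"\x00", b"\x02", b"\x03", b"\x02"]   # blocks 00, 10, 11, 10 (as bits)
parity = xor_blocks(data)                      # -> 0b11, as in the first row above

# Reconstruction: if block 2 is lost, XOR the remaining data blocks with
# the parity block to recover its contents.
survivors = [data[0], data[1], data[3], parity]
assert xor_blocks(survivors) == data[2]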
RAID-4 Analysis
Let us now analyze RAID-4. From a capacity
standpoint, RAID-4 uses 1 disk for
parity information for every group of disks it is protecting. Thus, our useful capacity for a RAID group of N disks is (N − 1) disks' worth of blocks. Reliability is also quite easy to understand: RAID-4 tolerates 1 disk failure and no more. If more than one disk is lost, there is simply no way to reconstruct the lost data. Finally, there is performance. This time, let us start by analyzing steady-state throughput. Sequential read performance can utilize all of the disks except for the parity disk, and thus deliver a peak effective bandwidth of (N − 1) · S MB/s (an easy case). To understand the performance of sequential writes, we must first
understand how they are
done. When writing a big chunk of data to disk, RAID-4 can perform a simple
optimization known as a full-stripe write. For example, imagine the case where
the blocks 0, 1, 2, and 3 have been sent to the RAID as part of a write request (Table 38.4).
Disk 0   Disk 1   Disk 2   Disk 3   Disk 4
0        1        2        3        P0
4        5        6        7        P1
8        9        10       11       P2
12       13       14       15       P3
Table 38.4: Full-Stripe Writes in RAID-4
In this case, the RAID can simply calculate
the new value of P0 (by performing an XOR across the blocks 0, 1, 2, and 3) and
then write all of the blocks (including the parity block) to the five disks above in parallel (the first row of the table above). Thus, full-stripe writes are the most efficient way for RAID-4 to write to disk. Once we understand the full-stripe write, calculating the performance of sequential writes on RAID-4 is easy; the effective bandwidth is also (N − 1) · S MB/s. Even though the parity disk is constantly in use during the operation, the client does not gain a performance advantage from it. Now let us analyze the performance of
random reads. As you can also see from the figure above, a set of 1-block
random reads will be spread across the data disks of the system but not the
parity disk. Thus, the effective performance is: (N − 1) · R MB/s. Random
writes, which we have saved for last, present the most interesting case for
RAID-4. Imagine we wish to overwrite block 1 in the example above. We could
just go ahead and overwrite it, but that would leave us with a problem: the
parity block P0 would no longer accurately reflect the correct parity value for
the stripe. Thus, in this example, P0 must also be updated. But how can we
update it both correctly and efficiently? It turns out there are two methods.
The first, known as additive parity, requires us to do the following. To compute
the value of the new parity block, read in all of the other data blocks in the
stripe in parallel (in the example, blocks 0, 2, and 3) and XOR those with the
new block (1). The result is your new parity block. To complete the write, you
can then write the new data and new parity to their respective disks, also in parallel.
The problem with this technique is that it scales with the number of disks, and thus in larger RAIDs requires a high number of reads to compute parity. This motivates the second approach, the subtractive parity method. For example, imagine this string of bits (4 data bits, one parity):
C0  C1  C2  C3    P
0   0   1   1     XOR(0,0,1,1) = 0
Let’s imagine that we wish to overwrite bit C2 with a new value which we
will call C2 (new). The subtractive method works in three steps. First, we read
in the old data at C2 (C2 (old) = 1) and the old parity (P (old) = 0). Then, we
compare the old data and the new data; if they are the same (e.g., C2 (new) = C2
(old)), then we know the parity bit will also remain the same (i.e., P (new) = P
(old)). If, however, they are different, then we must flip the old parity bit
to the opposite of its current state, that is, if (P(old) == 1), P(new) will be
set to 0; if (P(old) == 0), P(new) will be set to 1. We can express this whole
mess neatly with XOR as it turns out (if you understand XOR, this will now make sense to you): P(new) = (C(old) XOR C(new)) XOR P(old). Because we are dealing with blocks, not bits, we perform this calculation over all the bits in the block (e.g., 4096 bytes in each block multiplied by 8 bits per byte). Thus, in most cases, the new block will be different from the old block, and thus the new parity block will be too. You should now be able to figure out when we would use the additive parity calculation and when we would use the subtractive method: think about how many disks would need to be in the system so that the additive method performs fewer I/Os than the subtractive method, and where the cross-over point lies (a sketch of the subtractive method follows below).
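Here is a brief Python sketch of the subtractive method applied over whole blocks (the helper names are invented for illustration): the new parity is simply the old parity XORed with whatever changed between the old and new data block.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def subtractive_update(old_data, new_data, old_parity):
    # P(new) = (C(old) XOR C(new)) XOR P(old), done bit-by-bit over the block.
    return xor_bytes(xor_bytes(old_data, new_data), old_parity)

# The C2 example above: old bit 1, new bit 0, old parity 0 -> new parity 1.
print(subtractive_update(b"\x01", b"\x00", b"\x00"))   # b'\x01'

# Cost accounting: each small write needs 4 I/Os here (read old data, read
# old parity, write new data, write new parity). The additive method instead
# reads the other N - 2 data blocks and then writes data and parity, roughly
# N I/Os in total, so the subtractive method tends to win once the array
# grows beyond a handful of disks.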
For this performance analysis, let us assume we are using the subtractive method.
Thus, for each write, the RAID has to perform 4 physical I/Os (two reads and
two writes). Now imagine there are lots of writes submitted to the RAID; how
many can RAID-4 perform in parallel? To understand, let us again look at the
RAID-4 layout (Table 38.5).
Disk 0   Disk 1   Disk 2   Disk 3   Disk 4
0        1        2        3        P0
*4       5        6        7        +P1
8        9        10       11       P2
12       *13      14       15       +P3
Table 38.5: Example: Writes to Blocks 4 and 13, and Their Respective Parity Blocks
Now imagine there were 2 small writes
submitted to the RAID-4 at about the same time, to blocks 4 and 13 (marked with * in the diagram). The data for those blocks is on disks 0 and 1, and thus the read and write to the data could happen in
parallel, which is good. The problem that arises is with the parity disk; both
the requests have to read the related parity blocks for 4 and 13, parity blocks
1 and 3 (marked with +). Hopefully, the issue is now clear: the parity disk is
a bottleneck under this type of workload; we sometimes thus call this the small-write
problem for parity based RAIDs. Thus, even though the data disks could be
accessed in parallel, the parity disk prevents any parallelism from
materializing; all writes to the system will be serialized because of the
parity disk. Because the parity disk has to perform two I/Os (one read, one
write) per logical I/O, we can compute the performance of small random writes
in RAID-4 by computing the parity disk’s performance on those two I/Os, and thus we achieve (R/2) MB/s. RAID-4
throughput under random small writes is terrible; it does not improve as you
add disks to the system. We
conclude by analyzing I/O latency in RAID-4. As you now know, a single read
(assuming no failure) is just mapped to a single disk, and thus its latency is equivalent to the
latency of a single disk request. The latency of a single write requires two
reads and then two writes; the reads can happen in parallel, as can the
writes, and thus total latency is about twice that of a single disk (with some
differences because we have to wait for both reads to complete and thus get
the worst-case positioning time, but then the updates don’t incur seek cost and thus may see a better-than-average positioning cost).
RAID Level
5: Rotating Parity
To address the small-write problem (at
least, partially), Patterson, Gibson, and Katz introduced RAID-5. RAID-5 works
almost identically to RAID-4,
except that it rotates the parity block across drives (Table 38.6).
Disk 0   Disk 1   Disk 2   Disk 3   Disk 4
0        1        2        3        P0
5        6        7        P1       4
10       11       P2       8        9
15       P3       12       13       14
P4       16       17       18       19
Table 38.6: RAID-5 with Rotated Parity
As you can see, the parity block for each stripe is now rotated across the
disks, in order to remove the parity-disk bottleneck for RAID-4.
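The rotation is easy to capture in code. The sketch below uses a placement rule inferred from Table 38.6 (an assumption for illustration, not a prescribed standard) to map a logical block to its data disk and to the parity disk of its stripe.

N = 5                                  # total disks in the array

def locate(block):
    """Return (data_disk, parity_disk, stripe) for a logical block."""
    stripe = block // (N - 1)          # N - 1 data blocks per stripe
    parity_disk = (N - 1 - stripe) % N # parity rotates one disk left per stripe
    offset = block % (N - 1)
    data_disk = (parity_disk + 1 + offset) % N
    return data_disk, parity_disk, stripe

print(locate(1))    # (1, 4, 0): block 1 on disk 1, parity P0 on disk 4
print(locate(10))   # (0, 2, 2): block 10 on disk 0, parity P2 on disk 2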
RAID-5 Analysis
Much of the analysis for RAID-5 is
identical to RAID-4. For example, the effective capacity and failure tolerance
of the two levels are identical. So are sequential read and write performance.
The latency of a single request (whether a read or a write) is also the same as
RAID-4. Random read performance is a little better, because we can utilize all of the disks. Finally, random write performance improves noticeably over RAID-4, as it allows for parallelism across requests. Imagine a write to block 1 and a write to block 10; this will turn into requests to disk 1 and disk 4 (for block 1 and its parity) and requests to disk 0 and disk 2 (for block 10 and its parity). Thus, they can proceed in parallel. In fact, we can generally assume that given a large number of random requests, we will be able to keep all the disks about evenly busy. If that is the case, then our total bandwidth for small writes will be (N/4) · R MB/s. The factor of four loss is due to the fact that each RAID-5 write still generates 4 total I/O operations, which is simply the cost of using parity-based RAID.
                    RAID-0      RAID-1             RAID-4        RAID-5
Capacity            N           N/2                N − 1         N − 1
Reliability         0           1 (for sure);      1             1
                                N/2 (if lucky)
Throughput
  Sequential Read   N · S       (N/2) · S          (N − 1) · S   (N − 1) · S
  Sequential Write  N · S       (N/2) · S          (N − 1) · S   (N − 1) · S
  Random Read       N · R       N · R              (N − 1) · R   N · R
  Random Write      N · R       (N/2) · R          (1/2) · R     (N/4) · R
Latency
  Read              D           D                  D             D
  Write             D           D                  2D            2D
Table 38.7: RAID Capacity, Reliability, and Performance (throughput in MB/s; D is the latency of a single disk request)
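To see what these formulas mean in practice, the following Python sketch plugs in illustrative numbers (N, S, and R here are assumptions, not measurements) and prints the resulting throughput for each level.

# Assume 8 disks, each delivering 200 MB/s sequentially and 2 MB/s on random I/O.
N, S, R = 8, 200.0, 2.0

throughput = {
    "RAID-0": {"seq_read": N * S, "seq_write": N * S,
               "rand_read": N * R, "rand_write": N * R},
    "RAID-1": {"seq_read": N / 2 * S, "seq_write": N / 2 * S,
               "rand_read": N * R, "rand_write": N / 2 * R},
    "RAID-4": {"seq_read": (N - 1) * S, "seq_write": (N - 1) * S,
               "rand_read": (N - 1) * R, "rand_write": R / 2},
    "RAID-5": {"seq_read": (N - 1) * S, "seq_write": (N - 1) * S,
               "rand_read": N * R, "rand_write": N / 4 * R},
}

for level, t in throughput.items():
    print(level + ": " + ", ".join(f"{k}={v:.0f} MB/s" for k, v in t.items()))

# Note how RAID-4 random writes stay at R/2 = 1 MB/s no matter how many disks
# are added, while RAID-5 random writes scale as (N/4) * R.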
Because RAID-5 is basically identical to
RAID-4 except in the few cases where it is better, it has almost completely replaced
RAID-4 in the marketplace. The only place where it has not is in systems that
know they will never perform anything other than a large write, thus avoiding the
small write problem altogether [HLM94]; in those cases, RAID-4 is sometimes used
as it is slightly simpler to build.
RAID Level 6
For a RAID-6 array with n disks:
Ø Data in each stripe is divided into n − 2 strips
Ø Two pieces of parity information are computed over the n − 2 strips (e.g., using a Reed-Solomon code)
Ø The two parity strips are stored on different disks, so the array can survive the loss of any two disks (a small capacity sketch follows below)
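A quick illustrative calculation of the RAID-6 space overhead (the disk count and size below are assumptions, not figures from the text): with n disks, n − 2 hold data and 2 hold parity for each stripe.

n, disk_size_tb = 8, 4                  # assumed array: 8 disks of 4 TB each
usable_tb = (n - 2) * disk_size_tb      # 24 TB of user data
overhead = 2 / n                        # 25% of raw capacity goes to parity
print(usable_tb, f"{overhead:.0%}")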
Raid Advantage
Ø RAID provides an environment of highly
reliable, fault tolerant disk drive sub-systems
Ø RAID extends the life of disk drives, as the controller spreads the I/O load evenly amongst all the drives in the array, reducing the risk of a “single point of failure”
Ø A “successful” RAID environment includes
reliable robust disk drives
Ø Each RAID environment will include multiple disk drives
Ø Mylex PCI RAID controllers can attach as many as 45 drives per RAID controller, with up to 16 controllers per system, for an astonishing 720 disk drives per system configuration
Ø Mylex external RAID controllers can attach as many as 90 drives per storage subsystem and can reside as part of a SAN (Storage Area Network)
Ø Capacities range from terabytes to
petabytes!
REFERENCE: Lecture slides (PPT) and the book chapter “Redundant Arrays of Inexpensive Disks (RAIDs)”.