The
Basics of Striping
The process of taking a pool of data
and evenly dividing it up across a set of drives is called
"striping." To illustrate, let's look at a data warehouse that stores
information pertaining to foreign automobile sales for the United States.
Assume that our platform has four CPUs. Also, for simplicity's sake, assume our
warehouse only has information on Mercedes, Porsche, BMW and Volvo, and that it
has roughly an equivalent amount of information on each type of car, all of
which is stored in a single table called "Car_Sales." If we were to
put all the information on a single disk drive (assuming for a minute that it
would fit), then only one scan process would be able to read the table at a
time. The optimal solution is to spread the data (that is, stripe the data)
across at least four disks as shown in Figure 1.
Now, when we query Car_Sales, we can
use four scan processes, and each process will read one-fourth of the total
table, completing a query in approximately one-fourth of the time it would have
taken to scan the table without parallelism. We can continue extending this
principle even further by adding more CPUs, striping the data over more disk
drives and using more scan processes.
Striping can be performed by either
the hardware, the operating system or the database. Each one has its benefits
and drawbacks.
Hardware
Striping
This method of disk striping
involves purchasing specialized intelligent disk array technology which
includes additional hardware that automatically handles striping the data
across the multiple disks in the disk array. To the rest of the system, this
disk array usually looks like a single (albeit very fast) disk drive that has
the ability to simultaneously handle multiple I/O requests. The striping is
usually done in a round- robin fashion, which means that chunks of data
(usually 32K to 64K each) are distributed to disk drives similar to the way a
card dealer deals out a deck of cards. The benefit of using this technique is
that data is spread evenly over many physical devices, balancing the I/O load
across all the disks. This, therefore, minimizes the risk of having disk
"hot spots" which occur when data is requested from some drives much
more frequently than others. Another advantage is that these intelligent disk
arrays also automatically handle various RAID levels.
However, the hardware striping
solution is usually the most expensive method of achieving disk striping. Also,
the resulting parallelism is not necessarily what you would hope for. In
general, to maximize I/O throughput, you always want to have the disk head move
smoothly across the disk in one continuous motion, streaming the data back to
the system as it goes. Unfortunately, because the database is unaware of how
the data is actually striped (remember, with hardware striping the striping is intended
to be transparent to the system), the I/O requests issued by a single scan
thread will almost always reference data that is on multiple drives. This
problem affects each scan thread, so all the disks in the array are constantly
satisfying requests from multiple threads. The disk heads will have to
constantly seek back and forth, significantly lowering I/O performance.
Operating
System Striping
Operating system striping introduces
the concept of a "logical volume group," which (similar to hardware
striping) appears as a single disk device to the database. However, it actually
consists of pieces from multiple physical disk drives logically grouped
together to give the appearance of one physical device. The data is distributed
across the pieces of the various disks in a round- robin fashion. As with
hardware striping, this approach removes hot spots, and since there is no
special hardware required, this solution is cheaper. However, more CPU
resources are needed to manage the logical volume group. Also, it suffers from
the same head-seeking problem I discussed earlier, since a scan thread can only
issue a request to the logical volume group, not to a specific disk within that
group (see Figure 2).
Database
Striping
Of the three striping techniques
available, database striping is the easiest to employ and offers the best
performance when there is a smaller number of concurrent users (than in typical
OLTP applications) running parallel queries. Using this technique, a database
table is divided into a number of sections (called "fragments" or
"extents"), and each section is assigned to a specific drive. The
database then has the ability to assign a single scan process to a single
fragment/extent and, as a result, be assured that the head seek movement will
be minimized (because the scans are sequential and the disk head does not have
to be repositioned elsewhere on the disk platter to service another request).
The downside of database striping
lies in the fact that each fragment/extent is a separate operating system file.
Therefore, there will be many more data files to manage compared with operating
system striping or hardware striping, where the four separate sections would be
treated and managed as a single file. It's simply a tradeoff between
performance and ease of maintenance.
Conclusion
A scalable application must be
thought of as a "performance chain." All components in the chain must
be scalable for the entire application to be scalable. If any component is not
scalable, then a bottleneck exists in your performance chain, and your
application as a whole will have limited scalability. As a major component of
any application, the I/O subsystem must also be scalable. Striping is one of
the most effective techniques for removing disk hot spots, and it's required if
you want to be able to take advantage of I/O parallelism and have a highly
scalable I/O subsystem.
No comments:
Post a Comment