
Friday, January 4, 2013

Reducing cache miss penalty and miss rate


Reducing Cache Miss Penalty
            
First Miss Penalty Reduction Technique: Multi-Level Caches

           Many techniques to reduce miss penalty affect the CPU. This technique ignores
the CPU, concentrating on the interface between the cache and main memory.
·         The first-level cache can be small enough to match the clock cycle time of the fast CPU.
·         The second-level cache can be large enough to capture many accesses that would go to main memory, thereby lessening the effective miss penalty.
              Using L1 and L2 to refer, respectively, to a first-level and a second-level cache, the original formula is

                        Average memory access time = Hit time_L1 + Miss rate_L1 × Miss penalty_L1

                        where Miss penalty_L1 = Hit time_L2 + Miss rate_L2 × Miss penalty_L2

                        so Average memory access time = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)


·         Local miss rate—This rate is simply the number of misses in a cache divided by the total number of memory accesses to this cache. As you would expect, for the first-level cache it is equal to Miss rate_L1 and for the second-level cache it is Miss rate_L2.

·         Global miss rate—The number of misses in the cache divided by the total number of memory accesses generated by the CPU. Using the terms above, the global miss rate for the first-level cache is still just Miss rate_L1, but for the second-level cache it is Miss rate_L1 × Miss rate_L2.
                
                           Average memory stalls per instruction = Misses per instruction_L1 × Hit time_L2 + Misses per instruction_L2 × Miss penalty_L2
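The short C fragment below plugs hypothetical numbers into these formulas (a 1-cycle L1 hit, 10-cycle L2 hit, 4% L1 miss rate, 40% local L2 miss rate, 100-cycle memory penalty); the values are invented for the sketch, not measurements of any machine.

        #include <stdio.h>

        int main(void) {
            /* Hypothetical parameters, chosen only to illustrate the formulas. */
            double hit_time_L1 = 1.0;        /* clock cycles                  */
            double hit_time_L2 = 10.0;       /* clock cycles                  */
            double miss_rate_L1 = 0.04;      /* local = global for L1         */
            double miss_rate_L2 = 0.40;      /* local miss rate of L2         */
            double miss_penalty_L2 = 100.0;  /* clock cycles to main memory   */

            /* Miss penalty seen by L1: probe L2, plus L2's own misses. */
            double miss_penalty_L1 = hit_time_L2 + miss_rate_L2 * miss_penalty_L2;

            /* Average memory access time for the two-level hierarchy. */
            double amat = hit_time_L1 + miss_rate_L1 * miss_penalty_L1;

            /* Global L2 miss rate: fraction of all CPU accesses missing in both. */
            double global_miss_L2 = miss_rate_L1 * miss_rate_L2;

            printf("Miss penalty_L1     = %.1f cycles\n", miss_penalty_L1);
            printf("Average access time = %.2f cycles\n", amat);
            printf("Global L2 miss rate = %.3f\n", global_miss_L2);
            return 0;
        }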

Second Miss Penalty Reduction Technique: Critical Word First and Early Restart

·         Critical word first—Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Critical-word-first fetch is also called wrapped fetch and requested word first.
·          Early restart—Fetch the words in normal order, but as soon as the requested
             word of the block arrives, send it to the CPU and let the CPU continue execution.

Third Miss Penalty Reduction Technique: Giving Priority to Read Misses over Writes

·         This optimization serves reads before writes have been completed.
·         Write buffers, however, do complicate memory accesses in that they might hold the updated value of a location needed on a read miss.
·         The alternative is to check the contents of the write buffer on a read miss, and if there are no conflicts and the memory system is available, let the read miss continue.
·         The cost of writes by the processor in a write-back cache can also be reduced. Suppose a read miss will replace a dirty memory block. Instead of writing the dirty block to memory and then reading memory, we could copy the dirty block to a buffer, then read memory, and then write memory. This way the processor read, for which the processor is probably waiting, will finish sooner.
·         Similar to the situation above, if a read miss occurs, the processor can either stall until the buffer is empty or check the addresses of the words in the buffer for conflicts; a sketch of such a check appears after this list.
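A minimal sketch of that write-buffer check, assuming a four-entry buffer of single-word writes; the layout and the helper name forward_from_write_buffer are invented for illustration, not taken from any real controller.

        #include <stdbool.h>
        #include <stdint.h>

        #define WB_ENTRIES 4

        /* One pending write: target address, data, and a valid flag. */
        struct wb_entry { uint32_t addr; uint32_t data; bool valid; };
        static struct wb_entry write_buffer[WB_ENTRIES];

        /* On a read miss, scan the write buffer.  If the missed address is
         * held by a pending write, forward that newer value instead of
         * stalling until the buffer drains; otherwise the read miss may be
         * served by memory ahead of the buffered writes. */
        bool forward_from_write_buffer(uint32_t addr, uint32_t *data_out)
        {
            for (int i = 0; i < WB_ENTRIES; i++) {
                if (write_buffer[i].valid && write_buffer[i].addr == addr) {
                    *data_out = write_buffer[i].data;   /* conflict: forward */
                    return true;
                }
            }
            return false;   /* no conflict: let the read bypass the writes */
        }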

Fourth Miss Penalty Reduction Technique: Merging Write Buffer
·         This technique also involves write buffers, this time improving their efficiency. Write-through caches rely on write buffers, since all stores must be sent to the next lower level of the hierarchy.

·         As mentioned above, even write-back caches use a simple buffer when a block is replaced. If the write buffer is empty, the data and the full address are written in the buffer.
·         The write is finished from the CPU's perspective; the CPU continues working while the write buffer prepares to write the word to memory. If the buffer contains other modified blocks, the addresses can be checked to see if the address of this new data matches the address of a valid write buffer entry.
·         If so, the new data are combined with that entry; this combining is called write merging.
·         If the buffer is full and there is no address match, the cache (and CPU) must wait until the buffer has an empty entry. This optimization uses memory more efficiently, since multiword writes are usually faster than writes performed one word at a time. A sketch of the merging check follows this list.
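A minimal sketch of write merging, assuming word-addressed stores and four-word buffer entries; wb_insert and the entry layout are hypothetical names used only for this illustration.

        #include <stdbool.h>
        #include <stdint.h>

        #define WB_ENTRIES      4
        #define WORDS_PER_ENTRY 4   /* one entry covers a four-word block */

        struct wb_entry {
            uint32_t block_addr;               /* block this entry covers   */
            uint32_t word[WORDS_PER_ENTRY];
            uint8_t  word_valid;               /* one bit per buffered word */
            bool     valid;
        };
        static struct wb_entry wb[WB_ENTRIES];

        /* Place a one-word store into the buffer.  If an entry for the same
         * block exists, merge the word into it (write merging); otherwise
         * allocate a free entry.  Returns false when the buffer is full and
         * the CPU must wait for an entry to drain to memory. */
        bool wb_insert(uint32_t word_addr, uint32_t data)
        {
            uint32_t block  = word_addr / WORDS_PER_ENTRY;
            uint32_t offset = word_addr % WORDS_PER_ENTRY;

            for (int i = 0; i < WB_ENTRIES; i++) {
                if (wb[i].valid && wb[i].block_addr == block) {
                    wb[i].word[offset] = data;          /* merge into entry  */
                    wb[i].word_valid |= 1u << offset;
                    return true;
                }
            }
            for (int i = 0; i < WB_ENTRIES; i++) {
                if (!wb[i].valid) {                     /* take a free entry */
                    wb[i].valid = true;
                    wb[i].block_addr = block;
                    wb[i].word[offset] = data;
                    wb[i].word_valid = 1u << offset;
                    return true;
                }
            }
            return false;                               /* buffer full: stall */
        }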
        
Fifth Miss Penalty Reduction Technique: Victim Caches
·         One approach to lower miss penalty is to remember what was discarded in case it is needed again. Since the discarded data has already been fetched, it can be used again at small cost.
·         Such recycling requires a small, fully associative cache between a cache and its refill path. This victim cache contains only blocks that are discarded from the cache because of a miss (victims), and they are checked on a miss to see if they have the desired data before going to the next lower-level memory; a lookup sketch follows this list.
·         If it is found there, the victim block and cache block are swapped.
·         Depending on the program, a four-entry victim cache might remove one quarter of the misses in a 4-KB direct-mapped data cache.
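A minimal sketch of the victim-cache probe and swap, assuming a four-entry fully associative victim cache with eight-word blocks; victim_probe and the data layout are invented for this sketch.

        #include <stdbool.h>
        #include <stdint.h>
        #include <string.h>

        #define VICTIM_ENTRIES 4
        #define BLOCK_WORDS    8

        /* A tiny fully associative victim cache: full-address tag plus data. */
        struct victim_entry {
            uint32_t tag;
            uint32_t data[BLOCK_WORDS];
            bool     valid;
        };
        static struct victim_entry victim[VICTIM_ENTRIES];

        /* On a main-cache miss, probe the victim cache.  On a victim hit, the
         * victim block and the block being evicted from the main cache are
         * swapped, so the miss is serviced without going to the next
         * lower-level memory. */
        bool victim_probe(uint32_t miss_tag,
                          uint32_t evict_tag, uint32_t evict_data[BLOCK_WORDS],
                          uint32_t found_data[BLOCK_WORDS])
        {
            for (int i = 0; i < VICTIM_ENTRIES; i++) {
                if (victim[i].valid && victim[i].tag == miss_tag) {
                    memcpy(found_data, victim[i].data, sizeof victim[i].data);
                    victim[i].tag = evict_tag;              /* swap in victim */
                    memcpy(victim[i].data, evict_data, sizeof victim[i].data);
                    return true;
                }
            }
            return false;   /* victim miss: go to the next lower-level memory */
        }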

Reducing Miss Rate

We first start with a model that sorts all misses into three simple categories:

·         Compulsory—The very first access to a block cannot be in the cache, so the block must be brought into the cache. These are also called cold start misses or first reference misses.
·         Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses (in addition to compulsory misses) will occur because of blocks being discarded and later retrieved.
·         Conflict—If the block placement strategy is set associative or direct mapped, conflict misses will occur because a block may be discarded and later retrieved if too many blocks map to its set. These misses are also called collision misses or interference misses.
           
First Miss Rate Reduction Technique: Larger Block Size
         
·         The simplest way to reduce miss rate is to increase the block size.
·         Larger block sizes will reduce compulsory misses.
·         This reduction occurs because the principle of locality has two components: temporal locality and spatial locality.
·         Larger blocks take advantage of spatial locality.
·         At the same time, larger blocks increase the miss penalty. Since they reduce
the number of blocks in the cache, larger blocks may increase conflict misses and
even capacity misses if the cache is small.


Second Miss Rate Reduction Technique: Larger Caches
  
The most obvious way to reduce capacity misses is to increase the size of the cache. The obvious drawback is longer hit time and higher cost. This technique has been especially popular in off-chip caches: the size of second- or third-level caches in 2001 equals the size of main memory in desktop computers.

Third Miss Rate Reduction Technique: Higher Associativity
            
  There are two general rules of thumb.
·         The first is that eight-way set associative is for practical purposes as effective in reducing misses for these cache sizes as fully associative. You can see the difference by comparing the 8-way entries to the capacity miss column, since capacity misses are calculated using a fully associative cache.
·         The second observation, called the 2:1 cache rule of thumb, is that a direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2. This held for cache sizes less than 128 KB.
·         Like many of these examples, improving one aspect of the average memory access time comes at the expense of another. Increasing block size reduces miss rate while increasing miss penalty, and greater associativity can come at the cost of increased hit time.
·         Hence, the pressure of a fast processor clock cycle encourages simple cache designs, but the increasing miss penalty rewards associativity, as the following example suggests.
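The original example is not reproduced here, but the fragment below makes the same point with made-up numbers: the two-way design is assumed to lower the miss rate while stretching the hit time, so either design can come out ahead depending on the miss penalty.

        #include <stdio.h>

        int main(void) {
            /* Hypothetical numbers, chosen only to illustrate the trade-off. */
            double miss_penalty = 75.0;                 /* cycles, both designs */

            double hit_dm   = 1.00, miss_dm   = 0.014;  /* direct mapped        */
            double hit_2way = 1.10, miss_2way = 0.010;  /* 2-way: slower hit,
                                                           lower miss rate      */

            double amat_dm   = hit_dm   + miss_dm   * miss_penalty;
            double amat_2way = hit_2way + miss_2way * miss_penalty;

            printf("AMAT direct mapped: %.3f cycles\n", amat_dm);
            printf("AMAT 2-way        : %.3f cycles\n", amat_2way);
            return 0;
        }

With these numbers the two-way cache wins (1.85 versus 2.05 cycles), but with a 20-cycle miss penalty the direct-mapped cache would win; the crossover depends entirely on the penalty.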


Fourth Miss Rate Reduction Technique: Way Prediction and Pseudo-Associative Caches
         
·         In way prediction, extra bits are kept in the cache to predict the way, or block within the set, of the next cache access. This prediction means the multiplexor is set early to select the desired block, and only a single tag comparison is performed that clock cycle. A miss results in checking the other blocks for matches in subsequent clock cycles.
·         A related approach is called pseudo-associative or column associative.
·         Accesses proceed just as in a direct-mapped cache for a hit. On a miss, however, before going to the next lower level of the memory hierarchy, a second cache entry is checked to see if it matches there. A simple way is to invert the most significant bit of the index field to find the other block in the pseudo set; a sketch of this appears after the list.
·         Pseudo-associative caches then have one fast and one slow hit time, corresponding to a regular hit and a pseudo hit, in addition to the miss penalty.
·         One danger is that many fast hits of the direct-mapped cache could become slow hits in the pseudo-associative cache; the performance would then be degraded by this optimization. Hence, it is important to indicate for each set which block should be the fast hit and which should be the slow one. One way is simply to make the upper one fast and swap the contents of the blocks. Another danger is that the miss penalty may become slightly longer, adding the time to check another cache entry.
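A minimal sketch of the second probe, assuming a direct-mapped cache with 1024 sets; pseudo_index is a hypothetical helper that simply flips the top bit of the index.

        #include <stdint.h>

        #define INDEX_BITS 10u   /* log2 of the 1024 sets assumed here */

        /* The "pseudo set" checked on a miss is found by inverting the most
         * significant bit of the index field of the address. */
        static inline uint32_t pseudo_index(uint32_t index)
        {
            return index ^ (1u << (INDEX_BITS - 1));   /* flip top index bit */
        }

        /* Example: a miss in set 3 probes set 3 + 512 on the second try. */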

Fifth Miss Rate Reduction Technique: Compiler Optimizations

Code and Data Rearrangement
·         Code can easily be rearranged without affecting correctness; for example, reordering the procedures of a program might reduce instruction miss rates by reducing conflict misses. Reordering the instructions reduced misses by 50% for a 2-KB direct-mapped instruction cache with 4-byte blocks, and by 75% in an 8-KB cache.
·         Another code optimization aims for better efficiency from long cache blocks. Aligning basic blocks so that the entry point is at the beginning of a cache block decreases the chance of a cache miss for sequential code.
·         Data have even fewer restrictions on location than code. The goal of such transformations is to improve the spatial and temporal locality of the data.
·         For example, array calculations can be changed to operate on all the data in a cache block rather than blindly striding through arrays in the order the programmer happened to place the loop; a small layout sketch follows.
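One such transformation, sketched below with hypothetical arrays, merges two arrays that are always accessed together into a single array of structures, so that the related fields share a cache block and spatial locality improves.

        #define SIZE 10000

        /* Before: key[i] and val[i] live in separate arrays, so touching
         * both for one element may pull in two different cache blocks. */
        int key[SIZE];
        int val[SIZE];

        /* After: merging the arrays puts each key next to its value, so a
         * single cache block usually serves both accesses. */
        struct merged {
            int key;
            int val;
        };
        struct merged merged_array[SIZE];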

Loop Interchange
·         Some programs have nested loops that access data in memory in nonsequential order.
·         Simply exchanging the nesting of the loops can make the code access the data in the order it is stored.
·         Assuming the arrays do not fit in the cache, this technique reduces misses by improving spatial locality; reordering maximizes use of the data in a cache block before it is discarded.
        /* Before */
        for (j = 0; j < 100; j = j + 1)
            for (i = 0; i < 5000; i = i + 1)
                x[i][j] = 2 * x[i][j];

        /* After */
        for (i = 0; i < 5000; i = i + 1)
            for (j = 0; j < 100; j = j + 1)
                x[i][j] = 2 * x[i][j];
·         The original code would skip through memory in strides of 100 words, while the
revised version accesses all the words in one cache block before going to the next
block.
·         This optimization improves cache performance without affecting the number of instructions executed, unlike the prior example.

Blocking
·         This optimization tries to reduce misses via improved temporal locality. We are again dealing with multiple arrays, with some arrays accessed by rows and some by columns.
·         Storing the arrays row by row (row major order) or column by column (column major order) does not solve the problem, because both rows and columns are used in every iteration of the loop. Such orthogonal accesses mean that transformations such as loop interchange are not helpful.
·         Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices or blocks.
·         The goal is to maximize accesses to the data loaded into the cache before the data are replaced. The code example below, which performs matrix multiplication, helps motivate the optimization:
        /* Before */
        for (i = 0; i < N; i = i + 1)
            for (j = 0; j < N; j = j + 1)
            {
                r = 0;
                for (k = 0; k < N; k = k + 1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = r;
            };



·         The two inner loops read all N-by-N elements of z, read the same N elements in a row of y repeatedly, and write one row of N elements of x.
·         The number of capacity misses clearly depends on N and the size of the cache. If the cache can hold all three N-by-N matrices, then all is well, provided there are no cache conflicts. If the cache can hold one N-by-N matrix and one row of N elements, then at least the i-th row of y and the array z may stay in the cache. A blocked version of the code is sketched below.
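For reference, a blocked version of the multiplication in the same fragment style, with blocking factor B chosen so that the submatrices touched by the inner loops fit in the cache; it assumes x starts zeroed (so the accumulation x[i][j] = x[i][j] + r matches the original x[i][j] = r) and that min(a, b) returns the smaller of its arguments.

        /* After (blocked) */
        for (jj = 0; jj < N; jj = jj + B)
            for (kk = 0; kk < N; kk = kk + B)
                for (i = 0; i < N; i = i + 1)
                    for (j = jj; j < min(jj + B, N); j = j + 1)
                    {
                        r = 0;
                        for (k = kk; k < min(kk + B, N); k = k + 1)
                            r = r + y[i][k] * z[k][j];
                        x[i][j] = x[i][j] + r;
                    };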
      
Reducing Cache Miss Penalty or Miss Rate via Parallelism
         
First Miss Penalty/Miss Rate Reduction Technique: Nonblocking Caches to Increase Cache Bandwidth

·         For pipelined computers that allow out-of-order completion, the processor need not stall on a data cache miss.
·         For example, the processor could continue fetching instructions from the instruction cache while waiting for the data cache to return the missing data.
·         A nonblocking cache or lockup-free cache escalates the potential benefits of such a scheme by allowing the data cache to continue to supply cache hits during a miss.
·         This “hit under miss” optimization reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the processor.
·         A subtle and complex option is that the cache may further lower the effective miss penalty if it can overlap multiple misses: a “hit under multiple miss” or “miss under miss” optimization. The second option is beneficial only if the memory system can service multiple misses.
·         The real difficulty with performance evaluation of nonblocking caches is that a cache miss does not necessarily stall the processor. In this case, it is difficult to judge the impact of any single miss, and hence difficult to calculate the average memory access time. 

Second Miss Penalty/Miss Rate Reduction Technique: Hardware Prefetching of Instructions and Data to Reduce Miss Penalty or Miss Rate
·         Nonblocking caches effectively reduce the miss penalty by overlapping execution with memory access.
·         Another approach is to prefetch items before the processor requests them. Both instructions and data can be prefetched, either directly into the caches or into an external buffer that can be more quickly accessed than main memory.
·         Instruction prefetch is frequently done in hardware outside of the cache.
·         Typically, the processor fetches two blocks on a miss: the requested block and the next consecutive block.
·         The requested block is placed in the instruction cache when it returns, and the prefetched block is placed into the instruction stream buffer. If the requested block is present in the instruction stream buffer, the original cache request is canceled, the block is read from the stream buffer, and the next prefetch request is issued; a control sketch follows this list.
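A minimal sketch of that control decision, assuming a one-block stream buffer; the block addressing and the fetch_block placeholder are invented for illustration.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        /* One-block instruction stream buffer; only the prefetched block's
         * address is tracked here, since the sketch shows just the control. */
        struct stream_buffer { uint32_t block_addr; bool valid; };
        static struct stream_buffer sbuf;

        /* Stand-in for a memory access in this sketch. */
        static void fetch_block(uint32_t block_addr, const char *why)
        {
            printf("fetch block %u (%s)\n", (unsigned)block_addr, why);
        }

        /* On an instruction-cache miss for block a: if the stream buffer
         * already holds a, the original cache request is cancelled and the
         * block is read from the buffer; otherwise a is fetched for the
         * cache.  Either way, block a+1 is prefetched into the buffer. */
        static void icache_miss(uint32_t a)
        {
            if (!(sbuf.valid && sbuf.block_addr == a))
                fetch_block(a, "demand miss");
            fetch_block(a + 1, "prefetch into stream buffer");
            sbuf.block_addr = a + 1;
            sbuf.valid = true;
        }

        int main(void) { icache_miss(100); icache_miss(101); return 0; }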

Third Miss Penalty/Miss Rate Reduction Technique: Compiler-Controlled Prefetching to Reduce Miss Penalty or Miss Rate
·         An alternative to hardware prefetching is for the compiler to insert prefetch instructions that request the data before they are needed. There are two flavors:
·         Register prefetch will load the value into a register.
·         Cache prefetch loads data only into the cache and not the register; an example follows.
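As one concrete, compiler-dependent illustration, GCC and Clang expose a cache-prefetch intrinsic, __builtin_prefetch; the loop below asks for data 16 iterations ahead, a distance chosen arbitrarily for this sketch. Because the prefetch is nonbinding and nonfaulting, a useless prefetch costs only memory bandwidth, not correctness.

        /* Cache prefetch: the data are brought toward the cache ahead of
         * use, but not loaded into a register. */
        void scale(double *a, int n)
        {
            for (int i = 0; i < n; i++) {
                if (i + 16 < n)
                    __builtin_prefetch(&a[i + 16], 0, 1);  /* read, low reuse */
                a[i] = 2.0 * a[i];
            }
        }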
