
Friday, January 4, 2013

Reducing hit time


First Hit Time Reduction: Small and Simple Caches to Reduce Hit Time

·         A time-consuming portion of a cache hit is using the index portion of the address to read the tag memory and then comparing it to the address. Smaller hardware can be faster, so a small cache helps hit time. It is also critical to keep an L2 cache small enough to fit on the same chip as the processor to avoid the time penalty of going off chip.
·         The second suggestion is to keep the cache simple, such as by using direct mapping. One benefit of direct-mapped caches is that the designer can overlap the tag check with the transmission of the data, which effectively reduces hit time (a C sketch of this lookup follows this list).
·         The pressure of a fast clock cycle encourages small and simple cache designs for first-level caches.
·         For lower-level caches, some designs strike a compromise by keeping the tags on chip and the data off chip, promising a fast tag check, yet providing the greater capacity of separate memory chips.
·         Although the amount of on-chip cache has increased with new generations of microprocessors, the size of the L1 caches has recently not increased between generations.
·         The L1 caches are the same size for three generations of AMD microprocessors: K6, Athlon, and Opteron. The emphasis is on fast clock rate while hiding L1 misses with dynamic execution and using L2 caches to avoid going to memory.
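
As a concrete illustration of why a direct-mapped hit path is short, here is a minimal C sketch of a direct-mapped lookup. The geometry (32 KB, 64-byte blocks) and names such as cache_line and is_hit are illustrative assumptions, not taken from any particular processor.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative geometry: 512 sets of 64-byte blocks = a 32 KB direct-mapped cache. */
#define OFFSET_BITS 6                    /* 64-byte block */
#define INDEX_BITS  9                    /* 512 sets      */
#define NUM_SETS    (1u << INDEX_BITS)

struct cache_line {
    bool     valid;
    uint32_t tag;
    uint8_t  data[1u << OFFSET_BITS];
};

static struct cache_line cache[NUM_SETS];

/* The hit path is one indexed read plus one tag compare. Because a
   direct-mapped cache has a single candidate line per address, the data
   can be sent onward while this comparison completes. */
bool is_hit(uint32_t addr)
{
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    return cache[index].valid && cache[index].tag == tag;
}

A set-associative cache would instead need one comparator per way plus a multiplexor before forwarding data, which is precisely the delay that way prediction (next) tries to hide.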

Second Hit Time Reduction: Way Prediction to Reduce Hit Time
·         In way prediction, extra bits are kept in the cache to predict the way, or block within the set, of the next cache access.
·         This prediction means the multiplexor is set early to select the desired block, and only a single tag comparison is performed that clock cycle in parallel with reading the cache data. A miss results in checking the other blocks for matches in the next clock cycle.
·         Added to each block of the cache are block predictor bits. The bits select which block to try on the next cache access. If the prediction is correct, the cache access latency is the fast hit time; if not, the cache tries the other block, changes the way predictor, and incurs one extra clock cycle of latency (a sketch follows this list).
·         Simulations suggested that set prediction accuracy is in excess of 85% for a two-way set, so way prediction saves pipeline stages more than 85% of the time.
·         Way prediction is a good match to speculative processors, since they must already undo actions when speculation is unsuccessful. The Pentium 4 uses way prediction.
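
As a rough illustration, the C sketch below models one plausible form of the mechanism: a two-way set-associative cache with a single prediction bit per set. The names (predicted_way, access) and the per-set granularity are assumptions for illustration; real designs such as the Pentium 4's differ in detail.

#include <stdbool.h>
#include <stdint.h>

#define SETS 256
#define WAYS 2

struct way { bool valid; uint32_t tag; };

static struct way cache[SETS][WAYS];
static uint8_t    predicted_way[SETS];  /* the extra prediction bit per set */

/* Returns the hit latency in clock cycles: 1 when the prediction is
   right, 2 when the other way hits, 0 to signal a genuine miss. */
int access(uint32_t set, uint32_t tag, bool *hit)
{
    uint8_t p = predicted_way[set];

    /* Fast path: only the predicted way's tag is compared, in parallel
       with reading its data, so the multiplexor is set early. */
    if (cache[set][p].valid && cache[set][p].tag == tag) {
        *hit = true;
        return 1;
    }

    /* Slow path: try the other way one cycle later and retrain the predictor. */
    uint8_t other = p ^ 1u;
    if (cache[set][other].valid && cache[set][other].tag == tag) {
        predicted_way[set] = other;
        *hit = true;
        return 2;
    }

    *hit = false;
    return 0;  /* miss: hand off to the next memory level */
}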

Third Hit Time Reduction: Trace Caches to Reduce Hit Time
·         A challenge in the effort to find lots of instruction-level parallelism is to find enough instructions every cycle without use dependencies.
·         To address this challenge, blocks in a trace cache contain dynamic traces of the executed instructions rather than static sequences of instructions as determined by layout in memory.
·         Hence, the branch prediction is folded into the cache and must be validated along with the addresses to have a valid fetch.
·         Trace caches have much more complicated address-mapping mechanisms, as the addresses are no longer aligned to power-of-two multiples of the word size. However, they can better utilize long blocks in the instruction cache (a data-structure sketch follows this list).

·         The downside of trace caches is that conditional branches making different choices result in the same instructions being part of separate traces, which each occupy space in the trace cache and lower its space efficiency.
·         Many optimizations are simple to understand and are widely used, but a trace cache is neither simple nor popular.
·         It is relatively expensive in area, power, and complexity compared to its benefits, so we believe trace caches are likely a one-time innovation. We include them because they appear in the popular Pentium 4.
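
To make the space-efficiency point concrete, here is a hypothetical C sketch of a trace cache line. Because a line is identified by its starting PC plus the branch outcomes under which the trace was built, the same static instructions can live in several lines at once. All names and sizes are illustrative.

#include <stdbool.h>
#include <stdint.h>

#define TRACE_LEN   16    /* instructions per trace (illustrative) */
#define TRACE_LINES 512

/* A trace is keyed by starting PC *and* the taken/not-taken pattern of
   its branches -- not by an aligned, power-of-two block address, which
   is why the address mapping is more complicated. */
struct trace_line {
    bool     valid;
    uint32_t start_pc;
    uint16_t branch_dirs;       /* bitmask of branch outcomes in the trace */
    uint32_t insts[TRACE_LEN];  /* the dynamic instruction sequence        */
};

static struct trace_line tcache[TRACE_LINES];

const struct trace_line *lookup(uint32_t pc, uint16_t predicted_dirs)
{
    /* The fetch is valid only if the stored branch outcomes match the
       current branch prediction -- prediction is folded into the cache. */
    uint32_t i = pc % TRACE_LINES;
    if (tcache[i].valid &&
        tcache[i].start_pc == pc &&
        tcache[i].branch_dirs == predicted_dirs)
        return &tcache[i];
    return 0;  /* trace miss: fall back to the ordinary fetch path */
}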

Fourth Hit Time Reduction: Pipelined Cache Access to Increase Cache Bandwidth
·         This optimization is simply to pipeline cache access so that the effective latency of a first-level cache hit can be multiple clock cycles, giving a fast clock cycle time and high bandwidth but slow hits (a small worked example follows this list).
·         For example, the pipeline for the Pentium took 1 clock cycle to access the instruction cache, for the Pentium Pro through Pentium III it took 2 clocks, and for the Pentium 4 it takes 4 clocks.
·         This split increases the number of pipeline stages, leading to a greater penalty on mispredicted branches and more clock cycles between the issue of a load and the use of the data.
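
The bandwidth-versus-latency trade can be seen with simple arithmetic. The short C program below uses made-up numbers (a 4-stage cache pipeline, 8 back-to-back independent loads) purely to show the shape of the benefit.

#include <stdio.h>

/* Illustrative, not measured: a pipelined cache accepts a new access
   every cycle, so n independent loads overlap and finish in about
   n + (stages - 1) cycles instead of n * stages. */
#define STAGES 4

int main(void)
{
    int n = 8;  /* independent loads issued back to back */

    int pipelined   = n + (STAGES - 1);  /* accesses overlap in the pipe */
    int unpipelined = n * STAGES;        /* one access at a time         */

    printf("%d loads: %d cycles pipelined vs %d unpipelined\n",
           n, pipelined, unpipelined);
    return 0;
}

The per-hit latency is still 4 cycles either way; that is why the text counts slower hits and a longer load-use delay as the cost.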
