|
The Hardware Model
An ideal processor is one where all
constraints on ILP are removed. The only limits on ILP in such a processor
are those imposed by the actual data flows through either registers or
memory.
The assumptions made for an ideal or perfect
processor are as follows:
1.Register
renaming
—There are an infinite number of virtual
registers available, and hence all WAW and WAR hazards are avoided and an
unbounded number of instructions can begin execution simultaneously.
2.Branch
prediction
—Branch prediction is perfect. All conditional
branches are predicted exactly.
3.Jump
prediction
—All jumps (including jump register used for
return and computed jumps) are perfectly predicted. When combined with
perfect branch prediction, this is equivalent to having a processor with
perfect speculation and an unbounded buffer of instructions available for
execution.
4.Memory
address alias analysis
—All memory addresses are known exactly, and a
load can be moved before a store provided that the addresses are not
identical. Note that this implements perfect address alias analysis.
5.Perfect
caches
—All memory accesses take 1 clock cycle. In
practice, superscalar processors will typically consume large amounts of ILP
hiding cache misses, making these results highly optimistic.
To measure the available parallelism, a set of
programs was compiled and optimized with the standard MIPS optimizing
compilers. The programs were instrumented and executed to produce a trace of
the instruction and data references. Every instruction in the trace is then
scheduled as early as possible, limited only by the data dependences. Since a
trace is used, perfect branch prediction and perfect alias analysis are easy
to do. With these mechanisms, instructions may be
scheduled much earlier than they would
otherwise, moving across large numbers of instructions on which they are not
data dependent, including branches, since branches are perfectly predicted.
The effects of various assumptions are given
before looking at some ambitious but realizable processors.
Limitations on the Window Size and Maximum
Issue Count
To build a processor that even comes close to perfect branch
prediction and perfect alias analysis requires extensive dynamic analysis,
since static compile time schemes cannot be perfect. Of course, most realistic
dynamic schemes will not be perfect, but the use of dynamic schemes will
provide the ability to uncover parallelism that cannot be analyzed by static
compile time analysis. Thus, a dynamic processor might be able to more
closely match the amount of parallelism uncovered by our ideal processor.
|
The Effects of Realistic Branch and Jump
Prediction
Our
ideal processor assumes that branches can be perfectly predicted: The outcome
of any branch in the program is known before the first instruction is executed!
Of course, no real processor can ever achieve this.
We
assume a separate predictor is used for jumps. Jump predictors are important
primarily with the most accurate branch predictors, since the branch frequency
is higher and the accuracy of the branch predictors dominates.
The five levels of branch prediction
1.Perfect
—All branches and jumps are perfectly predicted at the start of
execution.
2.Tournament-based
branch predictor —The prediction scheme uses a correlating 2-bit
predictor and a noncorrelating 2-bit predictor together with a selector, hich
chooses the best predictor for each branch.
The Effects of Finite Registers
Our ideal processor eliminates all name
dependences among register references using an infinite set of virtual
registers. To date, the IBM Power5 has provided the largest numbers of virtual
registers: 88 additional floating-point and 88 additional integer registers, in
addition to the 64 registers available in the base architecture. All 240
registers are shared by two threads when executing in multithreading mode, and
all are available to a single thread when in single-thread mode.
The Effects of Imperfect Alias Analysis
Our optimal model assumes that it can perfectly
analyze all memory dependences, as well as eliminate all register name
dependences. Of course, perfect alias analysis is not possible in practice: The
analysis cannot be perfect at compile time, and it requires a potentially
unbounded number of comparisons at run time (since the number of simultaneous
memory references is unconstrained).
The three models are
1. Global/stack
perfect—This model does perfect predictions for global and stack
references and assumes all heap references conflict. This model represents an
idealized version of the best compiler-based analysis schemes currently in production.
Recent and ongoing research on alias analysis for pointers should improve the
handling of pointers to the heap in the future.
2. Inspection—This
model examines the accesses to see if they can be determined not to interfere
at compile time. For example, if an access uses R10 as a base register with an
offset of 20, then another access that uses R10 as a base register with an
offset of 100 cannot interfere, assuming R10 could not have changed. In
addition, addresses based on registers that point to different allocation areas
(such as the global area and the stack area) are assumed never to alias. This
analysis is similar to that performed by many existing commercial compilers,
though newer compilers can do better, at least for looporiented programs.
3. None—All
memory references are assumed to conflict. As you might expect, for the FORTRAN
programs (where no heap references exist), there is no difference between
perfect and global/stack perfect analysis.
No comments:
Post a Comment