Translate

Friday, January 4, 2013

Compiler techniques for exposing ILP

LECTURE:18

UNIT III Compiler techniques for exposing ILP
Basic Pipeline Scheduling and Loop Unrolling:
Ø  Pipeline stall, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction.
Ø  A compiler’s ability to perform this scheduling depends bothon the amount of ILP available in the program and on the latencies of the functional units in the pipeline.
Instruction producing result             Instruction using result                 Latency in clock cycles
FP ALU op                                                       Another FP ALU op                                         3
FP ALU op                                                          Store double                                                   2
Load double                                                            FP ALU op                                                 1
Load double                                                             Store double                                               0
Latencies of FP operations
Ø  We assume the standard five-stage integer pipeline, so that branches have a delay of 1 clock cycle. We assume that the functional units are fully pipelined or replicated (as many times as the pipeline depth), so that an operation of any type can be issued on every clock cycle and there are no structural hazards.

Ø  This example serves both to illustrate an important technique as well as to motivate the more powerful program transformations
Ø  We will rely on the following code segment,which adds a scalar to a vector:
for (i=1000; i>0; i=i–1)
x[i] = x[i] + s;
Ø        We can see that this loop is parallel by noticing that the body of each iteration is independent.
                  The first step is to translate the above segment to MIPS assembly language. In the following code segment, R1 is initially the address of the element in the array with the highest address, and F2 contains the scalar value s. Register R2 is precomputed, so that 8(R2) is the address of the last element to operate on.
                               Loop:   L.D                F0,0(R1)           ;  F0=array element
                                           ADD.D           F4,F0,F2          ;add scalar in   
                                             S.D               F4,0(R1)           ;store result              
                                            DADDUI      R1,R1,#-8        ;decrement pointer
                                                                                            ;8 bytes (per DW)
                                            BNE              R1,R2,Loop      ;branch R1!=R2

                     In the previous example, we complete one loop iteration and store back one array element every 7 clock cycles, but the actual work of operating on the array element takes just 3 (the load, add, and store) of those 7 clock cycles
                     The remaining 4 clock cycles consist of loop overhead—the DADDUI and BNE—and two stalls. To eliminate these 4 clock cycles we need to get more operations relative to the number of overhead instructions.
                     A simple scheme for increasing the number of instructions relative to the branch and overhead instructions is loop unrolling. Unrolling simply replicates the loop body multiple times, adjusting the loop termination code.
                     Loop unrolling can also be used to improve scheduling. Because it eliminates the branch, it allows instructions from different iterations to be scheduled together
                     In this case, we can eliminate the data use stalls by creating additional independent instructions within the loop body. In this case, we can eliminate the data use stalls by creating additional independent instructions within the loop body. We will want to use different registers for each iteration, increasing the required number of registers. Without scheduling, every operation in the unrolled loop is followed by a dependent operation and thus will cause a stall.
                     This loop will run in 27 clock cycles—each LD has 1 stall, each ADDD 2, the DADDUI 1, plus 14 instruction issue cycles—or 6.75 clock cycles for each of the four elements, but it can be scheduled to improve performance significantly
 

1 comment:

  1. Well you need to organize stuff a bit! It is kinda messy.
    Cheers!

    ReplyDelete