Pipelining

Pipelining is an implementation technique whereby multiple instructions are overlapped in execution; it is a key to building fast CPUs.

Features:
1) Exploits parallelism among instructions in a sequential stream.
2) Not visible to the programmer.
3) Increases throughput, but does not reduce the execution time of an individual instruction.
4) The clock must run at the speed of the slowest stage plus overhead.

The time per instruction on an ideal pipelined machine:

    Time per instruction on nonpipelined machine
    --------------------------------------------
                Number of pipe stages

1. Basic Pipeline for DLX

Five basic execution steps:
1) IF:  instruction fetch
2) ID:  instruction decode and register fetch
3) EX:  execution and effective address calculation
4) MEM: memory access
5) WB:  write back

Example: A nonpipelined machine has stage times 5 ns, 5 ns, 6 ns, 5 ns, 5 ns, and pipelining adds 0.5 ns of overhead per stage. What is the speedup of pipelined execution? (Nonpipelined, an instruction takes 5+5+6+5+5 = 26 ns; pipelined, the clock runs at the slowest stage plus overhead, 6 + 0.5 = 6.5 ns; speedup = 26/6.5 = 4.)

New requirements for the data path:
1) The PC must be incremented on every clock cycle (in IF).
2) An instruction must be fetched on every clock cycle (also in IF).
3) A new data word may be needed on every clock cycle (in MEM).
4) Separate MDRs are needed for loads and stores.
5) Three additional latches are needed to hold the instruction, the ALU output, and the next PC.

Impact on memory: memory bandwidth must be increased five times, because two memory accesses (instruction and data) can be required on every clock cycle, versus two accesses every five clock cycles on the nonpipelined machine.

2. Pipeline Hazards

Situations that prevent the next instruction from executing:
1) Structural hazards: resource conflicts.
2) Data hazards: data dependences.
3) Control hazards: branch effects.

2.1. Structural hazards - resource conflicts

A structural hazard occurs when 1) some functional unit is not fully pipelined, or 2) some resource has not been duplicated enough.

Solutions: stall, or add more hardware.

Example for 2): with a single memory port, the machine cannot fetch an instruction in the same cycle that a load accesses data memory, so the fetch must stall:
          1  2  3  4     5  6  7  8  9
load      IF ID EX MM    WB
i+1          IF ID EX    MM WB
i+2             IF ID    EX MM WB
i+3                stall IF ID EX MM WB

Floating-point pipeline: FP operations may not be worth pipelining if they do not occur clustered together.

2.2. Data hazards - dependences

Three types of data hazards:
1) RAW - read after write
2) WAR - write after read
3) WAW - write after write

Example:

    ADD R1, R2, R3
    SUB R4, R1, R5

         1  2  3  4  5  6
    ADD  IF ID EX MM WB          (R1 written in WB, cycle 5)
    SUB     IF ID EX MM WB       (R1 read in ID, cycle 3, before it is written)

Main solution: forwarding.

Pipeline interlock - hardware detects a hazard and stalls the pipeline until the hazard is cleared. Even with forwarding, a load followed immediately by a use of its result must stall, because the loaded value is not available until the end of MEM:

                     1  2  3  4     5  6  7  8  9
    LD  R1, 32(R6)   IF ID EX MM    WB
    ADD R4, R1, R7      IF ID stall EX MM WB
    SUB R5, R1, R8         IF stall ID EX MM WB
    AND R6, R1, R7            stall IF ID EX MM WB

The pipeline stall costs one clock cycle in this case.

Pipeline scheduling - the compiler tries to schedule the instructions to avoid stalls. For example:

    a = b + c;
    d = e - f;

can be compiled to the scheduled code:

    LD  Rb, b
    LD  Rc, c
    LD  Re, e
    ADD Ra, Rb, Rc
    LD  Rf, f
    ST  a, Ra
    SUB Rd, Re, Rf
    ST  d, Rd

Two interlocks are avoided: neither ADD nor SUB immediately follows the load that produces its operand.

2.3. Control hazards - branch effects

The branch outcome is not known until late in the pipeline, so three cycles are lost per branch:

             1  2  3     4     5     6  7  8  9  10
    branch   IF ID EX    MM    WB
    i+1         IF stall stall IF    ID EX MM WB
    i+2            stall stall stall IF ID EX MM WB
    i+3            stall stall stall    IF ID EX MM WB
    i+4            stall stall stall       IF ID EX MM WB

Solution: early action. Two steps reduce the number of stall cycles:
1) Find out whether the branch is taken or not.
2) Compute the taken PC (branch target address).
Both steps should be performed as early as possible:

    IF:  IR  <--- Mem[PC]
         PC  <--- PC + 4
    ID:  A   <--- Rs1; B <--- Rs2
         BTA <--- PC + ((IR16)^16 ## IR16..31)
         if (cond) PC <--- BTA

(The 16-bit branch offset, IR bits 16..31, is sign-extended and added to the already-incremented PC. Two operations are shifted from EX and MEM into ID, which reduces the stall cycles from 3 to 1.
A separate adder is required to perform the addition during ID.)

Four approaches to reducing branch hazards:

1) Simplest: stall all the way (as shown above).

2) Predict-not-taken: keep fetching sequentially as if the branch were not taken; with early resolution in ID, this costs one stall cycle for each taken branch and none for an untaken branch:

                   1  2  3     4  5  6  7  8
    taken branch   IF ID EX    MM WB
    i+1               IF IF    ID EX MM WB      (first fetch squashed; refetched from the target)
    i+2                  stall IF ID EX MM WB
    i+3                  stall    IF ID EX MM WB
    i+4                  stall       IF ID EX MM WB

3) Predict-taken: not discussed here.

4) Delayed branch: the instruction(s) placed in the delay slot(s) after the branch execute whether or not the branch is taken, so no stall is needed. Software (the compiler) must make sure the instructions placed there are valid and useful.
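As a quick numeric check of the speedup example in Section 1, here is a short Python sketch (the variable names are mine, not part of the notes):

```python
# Numeric check of the Section 1 example. Nonpipelined, one instruction
# takes the sum of all stage times; pipelined, the clock must run at the
# slowest stage plus the per-stage latch overhead.
stage_times_ns = [5, 5, 6, 5, 5]      # IF, ID, EX, MEM, WB
overhead_ns = 0.5                     # latch/clock-skew overhead per stage

nonpipelined_ns = sum(stage_times_ns)                   # 26 ns per instruction
pipelined_clock_ns = max(stage_times_ns) + overhead_ns  # 6.5 ns per instruction
speedup = nonpipelined_ns / pipelined_clock_ns
print(speedup)  # -> 4.0
```

Note that the ideal speedup would equal the number of stages (5); the uneven stage times and the latch overhead reduce it to 4.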
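The load-use interlock of Section 2.2 can be sketched as a toy hazard detector. The (op, dest, sources) instruction encoding and the count_stalls helper below are hypothetical, for illustration only: with forwarding, only an instruction that consumes a load result in the very next cycle stalls, since the loaded value is not ready until the end of MEM.

```python
# Toy load-use hazard detector (hypothetical encoding, for illustration).
def count_stalls(program):
    """Count one-cycle load-use stalls in a list of (op, dest, sources)."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        # Stall only when a load's destination is read by the very next
        # instruction; later consumers get the value through forwarding.
        if prev_op == "LD" and prev_dest in curr[2]:
            stalls += 1
    return stalls

program = [
    ("LD",  "R1", ("R6",)),       # LD  R1, 32(R6)
    ("ADD", "R4", ("R1", "R7")),  # ADD R4, R1, R7  -- uses R1 right after the load: stall
    ("SUB", "R5", ("R1", "R8")),  # SUB R5, R1, R8  -- R1 arrives via forwarding: no stall
    ("AND", "R6", ("R1", "R7")),  # AND R6, R1, R7
]
print(count_stalls(program))  # -> 1
```

In real hardware this check happens during ID, comparing the load's destination register field against the source register fields of the following instruction.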