Pipelining

Pipelining is an implementation technique whereby multiple instructions are overlapped in execution; it is a key to building fast CPUs.

Features:
1) Exploits parallelism among instructions in a sequential stream.
2) Not visible to the programmer.
3) Increases throughput, but does not reduce the execution time of an individual instruction.
4) The clock must run at the speed of the slowest stage plus overhead.

The time per instruction on an ideal pipelined machine:

    Time per instruction on nonpipelined machine
    --------------------------------------------
                Number of pipe stages

1. Basic Pipeline for DLX

Five basic execution steps:
1) IF:  instruction fetch
2) ID:  instruction decode and register fetch
3) EX:  execution and effective address calculation
4) MEM: memory access
5) WB:  write back

Example: A nonpipelined machine has stage times 5 ns, 5 ns, 6 ns, 5 ns, 5 ns, and pipelining adds 0.5 ns of overhead per stage. What is the speedup of pipelined execution? (Nonpipelined, an instruction takes 5+5+6+5+5 = 26 ns; pipelined, the clock runs at the slowest stage plus overhead, 6 + 0.5 = 6.5 ns; speedup = 26/6.5 = 4.)

New requirements for the data path:
1) The PC must be incremented on every clock cycle (in IF).
2) An instruction must be fetched on every clock cycle (also in IF).
3) A new data word may be needed on every clock cycle (in MEM).
4) Separate MDRs are needed for loads and stores.
5) Three additional latches are needed to hold the instruction, the ALU output, and the next PC.

Impact on memory: memory bandwidth must be increased five times, because two memory accesses (instruction and data) can be required on every clock cycle, versus two accesses every five clock cycles on the nonpipelined machine.

2. Pipeline Hazards

Situations that prevent the next instruction from executing:
1) Structural hazards: resource conflicts.
2) Data hazards: data dependences.
3) Control hazards: branch effects.

2.1. Structural hazards - resource conflicts

A structural hazard occurs when 1) some functional unit is not fully pipelined, or 2) some resource has not been duplicated enough.

Solutions: stall, or add more hardware.

Example for 2): with a single memory port, the machine cannot fetch an instruction in the same cycle that a load accesses data memory, so the fetch must stall:
          1  2  3  4     5  6  7  8  9
load      IF ID EX MM    WB
i+1          IF ID EX    MM WB
i+2             IF ID    EX MM WB
i+3                stall IF ID EX MM WB

Floating-point pipeline: FP operations may not be worth pipelining if they do not occur clustered together.

2.2. Data hazards - dependences

Three types of data hazards:
1) RAW - read after write
2) WAR - write after read
3) WAW - write after write

Example:

    ADD R1, R2, R3
    SUB R4, R1, R5

         1  2  3  4  5  6
    ADD  IF ID EX MM WB          (R1 written in WB, cycle 5)
    SUB     IF ID EX MM WB       (R1 read in ID, cycle 3, before it is written)

Main solution: forwarding.

Pipeline interlock - hardware detects a hazard and stalls the pipeline until the hazard is cleared. Even with forwarding, a load followed immediately by a use of its result must stall, because the loaded value is not available until the end of MEM:

                     1  2  3  4     5  6  7  8  9
    LD  R1, 32(R6)   IF ID EX MM    WB
    ADD R4, R1, R7      IF ID stall EX MM WB
    SUB R5, R1, R8         IF stall ID EX MM WB
    AND R6, R1, R7            stall IF ID EX MM WB

The pipeline stall costs one clock cycle in this case.

Pipeline scheduling - the compiler tries to schedule the instructions to avoid stalls. For example:

    a = b + c;
    d = e - f;

can be compiled to the scheduled code:

    LD  Rb, b
    LD  Rc, c
    LD  Re, e
    ADD Ra, Rb, Rc
    LD  Rf, f
    ST  a, Ra
    SUB Rd, Re, Rf
    ST  d, Rd

Two interlocks are avoided: neither ADD nor SUB immediately follows the load that produces its operand.

2.3. Control hazards - branch effects

The branch outcome is not known until late in the pipeline, so three cycles are lost per branch:

             1  2  3     4     5     6  7  8  9  10
    branch   IF ID EX    MM    WB
    i+1         IF stall stall IF    ID EX MM WB
    i+2            stall stall stall IF ID EX MM WB
    i+3            stall stall stall    IF ID EX MM WB
    i+4            stall stall stall       IF ID EX MM WB

Solution: early action. Two steps reduce the number of stall cycles:
1) Find out whether the branch is taken or not.
2) Compute the taken PC (branch target address).
Both steps should be performed as early as possible:

    IF:  IR  <--- Mem[PC]
         PC  <--- PC + 4
    ID:  A   <--- Rs1; B <--- Rs2
         BTA <--- PC + ((IR16)^16 ## IR16..31)
         if (cond) PC <--- BTA

(The 16-bit branch offset, IR bits 16..31, is sign-extended and added to the already-incremented PC. Two operations are shifted from EX and MEM into ID, which reduces the stall cycles from 3 to 1.
A separate adder is required to perform the addition during ID.)

Four approaches to reducing branch hazards:

1) Simplest: stall all the way (as shown above).

2) Predict-not-taken: keep fetching sequentially as if the branch were not taken; with early resolution in ID, this costs one stall cycle for each taken branch and none for an untaken branch:

                   1  2  3     4  5  6  7  8
    taken branch   IF ID EX    MM WB
    i+1               IF IF    ID EX MM WB      (first fetch squashed; refetched from the target)
    i+2                  stall IF ID EX MM WB
    i+3                  stall    IF ID EX MM WB
    i+4                  stall       IF ID EX MM WB

3) Predict-taken: not discussed here.

4) Delayed branch: the instruction(s) placed in the delay slot(s) after the branch execute whether or not the branch is taken, so no stall is needed. Software (the compiler) must make sure the instructions placed there are valid and useful.
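As a quick numeric check of the speedup example in Section 1, here is a short Python sketch (the variable names are mine, not part of the notes):

```python
# Numeric check of the Section 1 example. Nonpipelined, one instruction
# takes the sum of all stage times; pipelined, the clock must run at the
# slowest stage plus the per-stage latch overhead.
stage_times_ns = [5, 5, 6, 5, 5]      # IF, ID, EX, MEM, WB
overhead_ns = 0.5                     # latch/clock-skew overhead per stage

nonpipelined_ns = sum(stage_times_ns)                   # 26 ns per instruction
pipelined_clock_ns = max(stage_times_ns) + overhead_ns  # 6.5 ns per instruction
speedup = nonpipelined_ns / pipelined_clock_ns
print(speedup)  # -> 4.0
```

Note that the ideal speedup would equal the number of stages (5); the uneven stage times and the latch overhead reduce it to 4.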
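The load-use interlock of Section 2.2 can be sketched as a toy hazard detector. The (op, dest, sources) instruction encoding and the count_stalls helper below are hypothetical, for illustration only: with forwarding, only an instruction that consumes a load result in the very next cycle stalls, since the loaded value is not ready until the end of MEM.

```python
# Toy load-use hazard detector (hypothetical encoding, for illustration).
def count_stalls(program):
    """Count one-cycle load-use stalls in a list of (op, dest, sources)."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        # Stall only when a load's destination is read by the very next
        # instruction; later consumers get the value through forwarding.
        if prev_op == "LD" and prev_dest in curr[2]:
            stalls += 1
    return stalls

program = [
    ("LD",  "R1", ("R6",)),       # LD  R1, 32(R6)
    ("ADD", "R4", ("R1", "R7")),  # ADD R4, R1, R7  -- uses R1 right after the load: stall
    ("SUB", "R5", ("R1", "R8")),  # SUB R5, R1, R8  -- R1 arrives via forwarding: no stall
    ("AND", "R6", ("R1", "R7")),  # AND R6, R1, R7
]
print(count_stalls(program))  # -> 1
```

In real hardware this check happens during ID, comparing the load's destination register field against the source register fields of the following instruction.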