Definitely Outta Hand (D'OH)

Andrew Yee, Richard Allen, Nathan Wooster, James Hillman
TA: Kelvin Lwin
December 9, 1999

1. Introduction and Summary

a. Feature Summary

The Definitely Outta Hand (D'OH) processor is based on the MIPS instruction set. Its features include a five-stage pipeline with forwarding, two 16-word, 2-way set-associative caches (data and instruction) implementing a write-back policy with fast page mode DRAM access, a 4-block victim cache, and non-blocking loads with a reorder buffer.

b. Overall top-level block diagram of the processor

[Block diagram not reproduced. Blocks shown: Processor, Instruction Cache, Instruction Cache Controller, Data Cache, Data Cache Controller, MSHR, Victim Cache, Victim Cache Controller, Arbiter, DRAM Controller.]

c. A performance summary for the final test programs

Using lab6 mystery as a benchmark:
Lab 6 Processor CPI: 5.43
Lab 7 Processor CPI: 4.93 (9.2% performance increase)

2. Description of Features

Write-back policy: When a block in the cache is replaced by another block coming up from memory, we look at the dirty bit associated with the block. If the dirty bit is set, we need to write the block back to main memory, since the copy in the cache is more up to date. We chose this option because it saves time when the block has not been written to (is not dirty). With the other option, write-through, we would write to main memory regardless of whether the block had been modified.

Fast page mode: We hoped to gain an advantage in speed by grabbing two words from DRAM using two CAS_L signals in rapid succession. This is more complex, since we can only grab successive even/odd address pairs rather than grabbing just the needed address one word at a time.

Victim cache: This cache holds 8 words of data (4 blocks) and sits between the data cache and the DRAM module. We implemented it to minimize the ping-pong effect, where data items bounce between the data cache and memory.

Non-blocking loads (MSHR): We chose this so that instructions that do not depend on the result of a load can pass the memory stage and go directly to the reorder buffer, instead of
having the entire pipeline stall on any load, regardless of dependencies. This can potentially save cycles.

Reorder buffer: This feature is closely coupled with the non-blocking load scheme. As instructions that do not depend on the load go past the MSHR, we need to put them into the reorder buffer so that their results are committed to the register file in the proper order. This reduces stalls after a load word.

3. Performance Summary

a. Critical Path

i. Top 3 critical paths in the processor

1. DRAM controller: 26 ns per half cycle (52 ns)
2. Transition from stall to no stall: 26 ns per half cycle (52 ns)
   Path: MSHR -> DcacheCtrl -> MSHR -> StallCtrl -> MUX -> Reorder Buffer
3. JAL: 16 ns per half cycle (32 ns)
   Path: register -> MUX -> adder -> Reorder Buffer

ii. Latencies from memory

There is only a 10 ns latency to get data or instructions from the cache. There is a 9-cycle latency from a cache miss until the data or instruction is returned. If the missed item happens to be in the victim cache, the latency is only 4 cycles. Furthermore, there is a five-cycle latency for an instruction to get through the pipeline, provided no stalls occur. If there is a load word and an instruction bypasses it because of our non-blocking load system, there could be up to a 9-cycle delay before that instruction's result is written from the reorder buffer to the register file.

b. Performance Analysis

i. Comparison with the Lab 6 processor

Using lab6 mystery as a benchmark:
Lab 6 Processor CPI: 5.43
Lab 7 Processor CPI: 4.93 (9.2% performance increase)
Lab 6 cycle time: 52 ns
Lab 7 cycle time: 52 ns (0% performance increase)
Lab 6 executed 2118 instructions
Lab 7 executed 2118 instructions
Lab 6 took 11500 cycles
Lab 7 took 10441 cycles
Lab 6 took 598,000 ns
Lab 7 took 542,932 ns

Using Lab5 mystery as a benchmark: almost identical performance.

ii. Explanations for better/worse performance

The Lab 7 processor performed better mostly due to a combination of the non-blocking load scheme and the program we used as a benchmark. Since the lab6
mystery program relied on sequential, non-dependent loads, non-blocking loads were very beneficial. On a different program the results would be less encouraging.

Our performance was greatly diminished by the fact that we improperly performed operations on both edges of the clock within the memory system. Because of the way we allowed rising-edge operations to leak into the processor, logic in several spots had to complete in half a cycle, thus doubling our cycle time.

4. Testing Philosophy

First, we tested the individual blocks before integrating them into the rest of the processor. We used testbenches and command files with stimuli, probed the output, and compared it with our expected output. We attempted to test all the common cases as well as any special cases we could think of.

The memory system was tested in multiple layers: we added it one component at a time, testing the system before the next component was added.

We used the mystery programs from the previous labs as tests for our processor. Looking at waveforms for timing issues was very useful, since we could probe values on buses at specific times. The annotations with values on the schematic were also very useful at times.

Finally, to verify the correct output from our processor, we ran the mystery programs on SPIM and compared the output.
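
As a cross-check on the figures in section 3(b), the following sketch recomputes CPI and execution time from the reported instruction count, cycle counts, and cycle time. The function and constant names are illustrative and are not part of the original lab tooling. It also assumes that the 9.2% figure is the relative reduction in cycle count, which is how the reported numbers appear to work out.

```python
# Figures as reported in section 3(b) for the lab6 mystery benchmark.
INSTRUCTIONS = 2118   # same instruction count on both processors
CYCLE_TIME_NS = 52    # same cycle time on both processors
CYCLES = {"Lab 6": 11500, "Lab 7": 10441}


def cpi(cycles: int, instructions: int) -> float:
    """Average cycles per instruction."""
    return cycles / instructions


def exec_time_ns(cycles: int, cycle_time_ns: int) -> int:
    """Total execution time: cycle count times cycle time."""
    return cycles * cycle_time_ns


def cycle_reduction(old_cycles: int, new_cycles: int) -> float:
    """Fraction of cycles saved relative to the older design (assumed metric)."""
    return (old_cycles - new_cycles) / old_cycles


if __name__ == "__main__":
    for name, cycles in CYCLES.items():
        print(f"{name}: CPI = {cpi(cycles, INSTRUCTIONS):.2f}, "
              f"time = {exec_time_ns(cycles, CYCLE_TIME_NS):,} ns")
    saved = cycle_reduction(CYCLES["Lab 6"], CYCLES["Lab 7"])
    print(f"Cycles saved by Lab 7: {saved:.1%}")
```

Under these assumptions the computed values agree with the reported CPIs of 5.43 and 4.93 and with the reported execution times. Because the instruction count and cycle time are identical for both designs, the improvement in execution time is the same as the improvement in cycle count.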