The La Valium Processor
A CS152 Final Project
Berkeley, CA
12/9/99

By Nikhil Acharya, John Loo, Sam Wu, Eugenia Chien, West Yuet Suen
TA: Victor Wen

Table Of Contents
Introduction and Summary
Feature Descriptions
Performance Summary
Critical Path
Performance Analysis
Testing Philosophy
Appendix (please refer to supplemental file TheSecondFile.ps for the bulleted items below)
- Test Programs
- VHDL
- Online Logs
- References for branch prediction
Schematics (please refer to supplemental files <schematic name>.ps; there will be 18 of these PostScript files)

NOTE: Please find submitted the following files: Lab7 writeup.doc, appendix.zip (this contains all 18 <schematic name>.ps files), and TheSecondFile.ps.

Introduction and Summary

What Did We Do?

The main features of our final processor include a 5-stage, 2-way Super Scalar Datapath, a 2-level Branch Prediction Unit, and an optimized memory sub-system. The Branch Prediction Unit consists of a PHT of 2-bit up/down saturating counters, gselect indexing, and a PC-indexed target buffer. The memory sub-system was optimized by enabling instruction prefetching using a 4-word stream buffer, by incorporating a randomized cache replacement policy, and by modifying our DRAM controller to return control to the processor as soon as the data is returned from memory. Please refer to the Feature Descriptions section of our report for an in-depth description of the features just listed.

[Figure: Top-Level Block Diagram of the La Valium Processor. Blocks: Super Scalar Processor Core (Pipeline U, Pipeline V), Optimized Memory Sub-System, Branch Prediction Unit. Connections labeled Instructions and Data.]

Spring '99 Mystery Program Performance Summary: How it ran on La Valium

We present here just our processor's performance statistics when running last semester's Mystery Program, since an in-depth performance analysis follows later in the report.

Clock rate: 54 MHz
Execution time: 380.5 µs (20,569 cycles)
CPI: 3.5
Paired instructions: 2,642
Unpaired instructions: 3,197
D-Cache stalls: 4,750 cycles
I-Cache
stalls: 11,844 cycles

We have run the merge sort (.s) program on our processor as well, and it yielded results very similar to those for the Spring '99 final mystery program enumerated above.

Feature Descriptions

Super Scalar Processor Core

Motivation for superscalar

An in-order superscalar core was selected because we believed that it would yield good performance gains without excessive development time. We considered making a Tomasulo datapath, but the amount of labor necessary to create one was prohibitive. Also, the Tomasulo core would have had to be superscalar anyway if we wanted to reap the full benefits of out-of-order execution.

Initially, we started off with the idea of creating a superscalar, superpipelined processor, but it rapidly became clear that this was not a good idea. Superpipelining made cycle time the primary concern and made the design rather difficult. We also realized that the benefits of superpipelining an already superscalar core were minimal because of the limited ILP available in adjacent instructions. Superpipelining involved splitting the ALU across two stages, which meant that it was not possible to forward the result of an instruction to the one immediately following it. In a superscalar design this can be done 50% of the time, since instructions are executed in pairs rather than as a continuously overlapping stream. Since superscalar was better at extracting parallelism and only a limited amount of parallelism was available, there would be little ILP left for superpipelining to extract. Because superpipelining would largely fail to extract more parallelism, it would serve only to increase latency as it executed instructions serially. As a result, we decided to build a superscalar core with a short cycle time to reduce the latency between dependent adjacent instructions.

Pipeline organization

We produced a standard 5-stage superscalar pipeline. The two pipelines were symmetric except for the fact that one pipeline, the U pipeline, could only
execute the even instructions, while the V pipeline could only execute the odd instructions. Forcing odd instructions to execute in one pipeline and even instructions in the other reduced efficiency slightly, but it made the pipeline much easier to implement than if we had only required that, for a given stage, the instruction in one pipeline be one instruction earlier than the instruction in the other pipeline, as in the Pentium. For those cases where we only want to load one instruction into the pipeline, such as a jump to an odd address, correct behavior is ensured by invalidating the instruction introduced into the U (even) pipeline.

Branching in EX

In the interest of keeping the clock rate high, we decided to move the branching hardware out of the ID stage and into the EX stage. Placing the branching hardware in the ID stage incurs a high cost because the branch decision can only be made after all the forwarding logic has resolved. The branching logic potentially required an additional 5 ns, which is significant in the context of a sub-20 ns cycle time. Although moving branching later in the pipeline increases the miss penalty, it turns out that the clock rate increase offsets the extra penalty when loop lengths exceed two cycles, even at a 50% prediction rate. With reasonable amounts of branch prediction, the pipeline will provide superior performance quite readily.

Although it made a lot of sense to take conditional branches in the EX stage, there was no need to take jumps like j or jal in the EX stage as well, since we know that they are always taken. In spite of this, the pipeline takes those jumps in the EX stage to keep the PC-changing hardware uniform. We reduced the penalty of doing this by feeding the data from the jumps into the branch prediction unit as well. Theoretically, there would be near-perfect prediction rates, and we would not need a special jump unit to change the PC in the IF stage for zero-cycle-penalty jumps. As evidenced by the results, this was not entirely the case,
but the prediction was accurate enough to remove most of the penalty. It should be noted that jr-type jumps were taken in the EX stage, but their data was not passed to the branch prediction unit. This was necessary because the pipeline did not check whether the predicted branch target address was correct; it only verified taken or not taken.

Dealing with delay slots

Besides deciding where to take branches, we also had to deal with the issue of just how we would handle the delay slot. Although a delay slot
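
As an illustration of the prediction scheme named in the introduction (a PHT of 2-bit up/down saturating counters with gselect indexing, checked only as taken or not taken), the following is a minimal Python sketch. It is not the project's VHDL. The table sizes, the number of PC and history bits, the 4-byte instruction alignment, the initial counter state, and all names (GselectPredictor, predict, update, run_trace) are assumptions chosen for illustration, and the target buffer is left out.

```python
class GselectPredictor:
    """Pattern history table of 2-bit saturating counters, gselect-indexed."""

    def __init__(self, pc_bits=4, history_bits=2):
        # Sizes are assumptions for this sketch, not the project's values.
        self.pc_bits = pc_bits
        self.history_bits = history_bits
        self.history = 0  # global outcome history, newest outcome in bit 0
        # Counter states 0..3; start at 1 (weakly not taken), an assumption.
        self.pht = [1] * (1 << (pc_bits + history_bits))

    def _index(self, pc):
        # gselect: concatenate low PC bits with the global history bits.
        # The shift by 2 assumes 4-byte, word-aligned instructions.
        pc_part = (pc >> 2) & ((1 << self.pc_bits) - 1)
        return (pc_part << self.history_bits) | self.history

    def predict(self, pc):
        # Counter values 2 and 3 predict taken; 0 and 1 predict not taken.
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        # Count up on taken, down on not taken, saturating at 3 and 0.
        i = self._index(pc)
        if taken:
            self.pht[i] = min(3, self.pht[i] + 1)
        else:
            self.pht[i] = max(0, self.pht[i] - 1)
        # Shift the resolved outcome into the global history register.
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def run_trace(predictor, trace):
    """Return the number of correct predictions over (pc, taken) pairs."""
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct
```

For example, run_trace(GselectPredictor(), [(0x40, True)] * 10) counts how many of ten always-taken loop branches are predicted correctly. The first few mispredict while the history register fills and the selected counter warms up.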