Berkeley COMPSCI 152 - The La Valium Processor

This preview shows page 1-2-3-4-5 out of 16 pages.

The La Valium Processor
A CS152 Final Project
Berkeley, CA
12/9/99

By:
Nikhil Acharya
John Loo
Sam Wu
Eugenia Chien
West Yuet Suen

TA: Victor Wen

Table Of Contents:
Introduction and Summary
Feature Descriptions
Performance Summary
Critical Path
Performance Analysis
Testing Philosophy
Appendix: (Please refer to the supplemental file, “TheSecondFile.ps”, for the bulleted items below.)
- Test Programs
- VHDL
- Online Logs
- References for branch prediction
Schematics: Please refer to the supplemental files, “<schematic_name>.ps” => there are 18 of these PostScript files.

NOTE: Please find submitted the following files:
- Lab7_writeup.doc
- appendix.zip => this contains all 18 <schematic_name>.ps files and “TheSecondFile.ps”

Introduction and Summary:

What Did We Do?

The main features of our final processor include a 5-stage, 2-way superscalar datapath, a 2-level branch prediction unit, and an optimized memory sub-system. The branch prediction unit consists of a PHT of 2-bit up-down saturating counters, gSelect indexing, and a PC-indexed target buffer. The memory sub-system was optimized by enabling instruction prefetching using a 4-word stream buffer, by incorporating a randomized cache replacement policy, and by modifying our DRAM controller to return control to the processor as soon as the data is returned from memory.
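As a rough illustration of the predictor just described, the sketch below models a gSelect-indexed table of 2-bit up-down saturating counters. The table sizes (pc_bits, history_bits), the initial counter value, and all class and method names are assumptions made for the example; this section does not state the real parameters, and the target buffer is left out.

```python
class GselectPredictor:
    """Sketch of a gselect-indexed table of 2-bit saturating counters."""

    def __init__(self, pc_bits=4, history_bits=4):
        # pc_bits and history_bits are illustrative, not the report's sizes.
        self.pc_bits = pc_bits
        self.history_bits = history_bits
        self.history = 0  # global branch history shift register
        # Counter values run 0..3; starting at 1 (weakly not-taken) is arbitrary.
        self.pht = [1] * (1 << (pc_bits + history_bits))

    def _index(self, pc):
        # gselect: concatenate low-order word-address bits with the history bits.
        pc_part = (pc >> 2) & ((1 << self.pc_bits) - 1)
        return (pc_part << self.history_bits) | self.history

    def predict(self, pc):
        # Counter values 2 and 3 predict taken; 0 and 1 predict not taken.
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.pht[i] = min(3, self.pht[i] + 1)  # saturate at 3
        else:
            self.pht[i] = max(0, self.pht[i] - 1)  # saturate at 0
        # Shift the outcome into the global history register.
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


predictor = GselectPredictor()
for _ in range(8):  # a loop branch that is taken every time
    predictor.update(0x40, True)
print(predictor.predict(0x40))
```

Because the history register is part of the index, the same branch trains several counters until its history pattern settles, which is why the example needs a few iterations before the prediction becomes stable.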
Please refer to the “Feature Descriptions” section of our report for an in-depth description of the features just listed.

Top-Level Block Diagram of The La Valium Processor:

[Figure: top-level block diagram. Labels: Optimized Memory Sub-System; Instructions; Data; Pipeline U; Pipeline V; Super-Scalar Processor Core; Branch Prediction Unit.]

Spring 99 Mystery Program Performance Summary - How it ran on La Valium:

We present here just our processor’s performance statistics (when running last semester’s Mystery Program), since an in-depth performance analysis follows later in the report:
- Clock rate: 54 MHz
- Execution time: 380.5 µs = 20,569 cycles
- CPI: 3.5
- Paired instructions: 2,642
- Unpaired instructions: 3,197
- D-Cache stalls: 4,750 cycles
- I-Cache stalls: 11,844 cycles
We have run the “merge_sort.s” program on our processor as well, and it yielded results that were very similar to those for the Spring 99 final mystery program (enumerated above).

Feature Descriptions:

Super-Scalar Processor Core:

Motivation for superscalar: An in-order superscalar core was selected because we believed that it would yield good performance gains without excessive development time. We considered making a Tomasulo datapath, but the amount of labor necessary to create one was prohibitive. Also, the Tomasulo core would have had to be superscalar anyway if we wanted to reap the full benefits of out-of-order execution. Initially, we started off with the idea of creating a superscalar, superpipelined processor, but it rapidly became clear that this was not a good idea. Superpipelining made cycle time the primary concern and made the design rather difficult. We also realized that the benefits from superpipelining an already superscalar core were minimal because of the limited ILP available in adjacent instructions. Superpipelining involved splitting the ALU across two stages, which meant that it was not possible to forward the result from an instruction to the one immediately following it.
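As a quick cross-check of the performance summary listed earlier in this section, the following sketch recomputes the derived figures from the raw counts. It assumes that “paired instructions” counts individual instructions rather than pairs, which is an interpretation and not something the report states; the function name and dictionary keys are made up for the example.

```python
def summarize(clock_hz, cycles, paired, unpaired, dcache_stalls, icache_stalls):
    """Recompute derived statistics from the reported raw counts."""
    # Assumption: "paired" counts instructions, not instruction pairs.
    instructions = paired + unpaired
    stall_cycles = dcache_stalls + icache_stalls
    return {
        "exec_time_us": cycles / clock_hz * 1e6,
        "instructions": instructions,
        "cpi": cycles / instructions,
        "stall_fraction": stall_cycles / cycles,
        "non_stall_cycles": cycles - stall_cycles,
    }


# Values copied from the performance summary.
stats = summarize(
    clock_hz=54e6,
    cycles=20_569,
    paired=2_642,
    unpaired=3_197,
    dcache_stalls=4_750,
    icache_stalls=11_844,
)
for name, value in stats.items():
    print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")
```

Under that assumption the CPI comes out at about 3.5, matching the reported value, and the two cache-stall counts together account for roughly 80% of all cycles. The computed execution time is about 380.9 µs, slightly above the reported 380.5 µs, which would be consistent with the 54 MHz figure being rounded.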
In a superscalar design, by contrast, a result can be forwarded to the immediately following instruction 50% of the time, since instructions are executed in pairs rather than as a continuously overlapping stream. Since superscalar was better at extracting parallelism and there was only a limited amount of parallelism available, there would be little ILP left for superpipelining to extract. Since superpipelining would largely fail at extracting more parallelism, it would serve only to increase latency as it executed instructions serially. As a result, we decided to build a superscalar core with a short cycle time to reduce the latency between dependent adjacent instructions.

Pipeline organization: We produced a standard 5-stage superscalar pipeline. The two pipelines were symmetric except that one pipeline, the U pipeline, could only execute the even instructions, while the V pipeline could only execute the odd instructions. Forcing odd instructions to execute in one pipeline and even instructions in the other reduced efficiency slightly, but it made pipeline implementation much easier than the Pentium-style alternative, in which the only requirement is that, for a corresponding stage, the instruction in one pipeline be one instruction earlier than the instruction in the other pipeline. For those cases where we only want to load one instruction into the pipeline, such as jumping to an odd address, correct behavior would be ensured by invalidating the instruction introduced into the U, or even, pipeline.

Branching in EX: In the interest of keeping the clock rate high, we decided to move the branching hardware out of the ID stage into the EX stage. Placing the branching hardware in the ID stage incurs a high cost, because the branch decision can only occur after all the forwarding logic. Branching logic potentially required an additional 5 ns which, in the context of a sub-20 ns cycle time, was a lot.
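To put the 5 ns figure in perspective, the sketch below compares the time per loop iteration for a design that resolves branches in ID against one that resolves them in EX. Everything beyond the 5 ns and the 54 MHz clock is an assumption made for illustration: a one-cycle miss penalty for ID, a two-cycle miss penalty for EX, one branch per loop iteration, and the function names themselves. It is a back-of-the-envelope model, not the analysis used in the report.

```python
def iteration_time_ns(loop_cycles, cycle_ns, miss_penalty_cycles, miss_rate):
    """Average time for one loop iteration that ends in a single branch."""
    return (loop_cycles + miss_rate * miss_penalty_cycles) * cycle_ns


def breakeven_loop_cycles(ex_cycle_ns, extra_id_ns, id_penalty, ex_penalty, miss_rate):
    """Loop length (in cycles) at which both designs take the same time.

    Solves (L + m*p_ex) * t_ex == (L + m*p_id) * t_id for L,
    where t_id = t_ex + extra_id_ns.
    """
    id_cycle_ns = ex_cycle_ns + extra_id_ns
    return miss_rate * (ex_penalty * ex_cycle_ns - id_penalty * id_cycle_ns) / extra_id_ns


EX_CYCLE_NS = 1e9 / 54e6  # cycle time at the reported 54 MHz clock
EXTRA_ID_NS = 5.0         # extra delay quoted for branching in ID
ID_PENALTY = 1            # assumed miss penalty in cycles (hypothetical)
EX_PENALTY = 2            # assumed miss penalty in cycles (hypothetical)
MISS_RATE = 0.5           # i.e. a 50% prediction rate

print(breakeven_loop_cycles(EX_CYCLE_NS, EXTRA_ID_NS, ID_PENALTY, EX_PENALTY, MISS_RATE))
for loop_cycles in (1, 2, 4):
    t_ex = iteration_time_ns(loop_cycles, EX_CYCLE_NS, EX_PENALTY, MISS_RATE)
    t_id = iteration_time_ns(loop_cycles, EX_CYCLE_NS + EXTRA_ID_NS, ID_PENALTY, MISS_RATE)
    print(loop_cycles, round(t_ex, 1), round(t_id, 1))
```

With these assumed penalties the break-even point falls at roughly 1.35 cycles per iteration, so any loop of two or more cycles favors the faster clock of the EX design even when half of the branches are mispredicted.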
Although moving branching later into the pipeline increases the miss penalty, it turns out that the clock rate increase will offset the extra penalty when loop lengths exceed two cycles, even at a 50% prediction rate. With reasonable amounts of branch prediction, the pipeline will provide superior performance quite readily. Although it made a lot of sense to take conditional branches in the EX stage, there was no need to take jumps like j or jal in the EX stage as well, since we know that they are always taken. In spite of this, the pipeline would take those jumps in the EX stage to make the PC-changing hardware more uniform. We reduced the penalty of doing this by feeding the data from the jumps into the branch prediction unit as well. Theoretically, there would be near-perfect prediction rates, and we wouldn’t need to make a special jump unit to change the PC in the IF stage for zero-cycle-penalty jumps. As evidenced by the results, this was not entirely the case, but the prediction was accurate enough to remove most of the penalty. It should be noted that jr type jumps were taken in the EX stage, but its data was

