DOC PREVIEW
MIT 6 035 - Optimizer

This preview shows page 1-2-3-20-21-40-41-42 out of 42 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 42 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 42 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 42 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 42 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 42 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 42 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 42 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 42 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 42 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Slide 1Slide 2Slide 3Slide 4Slide 5Slide 6Slide 7Slide 8Slide 9Slide 10Slide 11Slide 12Slide 13Slide 14Slide 15Slide 16Slide 17Slide 18Slide 19Slide 20Slide 21Slide 22Slide 23Slide 24Slide 25Slide 26Slide 27Slide 28Slide 29Slide 30Slide 31Slide 32Slide 33Slide 34Slide 35Slide 36Slide 37Slide 38Slide 39Slide 40Slide 41Project 5: Optimizer Jason AnselOverview • Project guidelines • Benchmarking Library • OoO CPUsProject Guidelines • Use optimizations from lectures as your arsenal – If you decide to implement one, look at Whale / Dragon • Use the provided programs to substantiate your implementation decisions. – Benchmark the provided programs on the target architecture. – Hand-implement the transformation first. – Don't waste time with ineffectual transformations. • Cover all of the transformations discussed in class, at the very least qualitatively.Project Guidelines • An analysis of each optimization considered `(you can group optimizations if you feel they are symbiotic). • Give reasons for your benchmarking results given your knowledge of the target architecture. • Analyze your generated assembly. Look for non-traditional peephole optimizations. • For final report, describe your full optimizations option and the ideas/experiments behind the implementation. – Phase order, convergence?Project Guidelines • Use GCC for experimentation – In many cases you can do better • Writeup needs to be very detailed • For each optimization: 1. Prove that it is a win by hand applying andbenchmarking 2. Give a detailed explanation of the implementation 3. Argue that it is general and correct • Just Step 1 for Design ProposalLibrary • You are provided with a library – lib6035.a – In /mit/6.035/provided/lib • You need to link against it and pthreads – gcc4 program.s lib6035.a –lpthread – Important to put your .s first! • Gives you ability to accurately benchmark your code, spawn and join multiple threads and read/write images (pgm)Benchmarking • We will use wall time to run your benchmark – Measured in us • You can benchmark the entire program and also a section of code you denote • Use start_caliper and end_calipercalls – The library will print out the time passed between the callsBenchmarking Example class Program { … void main () { read (); callout("start_caliper"); invert (); callout(“end_caliper"); write (); }} silver % ./negative Timer: 1352 usecs silver %AMD Opteron 270 • ~233 million transistors • 90nm • 2.0 GHz • Dual Core Image removed due to copyright restrictions. • 2 Processors per board • 64 KB L1 Ins Cache • 64 KB L1 data cache • 2 MB on-chip L2 cache • 12 Integer pipeline stagesOpteron’s Key Features • Multiple issue (3 instructions per cycle) • Register renaming • Out-of-order execution • Branch prediction • Speculative execution • Load/Store buffering • 2x Dual coreWhat is a Superscalar? • Any (scalar) processor that can “execute” more that one instruction per cycle. • Multiple instructions can be fetched per cycle • Decode logic can decide which instructions are independent – Decide when an instruction is ready to fire • Multiple functional units • All this for instruction-level parallelism!Multiple Issue • The Opteron is a 3-way superscalar: – decode & execute & retire 3 x86-64 instructions per cycle Instr'nTLBLevel 1 Instr'n CacheFetch 2 - TransitPickDecode 1 Decode 1 Decode 1Decode 2Decode 2Decode 2Pack Pack PackDecode Decode Decode8 - EntryScheduler8 - EntryScheduler8 - EntryScheduler36 - EntrySchedulerAGU AGUALU ALU ALU FADD FMUL FMISCAGU2kBranchTargets16kHistoryCounterRAS&Target AddressDataTLBLevel 1 Data Cache ECCLevel 2CacheL2 ECCL2 TagsL2 Tag ECCSystem RequestQueue (SRQ)Cross Bar(XBAR)Memory Controller&HyperTransportTMImage by MIT OpenCourseWare.Dependencies • Remember there are 3 types of dependencies: – True dependence • read after write (RAW) a = b + c c = a + b – Anti-dependence … • write after read (WAR) c = g + h – Output dependence • write after write (WAW)Register Renaming • Used to eliminate artificial dependencies – Anti-dependence (WAR) – Output dependence (WAW) • Basic rule – All instructions that are in-flight write to a distinct destination registerRegister Renaming • Registers of x86-64 ISA (%rax, %rsi, %r11, etc) are logical registers • When an instruction is decoded, its logical register destination is assigned to a physical register • The total number of physical (rename) registers is: – instructions_inflight + architectural regs = – 72 + 16 = 88 (plus some others)Register Renaming • Hardware maintains a mapping between the logical regs and the rename regs – Reorder Buffer (72 entries) • Architectural state is in 16 entry register file – 2 entries for each logical register – One speculative and one committed – Maintains architectural visible state – Used for exceptions and miss-predictionOut-of-Order Execution • The Opteron can also execute instructionsout of program order – Respect the data-flow (RAW) dependencies between instructions • Hardware schedulers executes an instruction whenever all of its operands areavailable – Not program order, dataflow order! • The Reorder Buffer orders instructions back into program orderOut-of-Order Execution • When an instruction is placed in the reorder buffer: – Get its operands • if no inflight writer of operand, get it from architectural register • if inflight writer, get writer’s ID from state and wait for value to be produced – Rename dest register, update state to remember this is most recent writer of dest • Reorder Buffer: – Forward results of instruction to consumers in reorder buffer – Write value to speculative architectural registerBranch Prediction • The outcome of a branch is known late in the pipeline • We don’t need to stall the pipeline while the branch is being resolved • Use local and global information to predict the nextPC for each instruction – Return address stack for ret instruction • Extremely accurate for integer codes – >95% in studies • Pipeline continues with the predicted PC as its nextPCSpeculative Execution • Continue execution of program with predicted next PC for conditional branches • An instruction can commit its result to architectural registers – In order: only after every prior instruction has committed – Means that slow instruction (ex. cache miss)


View Full Document

MIT 6 035 - Optimizer

Download Optimizer
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Optimizer and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Optimizer 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?