EECS 252 Graduate Computer Architecture
Lec 21 – Goodbye to Computer Architecture, Bad Career Alternatives, & Cal Cultural History
David Patterson
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~pattrsn
http://vlsi.cs.berkeley.edu/cs252-s06

Outline
• Goodbye to Computer Architecture
• Part IV: Key Advice on Alternatives to a Bad Career
• (Post-Tenure) Career Advice from Richard Hamming, Ivan Sutherland, Karl Pister
• Your Cal Cultural Heritage
• Course Evaluations
• Return projects, final grades

Chapter 1: Introduction
• CS 252 to learn new skills, transition to research
• Computer science at the crossroads from sequential to parallel computing
  – Salvation requires innovation in many fields, including computer architecture
• Computer architecture skill sets are different:
  – Technology tracking and anticipation
  – Solid interfaces that really work
  – Quantitative approach to design
• 5 quantitative principles of design:
  1. Take advantage of parallelism
  2. Principle of locality
  3. Focus on the common case
  4. Amdahl's Law
  5. 
The Processor Performance Equation
• Computer architecture >> instruction sets

Chapter 2: Dynamic Execution Processor
• Leverage implicit parallelism for performance: instruction-level parallelism (ILP)
• Loop unrolling by compiler to increase ILP
• Branch prediction to increase ILP
• Dynamic HW exploiting ILP
  – Works when dependences can't be known at compile time
  – Can hide L1 cache misses
  – Code compiled for one machine runs well on another
• Reservation stations: renaming to a larger set of registers + buffering of source operands
  – Prevents registers from becoming a bottleneck
  – Avoids WAR, WAW hazards
  – Allows loop unrolling in HW
• Not limited to basic blocks (integer units get ahead, beyond branches)
• Helps cache misses as well
• Lasting contributions
  – Dynamic scheduling, register renaming, load/store disambiguation

Chapter 3: Static Instruction Level Parallelism
• Limits to ILP (power efficiency, compilers, dependencies, …) seem to cap practical options at 3 to 6 issue
• Explicit parallelism (data-level parallelism or thread-level parallelism) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Only on a big stall vs. 
every clock cycle
• Simultaneous multithreading (SMT) is fine-grained multithreading built on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse the rename registers
• Itanium/EPIC/VLIW is not a breakthrough in ILP
• Balance of ILP and TLP decided in marketplace

Appendix H: Vector Architecture
• Vector is an alternative model for exploiting ILP
• If code is vectorizable: simpler hardware, more energy efficient, and a better real-time model than out-of-order machines
• Design issues include number of lanes, number of functional units, number of vector registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
  – With virtual address translation and caching

Goodbye to Processors & Questions for 2011
• Did vector ever become popular for multimedia and/or signal processing?
• Did EPIC/VLIW die?
• Given the power limits and the switch to multiprocessors, did microarchitecture complexity:
  1. Continue to increase (>6 issue, >6 complete)?
  2. Freeze as of ~2002 (e.g., Power 4, Opteron)?
  3. Regress (e.g., Niagara)?
• Did fine-grained multithreading help with efficiency, or did it make parallel programming harder (since the scale of parallelism is larger) and decline in popularity?

Chapter 4: Multiprocessors
• Caches contain all information on the state of cached memory blocks
• Snooping cache over a shared medium for smaller MPs, invalidating other cached copies on a write
• Sharing cached data ⇒ coherence (which values are returned by a read), consistency (when a written value will be returned by a read)
• Snooping and directory protocols are similar; a bus makes snooping easier because of broadcast (snooping ⇒ uniform memory access)
• Directory has an extra data structure to keep track of the state of all cache blocks
• Distributing the directory ⇒ scalable shared-address multiprocessor ⇒ cache-coherent, non-uniform memory access

Goodbye to MultiProcessors & Questions for 2011
• Did processors / chip continue 2X every 2 yrs?
  – Desktop vs. servers vs. embedded
  – How many sockets can easily be put together?
• Any innovations in synchronization or communication taking advantage of cores on the same chip?
• Did everything remain cache coherent all the time, or can coherence be turned off in favor of pure message passing?
• Any changes to the ISA to make it easier to support parallel programming? (e.g., transactional memory?)
• Did we need new languages/compilers to use them? Threads vs. messages? Is MPI still popular?
• Did enhancements to performance accountability/predictability simplify parallel programming?
• Performance/Power only, or SPUR too?
• What % of peak did the PETAFLOPS machine deliver?

Chapter 5: Advanced Memory Hierarchy
• Memory wall inspires optimizations since so much performance is lost there
  – Reducing hit time: small and simple caches, way prediction, trace caches
  – Increasing cache bandwidth: pipelined caches, multibanked caches, nonblocking caches
  – Reducing miss penalty: critical word first, merging write buffers
  – Reducing miss rate: compiler optimizations
  – Reducing miss penalty or miss rate via parallelism: hardware prefetching, compiler prefetching
• "Auto-tuners": search replacing static compilation to explore the optimization space?
• DRAM: continuing bandwidth innovations: fast page mode, synchronous, double data rate

Chapter 5: Advanced Memory Hierarchy (cont.)
• Virtual machine revival
  – Overcome security flaws of modern OSes
  – Processor performance no longer the highest priority
  – Manage software, manage hardware
• Virtualization challenges for processor, virtual memory, I/O
  – Paravirtualization, ISA upgrades to cope with those difficulties
  – Xen as example VMM using paravirtualization

Goodbye to Memory Hierarchy & Questions for 2011
• What was the maximum number of levels of cache on chip? Off chip? (Did it go beyond 3?)
• Did local memory (IBM Cell) replace/enhance caches?
• Did DRAM latency improve? How much better is DRAM BW? Rate of improvement in capacity: 2X / 3 years?
• Did Virtual Machines become very popular? 
  – Which was the primary reason? Enhance security, manage hardware, manage software?
• Is software shipped inside a VM?
• Is x86 now as low
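Amdahl's Law, one of the quantitative principles recalled in Chapter 1, is easy to illustrate with a short calculation (a minimal sketch; the function name and the figures used are illustrative, not taken from the slides):

```python
def amdahl_speedup(parallel_fraction, n):
    """Overall speedup when a fraction f of execution time is sped up n-fold
    (Amdahl's Law): speedup = 1 / ((1 - f) + f / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)

# Even with 16 cores, a program that is only 80% parallelizable speeds up
# far less than 16x -- the serial 20% dominates.
print(amdahl_speedup(0.80, 16))     # 4.0
print(amdahl_speedup(0.80, 10**9))  # approaches the limit 1/0.2 = 5.0
```

This is why the multiprocessor questions above worry about the serial fraction: adding cores beyond a point buys almost nothing unless the parallel fraction also grows.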
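Likewise, the processor performance equation from Chapter 1 can be checked in a few lines (a sketch with made-up instruction counts and CPIs, not data from the course):

```python
def cpu_time(instruction_count, cpi, clock_hz):
    """Processor performance equation:
    CPU time = instruction count x cycles per instruction / clock rate."""
    return instruction_count * cpi / clock_hz

# Hypothetical comparison at the same 2 GHz clock: fewer, more complex
# instructions vs. more, simpler ones.
t_a = cpu_time(1.0e9, 2.0, 2.0e9)  # 1.0 s
t_b = cpu_time(1.5e9, 1.0, 2.0e9)  # 0.75 s
print(t_a, t_b)
```

The point of the equation ("Computer architecture >> instruction sets") is that instruction count, CPI, and clock rate trade off against each other, so no single factor decides performance.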