Berkeley COMPSCI 252 - Lecture 5 – Projects + Prerequisite Quiz

EECS 252 Graduate Computer Architecture
Lec 5 – Projects + Prerequisite Quiz

David Patterson
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~pattrsn
http://www-inst.eecs.berkeley.edu/~cs252

Review from last lecture #1/3: The Cache Design Space
• Several interacting dimensions
  – cache size
  – block size
  – associativity
  – replacement policy
  – write-through vs. write-back
  – write allocation
• The optimal choice is a compromise
  – depends on access characteristics
    » workload
    » use (I-cache, D-cache, TLB)
  – depends on technology / cost
• Simplicity often wins
(Figure: the cache design space sketched along the axes Cache Size, Block Size, and Associativity, graded from Bad to Good and Less to More for two factors.)
(A sketch of how these dimensions fix the cache geometry follows these review slides.)

Review from last lecture #2/3: Caches
• The Principle of Locality:
  – Programs access a relatively small portion of the address space at any instant of time.
    » Temporal locality: locality in time
    » Spatial locality: locality in space
• Three major categories of cache misses:
  – Compulsory misses: sad facts of life. Example: cold-start misses.
  – Capacity misses: increase cache size.
  – Conflict misses: increase cache size and/or associativity. Nightmare scenario: the ping-pong effect!
• Write policy: write-through vs. write-back
• Today CPU time is a function of (ops, cache misses) rather than just f(ops): this affects compilers, data structures, and algorithms. (A worked CPU-time example also follows these review slides.)

Review from last lecture #3/3: TLB, Virtual Memory
• Page tables map virtual addresses to physical addresses
• TLBs are important for fast translation
• TLB misses are significant in processor performance
  – funny times, as most systems can't access all of the second-level cache without TLB misses!
• Caches, TLBs, and virtual memory are all understood by examining how they deal with 4 questions:
  1) Where can a block be placed?
  2) How is a block found?
  3) What block is replaced on a miss?
  4) How are writes handled?
• Today VM allows many processes to share a single memory without having to swap all processes to disk; today VM protection is more important than the memory-hierarchy benefits, but computers remain insecure.

Problems with Sea Change
1. Algorithms, programming languages, compilers, operating systems, architectures, libraries, … are not ready for 1000 CPUs / chip.
2. Software people don't start working hard until the hardware arrives.
   • 3 months after the HW arrives, SW people list everything that must be fixed, then we all wait 4 years for the next iteration of HW/SW.
3. How do we get 1000-CPU systems into the hands of researchers so they can innovate in a timely fashion on algorithms, compilers, languages, OS, architectures, … ?
4. How do we skip the waiting years between HW/SW iterations?
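The cache-design-space dimensions reviewed above (cache size, block size, associativity) jointly fix a cache's geometry. A minimal Python sketch of that relationship follows; the 32 KB / 64 B / 4-way figures are illustrative assumptions, not numbers from the lecture.

```python
# Sketch: derive cache geometry from the design-space dimensions on the
# "Cache Design Space" slide. Parameter values here are illustrative only.

def cache_geometry(cache_bytes, block_bytes, associativity, addr_bits=32):
    """Return (num_sets, offset_bits, index_bits, tag_bits) for a set-associative cache."""
    num_blocks = cache_bytes // block_bytes
    num_sets = num_blocks // associativity
    offset_bits = block_bytes.bit_length() - 1   # log2(block size), powers of two assumed
    index_bits = num_sets.bit_length() - 1       # log2(number of sets)
    tag_bits = addr_bits - index_bits - offset_bits
    return num_sets, offset_bits, index_bits, tag_bits

# Example: a hypothetical 32 KB, 4-way set-associative cache with 64-byte blocks.
print(cache_geometry(32 * 1024, 64, 4))   # -> (128, 6, 7, 19)
```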
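To make concrete the point that CPU time is now a function of (ops, cache misses), here is a small sketch of the standard memory-stall model, CPU time = (base cycles + stall cycles) × cycle time. Every parameter value below is a hypothetical illustration, not taken from the lecture.

```python
# Sketch of CPU time = (ideal CPU cycles + memory-stall cycles) / clock rate,
# illustrating why cache misses, not just instruction count, drive performance.
# All numbers below are hypothetical.

def cpu_time(instructions, base_cpi, mem_refs_per_instr, miss_rate,
             miss_penalty_cycles, clock_hz):
    stall_cycles = instructions * mem_refs_per_instr * miss_rate * miss_penalty_cycles
    total_cycles = instructions * base_cpi + stall_cycles
    return total_cycles / clock_hz

# 1e9 instructions, CPI 1.0, 1.4 memory refs/instr, 2% miss rate, 100-cycle penalty, 2 GHz:
print(cpu_time(1e9, 1.0, 1.4, 0.02, 100, 2e9))   # ~1.9 s, vs. 0.5 s with a perfect cache
```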
Build Academic MPP from FPGAs
• Since ~25 CPUs fit in one Field Programmable Gate Array, build a 1000-CPU system from ~40 FPGAs?
  – 16 simple 32-bit "soft core" RISC processors at 150 MHz in 2004 (Virtex-II)
  – FPGA generations arrive every 1.5 years: ~2X CPUs, ~1.2X clock rate per generation (a projection of this scaling appears after the sanity check below)
• The HW research community does the logic design ("gate shareware") to create an out-of-the-box MPP
  – e.g., a 1000-processor, standard-ISA, binary-compatible, 64-bit, cache-coherent supercomputer @ 200 MHz/CPU in 2007
  – RAMPants: Arvind (MIT), Krste Asanović (MIT), Derek Chiou (Texas), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley, Co-PI), Jan Rabaey (Berkeley), and John Wawrzynek (Berkeley, PI)
• "Research Accelerator for Multiple Processors" (RAMP)

Characteristics of an Ideal Academic CS Research Supercomputer?
• Scale – hard problems at 1000 CPUs
• Cheap – fits 2006 funding of academic research
• Cheap to operate, small, low power – $ again
• Community – share SW, training, ideas, …
• Simplifies debugging – high SW churn rate
• Reconfigurable – test many parameters, imitate many ISAs, many organizations, …
• Credible – results translate to real computers
• Performance – run real OS and full apps, get results overnight

Why RAMP Good for Research MPP?

Criterion                        SMP                   Cluster                Simulate                RAMP
Scalability (1k CPUs)            C                     A                      A                       A
Cost (1k CPUs)                   F ($40M)              C ($2-3M)              A+ ($0M)                A ($0.1-0.2M)
Cost of ownership                A                     D                      A                       A
Power/Space (kilowatts, racks)   D (120 kw, 12 racks)  D (120 kw, 12 racks)   A+ (0.1 kw, 0.1 racks)  A (1.5 kw, 0.3 racks)
Community                        D                     A                      A                       A
Observability                    D                     C                      A+                      A+
Reproducibility                  B                     D                      A+                      A+
Reconfigurability                D                     C                      A+                      A+
Credibility                      A+                    A+                     F                       A
Performance (clock)              A (2 GHz)             A (3 GHz)              F (0 GHz)               C (0.1-0.2 GHz)
GPA                              C                     B-                     B                       A-

RAMP 1 Hardware
BEE2: Berkeley Emulation Engine 2, by John Wawrzynek and Bob Brodersen with students Chen Chang and Pierre Droz
• Completed Dec. 2004 (14x17 inch, 22-layer PCB)
• Module:
  – 5 Virtex-II FPGAs, 18 banks of DDR2-400 memory, 20 10GigE connectors
  – Administration/maintenance ports:
    » 10/100 Ethernet
    » HDMI/DVI
    » USB
  – ~$4K in bill of materials (without FPGAs or DRAM)

Multiple Module RAMP 1 Systems
• 8 compute modules (plus power supplies) in an 8U rack-mount chassis
• 2U single-module tray for developers
• Many topologies possible
• Disk storage via a disk emulator + Network Attached Storage

Quick Sanity Check
• BEE2 uses old FPGAs (Virtex-II), with 4 banks of DDR2-400 per FPGA
• 16 32-bit MicroBlazes per Virtex-II FPGA, 0.75 MB of memory for caches
  – 32 KB direct-mapped I-cache, 16 KB direct-mapped D-cache
• Assume 150 MHz and a CPI of 1.5 (4-stage pipe)
  – I$ miss rate is 0.5% for SPECint2000
  – D$ miss rate is 2.8% for SPECint2000, with 40% loads/stores
• BW need/CPU = 150 MHz / 1.5 CPI × 4 B × (0.5% + 40% × 2.8%) ≈ 6.4 MB/sec
• BW need/FPGA = 16 × 6.4 ≈ 100 MB/sec
• Memory BW/FPGA = 4 × 200 MHz × 2 × 8 B = 12,800 MB/sec
• Plenty of room for tracing, … (the sketches below redo this arithmetic and the FPGA scaling projection)
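A short Python sketch that redoes the sanity-check arithmetic above, using only the figures stated on the slide (the 6.5 MB/s result matches the slide's ~6.4 MB/s up to rounding).

```python
# Recompute the "Quick Sanity Check" bandwidth estimates from the slide above.

clock_hz      = 150e6      # MicroBlaze clock
cpi           = 1.5        # cycles per instruction (4-stage pipe)
bytes_per_ref = 4          # per-miss traffic figure used on the slide
icache_miss   = 0.005      # 0.5% I-cache miss rate (SPECint2000)
dcache_miss   = 0.028      # 2.8% D-cache miss rate (SPECint2000)
load_store    = 0.40       # 40% of instructions are loads/stores
cpus_per_fpga = 16

instr_per_sec   = clock_hz / cpi
misses_per_inst = icache_miss + load_store * dcache_miss
bw_per_cpu      = instr_per_sec * bytes_per_ref * misses_per_inst    # bytes/sec
bw_per_fpga     = cpus_per_fpga * bw_per_cpu

# 4 banks of DDR2-400: 200 MHz * 2 transfers/cycle * 8 bytes per bank
mem_bw_per_fpga = 4 * 200e6 * 2 * 8

print(f"BW need/CPU   : {bw_per_cpu / 1e6:.1f} MB/s")       # ~6.5 MB/s (slide: ~6.4)
print(f"BW need/FPGA  : {bw_per_fpga / 1e6:.0f} MB/s")      # ~104 MB/s (slide: ~100)
print(f"Memory BW/FPGA: {mem_bw_per_fpga / 1e6:.0f} MB/s")  # 12,800 MB/s
```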
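The "Build Academic MPP from FPGAs" slide assumes a new FPGA generation every 1.5 years with ~2X CPUs and ~1.2X clock rate per generation; compounding that from the 2004 Virtex-II figures (16 soft cores at 150 MHz) lands near the slide's 2007 target of ~200 MHz/CPU. A sketch of that projection:

```python
# Project soft-core count and clock rate across FPGA generations, using the
# scaling assumptions on the "Build Academic MPP from FPGAs" slide:
# one generation every 1.5 years, ~2X CPUs and ~1.2X clock per generation.

cpus, clock_mhz, year = 16, 150.0, 2004.0   # Virtex-II starting point from the slide
for _ in range(2):                          # two generations: 2004 -> 2005.5 -> 2007
    cpus      *= 2
    clock_mhz *= 1.2
    year      += 1.5
    print(f"{year:.1f}: ~{cpus} CPUs/FPGA at ~{clock_mhz:.0f} MHz")

# 2007.0: ~64 CPUs/FPGA at ~216 MHz, roughly consistent with the slide's target
# of a 1000-processor system running at ~200 MHz/CPU.
```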
RAMP Development Plan
1. Distribute systems internally for RAMP 1 development
   – Xilinx agreed to pay for production of a set of modules for the initial contributing developers and the first full RAMP system
   – Others could be available if costs can be recovered
2. Release a publicly available out-of-the-box MPP emulator
   – Based on a standard ISA (IBM Power, Sun SPARC, …) for binary compatibility
   – Complete OS/libraries
   – Locally modify RAMP as desired
3. Design the next-generation platform for RAMP 2
   – Based on 65nm FPGAs (2 generations later than Virtex-II)
   – Pending results from RAMP 1 and Xilinx

