DOC PREVIEW
Berkeley COMPSCI 61C - Thread Level Parallelism

This preview shows pages 1-4, 29-33, and 60-63 out of 63 pages.


Unformatted text preview:

Slide titles (full 63-slide deck): You Are Here!; Agenda; Parallel Processing: Multiprocessor Systems (MIMD); Transition to Multicore; Multiprocessors and You; Potential Parallel Performance; Three Key Questions about Multiprocessors; Example: Sum Reduction; An Example with 10 Processors; Shared Memory and Caches; Course Organization; Midterm Results; EECS Grading Policy; Administrivia; CS 61C in the News; Keeping Multiple Caches Coherent; How Does HW Keep $ Coherent?; Name of Common Cache Coherency Protocol: MOESI; Cache Coherency and Block Size; Threads; Data Races and Synchronization; Lock and Unlock Synchronization; Possible Lock/Unlock Implementation; Possible Lock Problem; Help! Hardware Synchronization; Synchronization in MIPS; Test-and-Set; Test-and-Set in MIPS; Ultrasparc T1 Die Photo; Machines in 61C Lab; Randy's Laptop; OpenMP; OpenMP Programming Model; OpenMP Uses the C Extension Pragmas Mechanism; Building Block: the for loop; OpenMP: Parallel for pragma; Thread Creation; OMP_NUM_THREADS; Parallel Threads and Scope; Hello World in OpenMP; OpenMP Directives; OpenMP Critical Section; And In Conclusion.

CS 61C: Great Ideas in Computer Architecture (Machine Structures)
Thread Level Parallelism
Instructors: Randy H. Katz, David A. Patterson
http://inst.eecs.Berkeley.edu/~cs61c/sp11
Spring 2011 -- Lecture #15

You Are Here!
- Parallel Requests: assigned to a computer, e.g., search "Katz"
- Parallel Threads: assigned to a core, e.g., lookup, ads
- Parallel Instructions: >1 instruction at one time, e.g., 5 pipelined instructions
- Parallel Data: >1 data item at one time, e.g., add of 4 pairs of words
- Hardware descriptions: all gates functioning in parallel at the same time
(Diagram: the software/hardware stack from warehouse-scale computer and smartphone down through computer, core, instruction units, functional units, and logic gates -- "harness parallelism and achieve high performance". Project 3 and today's lecture sit at the core/data level.)

Agenda
- Multiprocessor Systems
- Administrivia
- Multiprocessor Cache Consistency
- Synchronization
- Technology Break
- OpenMP Introduction
- Summary

Parallel Processing: Multiprocessor Systems (MIMD)
- Multiprocessor (MIMD): a computer system with at least 2 processors
  1. Deliver high throughput for independent jobs via request-level or task-level parallelism
  2. Improve the run time of a single program that has been specially crafted to run on a multiprocessor -- a parallel processing program
- We now use the term "core" for processor ("multicore"), because "multiprocessor microprocessor" is redundant
(Diagram: processors with private caches connected to shared memory and I/O through an interconnection network.)

Transition to Multicore
(Figure: sequential application performance over time has flattened, driving the move to multicore.)

Multiprocessors and You
- The only path to performance is parallelism
  - Clock rates are flat or declining
  - SIMD: 2X width every 3-4 years (128b wide now; 256b in 2011, 512b in 2014?, 1024b in 2018?)
  - MIMD: add 2 cores every 2 years: 2, 4, 6, 8, 10, ...
- Key challenge: craft parallel programs that keep high performance on multiprocessors as the number of processors increases -- i.e., that scale
  - Scheduling, load balancing, time for synchronization, overhead for communication
- Project #3: fastest matrix multiply code on 8-processor (8-core) computers
  - 2 chips (sockets) per computer, 4 cores per chip

Potential Parallel Performance (Assuming SW Can Use It!)

  Year  Cores  SIMD bits/Core  Core * SIMD bits  Peak DP GFLOPs
  2003    2         128               256                4
  2005    4         128               512                8
  2007    6         128               768               12
  2009    8         128              1024               16
  2011   10         256              2560               40
  2013   12         256              3072               48
  2015   14         512              7168              112
  2017   16         512              8192              128
  2019   18        1024             18432              288
  2021   20        1024             20480              320

  Growth over the table: MIMD (+2 cores/2 yrs) 2.5X, SIMD (2X width/4 yrs) 8X, MIMD*SIMD 20X.

Three Key Questions about Multiprocessors
- Q1: How do they share data?
  - A single address space shared by all processors/cores
- Q2: How do they coordinate?
  - Processors coordinate/communicate through shared variables in memory (via loads and stores)
  - Use of shared data must be coordinated via synchronization primitives (locks) that allow access to the data by only one processor at a time
  - All multicore computers today are Shared Memory Multiprocessors (SMPs)
- Q3: How many processors can be supported?

Example: Sum Reduction
- Sum 100,000 numbers on a 100-processor SMP
  - Each processor has an ID: 0 ≤ Pn ≤ 99
  - Partition: 1,000 numbers per processor
  - Initial summation on each processor:

        sum[Pn] = 0;
        for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
          sum[Pn] = sum[Pn] + A[i];

- Now the partial sums must be added together
  - Reduction: divide and conquer
  - Half the processors add pairs, then a quarter, ...
  - Need to synchronize between reduction steps
- Reduction code (executes simultaneously in P0, P1, ..., P7):

        half = 8;
        repeat
          synch();
          if (half%2 != 0 && Pn == 0)
            sum[0] = sum[0] + sum[half-1];
            /* Conditional sum needed when half is odd;
               Processor0 gets the extra element */
          half = half/2; /* dividing line on who sums */
          if (Pn < half)
            sum[Pn] = sum[Pn] + sum[Pn+half];
        until (half == 1);

An Example with 10 Processors
- half = 10: each of P0-P9 holds its own partial sum, sum[P0]..sum[P9]
- half = 5: P0-P4 add in the partial sums held by P5-P9
- [The preview cuts off here, mid-slide.]

