Berkeley COMPSCI C267 - Shared Memory Machines Programming Example: Sharks and Fish

This preview shows page 1-2-3-4-28-29-30-31-57-58-59-60 out of 60 pages.



Contents
• Basic Shared Memory Architecture
• Outline
• Processor-DRAM Gap (latency)
• Shared Memory Code for Computing a Sum s = f(A[0]) + f(A[1])
• Approaches to Building Parallel Machines
• Shared Cache: Advantages and Disadvantages
• Evolution of Shared Cache
• Intuitive Memory Model
• Sequential Consistency Intuition
• Memory Consistency Semantics
• If Caches are Not "Coherent"
• Snoopy Cache-Coherence Protocols
• Limits of Bus-Based Shared Memory
• Sample Machines
• Basic Choices in Memory/Cache Coherence
• SGI Altix 3000
• Cache Coherence and Sequential Consistency
• Programming with Weaker Memory Models than SC
• Improved Code for Computing a Sum s = f(A[0]) + … + f(A[n-1])
• Caches and Scientific Computing
• Sharing: A Performance Problem
• What to Take Away?
• Creating Parallelism with Threads
• Programming with Threads
• Language Notions of Thread Creation
• Forking Posix Threads
• Posix Thread Example
• Loop Level Parallelism
• Shared Data and Threads
• Basic Types of Synchronization: Barrier
• Basic Types of Synchronization: Mutexes
• A Model Problem: Sharks and Fish
• Particle Systems
• Forces in Particle Systems
• Parallelism in External Forces
• Parallelism in Nearby Forces
• Parallelism in Far-Field Forces
• Examine Sharks and Fish code
• Extra Slides: Engineering: Intel Pentium Pro Quad; Engineering: SUN Enterprise; 60s Mainframe Multiprocessors; 70s Breakthrough: Caches; Technology Perspective; Example: Write-thru Invalidate; Write-Back/Ownership Schemes; Directory-Based Cache-Coherence; 90s Scalable, Cache-Coherent Multiprocessors; Cache Coherence and Memory Consistency; Violations of Sequential Consistency; Sufficient Conditions for Sequential Consistency; Classification for Relaxed Models; Some Current System-Centric Models; Data-Race-Free-0: Some Definitions; Cache-Coherent Shared Memory and Performance

CS 267: Shared Memory Machines
Programming Example: Sharks and Fish
James Demmel ([email protected])
www.cs.berkeley.edu/~demmel/cs267_Spr05
01/31/2005, CS267 Lecture 4
Basic Shared Memory Architecture
• Processors are all connected to a large shared memory
• Where are the caches?
• Now take a closer look at structure, costs, limits, and programming
(Figure: processors P1 … Pn connected through an interconnect to a shared memory)

Outline
• Evolution of hardware and software
  • CPUs are getting exponentially faster than the memory they share
  • Hardware evolves to try to match speeds
  • Program semantics evolve too: programs change from correct to buggy, unless programmed carefully
  • Performance evolves as well: well-tuned programs today may be inefficient tomorrow
• Goal: teach a programming style likely to stay correct, if not always as efficient as possible
  • Use locks to avoid race conditions
  • Current research seeks the best of both worlds
• Example: Sharks and Fish (part of the next homework)

Processor-DRAM Gap (latency)
(Chart: performance vs. time, 1980–2000 — µProc performance improves ~60%/yr ("Moore's Law") while DRAM improves ~7%/yr, so the processor-memory performance gap grows about 50% per year)

Shared Memory Code for Computing a Sum s = f(A[0]) + f(A[1])

    static int s = 0;

    Thread 0: s = s + f(A[0])
    Thread 1: s = s + f(A[1])

• Might get f(A[0]) + f(A[1]), or only f(A[0]), or only f(A[1])
• The problem is a race condition on the variable s
• A race condition or data race occurs when:
  • two processors (or two threads) access the same variable, and at least one does a write
  • the accesses are concurrent (not synchronized), so they could happen simultaneously

Approaches to Building Parallel Machines
(Figure: three organizations, in order of increasing scale — shared cache: P1 … Pn through a switch to a shared first-level cache and interleaved main memory; centralized memory (UMA = Uniform Memory Access): each processor with its own cache on an interconnection network to shared memory; distributed memory (NUMA = Non-UMA): each processor with its own cache and local memory on an interconnection network)

Shared Cache: Advantages and Disadvantages
Advantages
• Cache placement identical to a single cache: only one copy of any cached block
• Can't have different values for the same memory location in different caches
• Fine-grained sharing is possible
• Good interference: one processor may prefetch data for another, and threads can share data within a line without moving the line
Disadvantages
• Bandwidth limitation
• Bad interference: one processor may flush another processor's data

Evolution of Shared Cache
• Alliant FX-8 (early 1980s): eight 68020s with a crossbar to a 512 KB interleaved cache
• Encore & Sequent (1980s): first 32-bit micros (NS32032), two to a board with a shared cache
• Disappeared for a while, and then …
  • Cray X1 shares the L3 cache
  • IBM Power 4 and Power 5 share the L2 cache
• If the switch and cache are on chip, there may be enough bandwidth again

Intuitive Memory Model
• Reading an address should return the last value written to that address
• Easy in uniprocessors, except for I/O
• The cache coherence problem in multiprocessors is more pervasive and more performance-critical
• More formally, this is called sequential consistency:
  "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." [Lamport, 1979]

Sequential Consistency Intuition
• Sequential consistency says the machine behaves as if it does the following:
(Figure: P0, P1, P2, P3 sharing a single memory, serviced one operation at a time)

Memory Consistency Semantics
What does this imply about program behavior?
• No process ever sees "garbage" values, i.e., half of each of two values
• Processors always see values written by some processor
• The value seen is constrained by program order on all processors
• Time always moves forward
• Example: spin lock
  • P1 writes data = 1, then writes flag = 1
  • P2 waits until flag = 1, then reads data

    initially: flag = 0, data = 0

    P1:          P2:
    data = 1     10: if flag = 0, goto 10
    flag = 1     … = data

• If P2 sees the new value of flag (= 1), it must also see the new value of data (= 1)
• If P2 reads flag and then reads data, the possible outcomes are:

    If P2 reads flag    Then P2 may read data
    0                   0
    0                   1
    1                   1

If Caches are Not "Coherent"
• Coherence means different copies of the same memory location agree on its value


View Full Document
