UMD CMSC 411 - Lecture 23 Multiprocessors

CMSC 411 Computer Systems Architecture
Lecture 23: Multiprocessors, cont.
Alan Sussman
(some slides from Patterson, Sussman, others)

Announcements
• Finish reading Chapter 4
• Homework 6 posted, due next Tuesday
• Exam 2 answers - questions?
• Cache simulator project issues?
• Course evaluations open, at http://www.CourseEvalUM.umd.edu

SNOOPING WRITE BACK PROTOCOL

Complications: Write Races
• Cannot update the cache until the bus is obtained
  – Otherwise, another processor may get the bus first, and then write the same cache block!
• Two-step process:
  – Arbitrate for the bus
  – Place the miss on the bus and complete the operation
• If a miss occurs to the block while waiting for the bus:
  – Handle the miss (an invalidate may be needed) and then restart
• Split-transaction bus:
  – A bus transaction is not atomic: there can be multiple outstanding transactions for a block
  – Multiple misses can interleave, allowing two caches to grab the block in the Exclusive state
    » Must track and prevent multiple misses for one block

Limitations in Symmetric Shared-Memory Multiprocessors and Snooping Protocols
• A single memory must accommodate all CPUs
  ⇒ multiple memory banks
• The bus must support both coherence traffic and normal memory traffic
  ⇒ multiple buses or interconnection networks (crossbar or small point-to-point)
• Example - AMD Opteron:
  – Memory connected directly to each dual-core chip
  – Point-to-point connections for up to 4 chips
  – Remote memory and local memory latency are similar, allowing the OS to view the Opteron as a uniform memory access (UMA) computer
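The write-race handling described above (arbitrate for the bus first, then complete the operation, re-checking the block after winning arbitration) can be sketched in a few lines. This is an illustrative sketch, not from the lecture: the `SnoopyCache` class and the use of a lock to stand in for bus arbitration are assumptions.

```python
import threading

# Illustrative sketch (not the lecture's code) of the two-step write rule:
# a cache must win bus arbitration BEFORE modifying a block, and must
# re-check the block's state after acquiring the bus, since another
# processor's write may have invalidated it while this one waited.

bus = threading.Lock()  # stands in for bus arbitration

class SnoopyCache:
    def __init__(self):
        self.state = "Shared"  # per-block state: Shared / Exclusive / Invalid

    def snoop_invalidate(self):
        self.state = "Invalid"

    def write(self, others):
        with bus:                        # step 1: arbitrate for the bus
            if self.state == "Invalid":  # block invalidated while waiting?
                self.state = "Shared"    # handle the miss (re-fetch), then restart
            for c in others:             # step 2: place invalidate on the bus...
                c.snoop_invalidate()
            self.state = "Exclusive"     # ...and complete the operation

p1, p2 = SnoopyCache(), SnoopyCache()
p1.write([p2])
print(p1.state, p2.state)  # Exclusive Invalid
```

Holding the lock across both steps is what prevents the race: another processor cannot slip its own invalidate onto the bus between this cache's arbitration and its state update.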
Performance of Symmetric Shared-Memory MPs
• Cache performance is a combination of:
  1. Uniprocessor cache miss traffic
  2. Traffic caused by communication
     » Results in invalidations and subsequent cache misses
• 4th C: the coherency miss
  – Joins Compulsory, Capacity, Conflict

Coherency Misses
1. True sharing misses arise from the communication of data through the cache coherence mechanism
   • Invalidates due to the first write to a shared block
   • Reads by another CPU of a modified block in a different cache
   • The miss would still occur if the block size were 1 word
2. False sharing misses occur when a block is invalidated because some word in the block, other than the one being read, is written into
   • The invalidation does not cause a new value to be communicated, but only causes an extra cache miss
   • The block is shared, but no word in the block is actually shared
     ⇒ the miss would not occur if the block size were 1 word
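The effect of false sharing can be made concrete with a toy miss counter. This is a sketch under stated assumptions, not from the slides: the trace, the two-word memory, and the simplified MSI model (write upgrades on a Shared block are counted as coherence misses) are all illustrative. Words 0 and 1 play the roles of x1 and x2, and no word is ever truly shared, yet with a 2-word block every access misses.

```python
# Toy sketch (not from the slides): count coherence misses for one access
# trace under two block sizes. Both CPUs start with everything cached in
# the Shared state; upgrades on a Shared block count as misses.

def count_misses(trace, block_size):
    other = {"P1": "P2", "P2": "P1"}
    n_blocks = 2 // block_size
    state = {(c, b): "S" for c in ("P1", "P2") for b in range(n_blocks)}
    misses = 0
    for cpu, op, word in trace:
        b = word // block_size
        if op == "read":
            if state[(cpu, b)] == "I":
                misses += 1                       # coherence read miss
                state[(cpu, b)] = "S"
                if state[(other[cpu], b)] == "M":
                    state[(other[cpu], b)] = "S"  # owner downgrades to Shared
        else:  # write
            if state[(cpu, b)] != "M":
                misses += 1                       # write miss or upgrade
            state[(cpu, b)] = "M"
            state[(other[cpu], b)] = "I"          # invalidate the other copy
    return misses

# P1 only ever writes word 0; P2 only ever reads word 1
trace = [("P1", "write", 0), ("P2", "read", 1)] * 3

print(count_misses(trace, block_size=2))  # 6 -- every access misses (false sharing)
print(count_misses(trace, block_size=1))  # 1 -- only P1's initial upgrade remains
```

With the 2-word block the two processors ping-pong the block between them; with 1-word blocks the misses vanish, which is exactly the "miss would not occur if block size were 1 word" test on the slide.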
Example: True vs. False Sharing vs. Hit?
Assume x1 and x2 are in the same cache block, and P1 and P2 have both read x1 and x2 before.

Time | P1       | P2       | True, False, Hit? Why?
  1  | Write x1 |          | True miss; invalidate x1 in P2
  2  |          | Read x2  | False miss; x1 irrelevant to P2
  3  | Write x1 |          | False miss; x1 irrelevant to P2
  4  |          | Write x2 | False miss; x1 irrelevant to P2
  5  | Read x2  |          | True miss; invalidate x2 in P1

MP Performance, 4 Processors - Commercial Workload: OLTP, Decision Support (Database), Search Engine
[Chart: memory cycles per instruction (0 to 3.25) vs. L3 cache size (1 MB to 8 MB), broken into Instruction, Capacity/Conflict, Cold, False Sharing, and True Sharing components]
• True sharing and false sharing are unchanged going from 1 MB to 8 MB (L3 cache)
• Uniprocessor cache misses (Instruction, Capacity/Conflict, Compulsory) improve with the cache size increase

MP Performance, 2 MB Cache - Commercial Workload: OLTP, Decision Support (Database), Search Engine
[Chart: memory cycles per instruction (0 to 3) vs. processor count (1 to 8), broken into Instruction, Conflict/Capacity, Cold, False Sharing, and True Sharing components]
• True sharing and false sharing misses increase going from 1 to 8 CPUs

Outline
• Review
• Coherence
• Write Consistency
• Administrivia
• Snooping
• Building Blocks
• Snooping protocols and examples
• Coherence traffic and Performance on MP
• Directory-based protocols and examples
• Conclusion

A Cache Coherent System Must:
• Provide a set of states, a state transition diagram, and actions
• Manage the coherence protocol
  – (0) Determine when to invoke the coherence protocol
  – (a) Find info about other copies to determine the action
    » whether there is a need to communicate with other cached copies
  – (b) Locate the other copies
  – (c) Communicate with those copies (invalidate/update)
• (0) is done the same way on all systems
  – the state of the line is maintained in the cache
  – the protocol is invoked if an "access fault" occurs on the line
• Different approaches are distinguished by (a) to (c)
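The "set of states, state transition diagram, and actions" above is often written down as a lookup table. The following is a minimal, hedged sketch of a processor-side MSI table with snooped bus events; the event names (`BusRd`, `BusRdX`, `snoop_*`) and the table itself are illustrative, not the lecture's exact protocol.

```python
# Minimal MSI sketch (illustrative): (current_state, event) -> (next_state,
# bus_action). "read"/"write" come from the local processor; "snoop_*"
# events are observed on the bus from other caches.
MSI = {
    ("I", "read"):         ("S", "BusRd"),   # read miss: fetch a shared copy
    ("I", "write"):        ("M", "BusRdX"),  # write miss: fetch + invalidate others
    ("S", "read"):         ("S", None),      # read hit
    ("S", "write"):        ("M", "BusRdX"),  # upgrade: invalidate other sharers
    ("M", "read"):         ("M", None),      # hits in Modified
    ("M", "write"):        ("M", None),
    ("S", "snoop_BusRdX"): ("I", None),      # someone else is writing
    ("M", "snoop_BusRd"):  ("S", "Flush"),   # supply dirty data, keep a shared copy
    ("M", "snoop_BusRdX"): ("I", "Flush"),   # supply dirty data, drop the copy
}

def step(state, event):
    # events with no table entry leave the line unchanged
    return MSI.get((state, event), (state, None))

print(step("S", "write"))        # ('M', 'BusRdX')
print(step("M", "snoop_BusRd"))  # ('S', 'Flush')
```

The point of the table form is exactly what the slide asks for: the states and transitions are data, and steps (a) through (c) become the `bus_action` column.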
Bus-Based Coherence
• All of (a), (b), (c) are done through broadcast on the bus
  – the faulting processor sends out a "search"
  – the others respond to the search probe and take the necessary action
• Could do it in a scalable network too
  – broadcast to all processors, and let them respond
• Conceptually simple, but broadcast doesn't scale with p (the number of processors)
  – on a bus, the bus bandwidth doesn't scale
  – on a scalable network, every fault leads to at least p network transactions
• Scalable coherence:
  – can have the same cache states and state transition diagram
  – different mechanisms to manage the protocol

Scalable Approach: Directories
• Every memory block has associated directory information
  – keeps track of copies of cached blocks and their states
  – on a miss, find the directory entry, look it up, and communicate only with the nodes that have copies
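The directory idea on the last slide can be sketched as one entry per memory block, holding the block's state plus a presence-bit vector of sharers; on a miss, the home node consults the entry and messages only the actual sharers instead of broadcasting. This is a hedged sketch: the class, field, and method names are illustrative, not from the lecture.

```python
# Illustrative directory-entry sketch: state + presence bits per memory
# block, so coherence traffic goes only to caches that hold a copy.

class DirectoryEntry:
    def __init__(self, n_procs):
        self.state = "Uncached"            # Uncached / Shared / Modified
        self.presence = [False] * n_procs  # one bit per processor cache

    def sharers(self):
        return [i for i, bit in enumerate(self.presence) if bit]

    def read_miss(self, proc):
        # (if state were Modified, the directory would first fetch the
        #  dirty data from the single owner in self.sharers())
        self.presence[proc] = True
        self.state = "Shared"

    def write_miss(self, proc):
        to_invalidate = [i for i in self.sharers() if i != proc]
        self.presence = [False] * len(self.presence)
        self.presence[proc] = True
        self.state = "Modified"
        return to_invalidate  # invalidates go only to these nodes, not broadcast

d = DirectoryEntry(4)
d.read_miss(0)
d.read_miss(2)
print(d.write_miss(1))  # [0, 2] -- only the actual sharers are invalidated
```

Compare with the bus-based scheme above: the per-miss cost here is proportional to the number of sharers of that one block, not to p, which is what makes the approach scalable.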

