CMSC 411 Computer Systems Architecture
Lecture 23
Multiprocessors, cont.

Alan Sussman
als@cs.umd.edu

CMSC 411 - 23 (some from Patterson, Sussman, others)

Administrivia
- Finish reading Chapter 4
- Homework 6 posted, due next Tuesday
- Exam 2 answers, questions
- Cache simulator project issues
- Course evaluations open at http://www.CourseEvalUM.umd.edu

Snooping Write-Back Protocol

Complications: Write Races
- Cannot update cache until bus is obtained
  - Otherwise, another processor may get bus first, and then write the same cache block
- Two-step process:
  - Arbitrate for bus
  - Place miss on bus and complete operation
- If miss occurs to block while waiting for bus, handle miss (invalidate may be needed) and then restart
- Split-transaction bus:
  - Bus transaction is not atomic: can have multiple outstanding transactions for a block
  - Multiple misses can interleave, allowing two caches to grab block in the Exclusive state
  - Must track and prevent multiple misses for one block

Limitations in Symmetric Shared-Memory Multiprocessors and Snooping Protocols
- Single memory to accommodate all CPUs => multiple memory banks
- Bus must support both coherence traffic and normal memory traffic => multiple buses or interconnection networks (crossbar or small point-to-point)
- Example: AMD Opteron
  - Memory connected directly to each dual-core chip
  - Point-to-point connections for up to 4 chips
  - Remote memory and local memory latency are similar, allowing OS to view Opteron as a uniform memory access (UMA) computer

Performance of Symmetric Shared-Memory MPs
- Cache performance is a combination of
  1. Uniprocessor cache miss traffic
  2. Traffic caused by communication
     - Results in invalidations and subsequent cache misses
- 4th C: coherency miss
  - Joins Compulsory, Capacity, Conflict

Coherency Misses
1. True sharing misses arise from the communication of data through the cache coherence mechanism
   - Invalidates due to 1st write to shared block
   - Reads by another CPU of modified block in different cache
   - Miss would still occur if block size were 1 word
2. False sharing misses occur when a block is invalidated because some word in the block, other than the one being read, is written into
   - Invalidation does not cause a new value to be communicated, but only causes an extra cache miss
   - Block is shared, but no word in block is actually shared => miss would not occur if block size were 1 word

Example: True v. False Sharing v. Hit
- Assume x1 and x2 in same cache block
- P1 and P2 both read x1 and x2 before

Time  P1        P2        True, False, Hit? Why?
1     Write x1            True miss; invalidate x1 in P2
2               Read x2   False miss; x1 irrelevant to P2
3     Write x1            False miss; x1 irrelevant to P2
4               Write x2  False miss; x1 irrelevant to P2
5     Read x2             True miss; invalidate x2 in P1

MP Performance: 4-Processor Commercial Workload (OLTP, Decision Support (Database), Search Engine)
- Uniprocessor cache misses improve with cache size increase (Instruction, Capacity/Conflict, Compulsory)
- True sharing and false sharing unchanged going from 1 MB to 8 MB (L3 cache)
[Figure: memory cycles per instruction vs. cache size (1 MB, 2 MB, 4 MB, 8 MB); components: Instruction, Capacity/Conflict, Cold, False Sharing, True Sharing]

MP Performance: 2MB Cache Commercial Workload (OLTP, Decision Support (Database), Search Engine)
- True sharing, false sharing increase going from 1 to 8 CPUs
[Figure: miss components vs. processor count (1, 2, 4, 6, 8); components: Instruction, Capacity/Conflict, Cold, False Sharing, True Sharing]

Outline
- Review
- Coherence
- Write Consistency
- Administrivia
- Snooping
- Building Blocks
- Snooping protocols and examples
- Coherence traffic and Performance on MP
- Directory-based protocols and examples
- Conclusion

A Cache Coherent System Must:
- Provide set of states, state transition diagram, and actions
- Manage coherence protocol
  (0) Determine when to invoke coherence protocol
  (a) Find info about other copies to determine action: whether need to communicate with other cached copies
  (b) Locate the other copies
  (c) Communicate with those copies (invalidate/update)
- (0) is done the same way on all systems
  - state of the line is maintained in the cache
  - protocol is invoked if an access fault occurs on the line
- Different approaches distinguished by (a) to (c)

Bus-based Coherence
- All of (a), (b), (c) done through broadcast on bus
  - faulting processor sends out a search
  - others respond to the search probe and take necessary action
- Could do it in scalable network too
  - broadcast to all processors, and let them respond
- Conceptually simple, but broadcast doesn't scale with p (number of processors)
  - on bus, bus bandwidth doesn't scale
  - on scalable network, every fault leads to at least p network transactions
- Scalable coherence:
  - can have same cache states and state transition diagram
  - different mechanisms to manage protocol

Scalable Approach: Directories
- Every memory block has associated directory information
  - keeps track of copies of cached blocks and their states
  - on a miss, find directory entry, look it up, and communicate only with the nodes that have copies, if necessary
  - in scalable networks, communication with directory and copies is through network transactions
- Many alternatives for organizing directory information

Picture To Keep In Mind
[Figure: processors P1, P2 with memories Mem1, Mem2]
- Cost structure: cache < local memory < remote memory

Basic Operation of Directory
[Figure: processors (P) with caches, connected by an interconnection network to memory and a directory holding presence bits and a dirty bit]
- k processors
- With each cache block in memory: k presence bits, 1 dirty bit
- With each cache block in cache: 1 valid bit, and 1 dirty (owner) bit
- Read from main memory by processor i:
  - If dirty bit OFF then { read from main memory; turn p[i] ON; }
  - If dirty bit ON then { recall line from dirty proc (cache state to shared); update memory; turn dirty bit OFF; turn p[i] ON; supply recalled data to i; }
- Write to main memory by processor i:
  - If dirty bit OFF then { supply data to i; send ...
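The read case of the directory operation above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the names `DirectoryEntry`, `read_miss`, and the dict-based `caches` list are hypothetical, and the sketch assumes that while the dirty bit is ON exactly one presence bit is set (the owner). It covers reads only.

```python
from dataclasses import dataclass


@dataclass
class DirectoryEntry:
    """Directory state for one memory block: k presence bits + 1 dirty bit."""
    presence: list          # presence[j] is True if processor j holds a copy
    dirty: bool = False     # True if some cache holds a modified copy
    memory_value: int = 0   # contents of the block in main memory


def read_miss(entry, i, caches):
    """Read of the block by processor i.

    caches[j] is a hypothetical per-processor line:
    {"valid": bool, "dirty": bool, "value": ...}.
    """
    if entry.dirty:
        # Assumption: a dirty block has exactly one owner.
        owner = entry.presence.index(True)
        line = caches[owner]
        line["dirty"] = False                # owner's cache state goes to shared
        entry.memory_value = line["value"]   # update memory with recalled line
        entry.dirty = False                  # turn dirty bit OFF
    entry.presence[i] = True                 # turn p[i] ON
    # Supply the (possibly recalled) data to processor i.
    caches[i].update(valid=True, dirty=False, value=entry.memory_value)
    return entry.memory_value
```

With the dirty bit OFF the function only reads memory and sets p[i]; with it ON, the owner's line is recalled and written back first, so the reader always receives the most recent value.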