Scalable Cache Coherent Systems

Scalable distributed shared memory machines. Assumptions:
  - Processor-cache-memory nodes connected by a scalable network.
  - Distributed shared physical address space.
  - The communication assist must interpret network transactions, forming the shared address space.

For a system with a shared physical address space:
  - A cache miss must be satisfied transparently from local or remote memory, depending on the address.
  - By its normal operation, a cache replicates data locally, resulting in a potential cache coherence problem between local and remote copies of data.
  - A coherency solution must be in place for correct operation.

Standard snoopy protocols studied earlier may not apply, for lack of a bus or a broadcast medium to snoop on.

For this type of system to be scalable, in addition to latency and bandwidth scalability, the cache coherence protocol or solution used must also scale.

EECC756 - Shaaban, lec 13, Spring 2000, 4-27-2000

Functionality Expected in a Cache Coherent System

  - Provide a set of states, a state transition diagram, and actions representing the cache coherence protocol used.
  - Manage the coherence protocol:
      (0) Determine when to invoke the coherence protocol.
      (a) Find the source of information about the state of the cache line in other caches (whether there is a need to communicate with other cached copies).
      (b) Find out the location or locations of other copies, if any.
      (c) Communicate with those copies (invalidate/update).
  - (0) is done the same way on all cache coherent systems:
      - The state of the local cache line is maintained in the cache.
      - The protocol is invoked if an "access fault" occurs on the line.
  - Different approaches are distinguished by (a) to (c).

Bus-Based Coherence

  - All of (a), (b), (c) are done through broadcast on the bus:
      - The faulting processor sends out a "search".
      - Others respond to the search probe and take the necessary action.
  - This approach could be used in a scalable network too:
      - Broadcast to all processors, and let them respond.
  - Conceptually simple, but broadcast doesn't scale with p:
      - Bus: bandwidth doesn't scale.
      - On a scalable network (e.g. MINs), every fault may lead to at least p network transactions.

Scalable Cache Coherence

  - A scalable cache coherence approach may have cache line states and state transition diagrams similar to those in bus-based coherence protocols.
  - However, different additional mechanisms, other than broadcasting, must be devised to manage the coherence protocol.
  - Possible approaches:
      - Approach #1: Hierarchical snooping.
      - Approach #2: Directory-based cache coherence.
      - Approach #3: A combination of the above two approaches.

Approach #1: Hierarchical Snooping

  - Extend the snooping approach: a hierarchy of broadcast media.
      - Tree of buses or rings (KSR-1).
      - Processors are in the bus- or ring-based multiprocessors at the leaves.
      - Parents and children are connected by two-way snoopy interfaces, which snoop both buses and propagate relevant transactions.
      - Main memory may be centralized at the root or distributed among the leaves.
  - Issues (a) to (c) are handled similarly to a bus, but without full broadcast:
      - The faulting processor sends out a "search" bus transaction on its bus.
      - The transaction propagates up and down the hierarchy based on snoop results.
  - Problems:
      - High latency: multiple levels, and a snoop/lookup at every level.
      - Bandwidth bottleneck at the root.
  - This approach has, for the most part, been abandoned.

Hierarchical Snoopy Cache Coherence

  - Simplest way: a hierarchy of buses (or rings), with snoopy coherence at each level.
  - Consider buses. Two possibilities:
      (a) All main memory at the global (B2) bus.
      (b) Main memory distributed among the clusters.

  [Figure: (a) processors P with L1 caches on B1 buses, each B1 joined through an L2 cache to the global B2 bus, which holds main memory (Mp); (b) the same hierarchy with a memory attached to each B1 cluster.]

Bus Hierarchies with Centralized Memory

  [Figure: processors P with L1 caches on two B1 buses, each connected through an L2 cache to the B2 bus and main memory (Mp).]

  - B1 follows the standard snoopy protocol.
  - Need a monitor per B1 bus:
      - Decides what transactions to pass back and forth between buses.
      - Acts as a filter to reduce bandwidth needs.
  - Use an L2 cache:
      - Much larger than the L1 caches (set associative). Must maintain inclusion.
      - Has a dirty-but-stale bit per line.
      - The L2 cache can be DRAM based, since fewer references get to it.

Bus Hierarchies with Centralized Memory: Advantages and Disadvantages

  - Advantages:
      - Simple extension of the bus-based scheme.
      - Misses to main memory require a single traversal to the root of the hierarchy.
      - Placement of shared data is not an issue.
  - Disadvantages:
      - Misses to local data (e.g. stack) also traverse the hierarchy, giving higher traffic and latency.
      - Memory at the global bus must be highly interleaved for bandwidth.

Bus Hierarchies with Distributed Memory

  [Figure: processors P with L1 caches on two B1 buses; each B1 has its own memory and an L2 cache connecting it to the B2 bus.]

  - Main memory is distributed among the clusters.
  - Each cluster is a full-fledged bus-based machine, memory and all.
  - Automatic scaling of memory (each cluster brings some with it).
  - Good placement can reduce global bus traffic and latency, but latency to far-away memory is larger.

Scalable Approach #2: Directories

  [Figure (a): Read miss to a block in dirty state. Nodes: requestor, directory node for the block, node with dirty copy.]
      1.  Read request to directory.
      2.  Reply with owner identity.
      3.  Read request to owner.
      4a. Data reply.
      4b. Revision message to directory.

  [Figure (b): Write miss to a block with two sharers. Nodes: requestor, directory node, two sharers.]
      1.  RdEx request to directory.
      2.  Reply with sharers' identity.
      3a. Invalidation request to sharer.
      3b. Invalidation request to sharer.
      4a. Invalidation ack.
      4b. Invalidation ack.

  - Many alternatives exist for organizing directory information.

Basic Operation of Directory

  [Figure: k processors, each with a cache, connected through an interconnection network to memory and a directory holding presence bits and a dirty bit per block.]

  - k processors.
  - With each cache block in memory: k presence bits, 1 dirty bit.
  - With each cache block in cache: 1 valid bit and 1 dirty (owner) bit.
  - Read from main memory by processor i:
      - If dirty bit OFF: read from main memory; turn p[i] ON.
      - If dirty bit ON: recall line from the dirty processor (its cache state changes to shared); update memory; turn dirty bit OFF; turn p[i] ON; supply recalled data to i.
  - Write to main memory by processor i:
      - If dirty bit OFF: supply data to i; send invalidations to all caches that have the block; turn dirty bit ON; turn p[i] ON.

Organizing Directories

  Directory schemes:
      - Centralized
      - Distributed
          How to find source of directory information:
              - Flat
              - Hierarchical
          How to locate
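
The basic directory operation described earlier (k presence bits and one dirty bit per memory block) can be sketched in a few lines of Python. This is only an illustrative model, not the lecture's implementation: the class name `Directory`, the single-block memory, and the `log` list are hypothetical. Two behaviors are assumptions not stated in the slides: invalidated sharers have their presence bits cleared, and a write that finds the dirty bit ON recalls the line from the current owner before transferring ownership.

```python
class Directory:
    """Toy model of one memory block with k presence bits and one dirty bit."""

    def __init__(self, k, initial_value=0):
        self.k = k
        self.presence = [False] * k   # p[i]: processor i holds a copy
        self.dirty = False            # ON: exactly one cache owns a modified copy
        self.memory = initial_value   # contents of the block in main memory
        self.caches = [None] * k      # cached copy per processor (None = invalid)
        self.log = []                 # network transactions, for inspection

    def read(self, i):
        """Read miss by processor i, serviced by the directory."""
        if self.dirty:
            # Recall the line from the dirty processor; its copy becomes shared.
            owner = self.presence.index(True)
            self.log.append(f"recall from P{owner}")
            self.memory = self.caches[owner]  # update memory
            self.dirty = False                # turn dirty bit OFF
        self.presence[i] = True               # turn p[i] ON
        self.caches[i] = self.memory          # supply data to i
        return self.memory

    def write(self, i, value):
        """Write miss by processor i, serviced by the directory."""
        if self.dirty:
            # ASSUMPTION (not in the slides): recall from the current owner first.
            owner = self.presence.index(True)
            if owner != i:
                self.log.append(f"recall from P{owner}")
                self.memory = self.caches[owner]
        # Send invalidations to all other caches that have the block.
        for j in range(self.k):
            if j != i and self.presence[j]:
                self.log.append(f"invalidate P{j}")
                self.caches[j] = None
                self.presence[j] = False  # ASSUMPTION: clear the sharer's bit
        self.dirty = True                 # turn dirty bit ON
        self.presence[i] = True           # turn p[i] ON
        self.caches[i] = value            # processor i now owns the new value


if __name__ == "__main__":
    d = Directory(k=4, initial_value=7)
    d.read(0)
    d.read(1)
    d.write(2, 99)   # invalidates P0 and P1
    d.read(3)        # recalls the dirty line from P2
    print(d.presence, d.dirty, d.memory)
    print(d.log)
```

Note that each invalidation or recall is a point-to-point message to a node named by the presence bits, which is what lets this scheme avoid the broadcast that bus-based snooping relies on.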