
Scalable Cache Coherent Systems

- Scalable distributed shared memory machines. Assumptions:
  - Processor-Cache-Memory nodes connected by a scalable network.
  - Distributed shared physical address space.
  - Communication assist must interpret network transactions, forming the shared address space.
- For a system with a shared physical address space:
  - A cache miss must be satisfied transparently from local or remote memory, depending on the address.
  - By its normal operation, a cache replicates data locally, resulting in a potential cache coherence problem between local and remote copies of data.
  - A coherency solution must be in place for correct operation.
- Standard snoopy protocols studied earlier do not apply, for lack of a bus or a broadcast medium to snoop on.
- For this type of system to be scalable, in addition to latency and bandwidth scalability, the cache coherence protocol or solution used must also scale.

EECC756 Shaaban, lec 14, Spring 2002, 5-9-2002

Functionality Expected In A Cache Coherent System

- Provide a set of states, a state transition diagram, and actions representing the cache coherence protocol used.
- Manage the coherence protocol:
  (0) Determine when to invoke the coherence protocol.
  (a) Find the source of information about the state of the cache line in other caches, i.e. whether there is a need to communicate with other cached copies.
  (b) Find out the location or locations of other copies, if any.
  (c) Communicate with those copies (invalidate/update).
- (0) is done the same way on all cache coherent systems:
  - State of the local cache line is maintained in the cache.
  - Protocol is invoked if an "access fault" occurs on the line.
- Different approaches are distinguished by (a) to (c).

Bus-Based Coherence

- All of (a), (b), (c) are done through broadcast on the bus:
  - Faulting processor sends out a "search".
  - Others respond to the search probe and take necessary action.
- This approach could be done in a scalable network too: broadcast to all processors and let them respond.
- Conceptually simple, but broadcast doesn't scale with p, the number of processors:
  - Bus bandwidth doesn't scale.
  - On a scalable network (e.g. MINs), every fault may lead to at least p network transactions.

Scalable Cache Coherence

- A scalable cache coherence approach may have similar cache line states and state transition diagrams as in bus-based coherence protocols.
- However, different additional mechanisms other than broadcasting must be devised to manage the coherence protocol.
- Possible approaches:
  - Approach 1: Hierarchical Snooping.
  - Approach 2: Directory-based cache coherence.
  - Approach 3: A combination of the above two approaches.

Approach 1: Hierarchical Snooping

- Extend the snooping approach: a hierarchy of broadcast media.
  - Tree of buses or rings (KSR-1).
  - Processors are in the bus- or ring-based multiprocessors at the leaves.
  - Parents and children are connected by two-way snoopy interfaces, which snoop both buses and propagate relevant transactions.
  - Main memory may be centralized at the root or distributed among the leaves.
- Issues (a)-(c) are handled similarly to a bus, but without full broadcast:
  - Faulting processor sends out a "search" bus transaction on its bus.
  - The transaction propagates up and down the hierarchy based on snoop results.
- Problems:
  - High latency: multiple levels, and a snoop/lookup at every level.
  - Bandwidth bottleneck at the root.
- This approach has, for the most part, been abandoned.

Hierarchical Snoopy Cache Coherence

- Simplest way: a hierarchy of buses (or rings), with snoopy coherence at each level.
- Consider buses. Two possibilities:
  (a) All main memory at the global (B2) bus.
  (b) Main memory distributed among the clusters.

[Figure: panels (a) and (b); labels P, L1, B1, L2, B2, Memory, Main Memory (Mp)]

Bus Hierarchies with Centralized Memory

[Figure: labels P, L1, B1, L2, B2, Main Memory (Mp)]

- B1 follows the standard snoopy protocol.
- Need a monitor per B1 bus:
  - Decides what transactions to pass back and forth between buses.
  - Acts as a filter to reduce bandwidth needs.
- Use an L2 cache:
  - Much larger than L1 caches (set associative). Must maintain inclusion.
  - Has a dirty-but-stale bit per line.
  - L2 cache can be DRAM based, since fewer references get to it.

Bus Hierarchies with Centralized Memory: Advantages and Disadvantages

- Advantages:
  - Simple extension of the bus-based scheme.
  - Misses to main memory require a single traversal to the root of the hierarchy.
  - Placement of shared data is not an issue.
- Disadvantages:
  - Misses to local data (e.g. stack) also traverse the hierarchy, giving higher traffic and latency.
  - Memory at the global bus must be highly interleaved for bandwidth.

Bus Hierarchies with Distributed Memory

[Figure: labels P, L1, B1, L2, B2, Memory]

- Main memory is distributed among clusters.
- A cluster is a full-fledged bus-based machine, memory and all.
- Automatic scaling of memory (each cluster brings some with it).
- Good placement can reduce global bus traffic and latency, but latency to far-away memory is larger.

Scalable Approach 2: Directories

- A directory is composed of a number of directory entries.
- Every memory block has an associated directory entry:
  - Keeps track of the nodes or processors that have cached copies of the memory block, and their states.
  - On a miss, find the directory entry, look it up, and communicate only with the nodes that have copies, if necessary.
  - In scalable networks, communication with the directory and with nodes that have copies is through network transactions.
- Many alternatives exist for organizing directory information.

Organizing Directories

Directory Schemes
  - Centralized
  - Distributed (how to find the source of directory information)
    - Flat (how to locate copies)
      - Memory-based: Full-Map (Full Bit Vector), Limited Directory
      - Cache-based (chained directories): Singly Linked chain, Doubly Linked chain
    - Hierarchical

Basic Operation of Centralized Directory

[Figure: labels P, Cache, Interconnection Network, Memory, Directory, presence bits, dirty bit]

- Both memory and directory are centralized.
- P processors.
- With each cache block in memory: P presence bits p[i], 1 dirty bit.
- With each cache block in cache: 1 valid bit, and 1 dirty (owner) bit.
- Read from main memory (read miss) by processor i:
  - If dirty bit OFF: read from main memory; turn p[i] ON.
  - If dirty bit ON: recall line from dirty processor j (cache state to shared); update memory; turn dirty bit OFF; turn p[i] ON; supply recalled data to i.
- Write miss to main memory by
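
The read-miss steps listed above for the centralized directory can be sketched in a few lines of Python. This is only an illustrative model, not the lecture's implementation: the names CentralDirectory, DirectoryEntry, read_miss, and install_dirty_for_demo are hypothetical, caches are modeled as plain dictionaries, and the sketch assumes that a set dirty bit implies exactly one presence bit is ON (the dirty owner j). Only the read-miss case is covered.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DirectoryEntry:
    presence: List[bool]   # p[i]: ON if processor i holds a copy
    dirty: bool = False    # ON if exactly one cache holds a modified copy


class CentralDirectory:
    """Toy model: one centralized memory plus one directory entry per block."""

    def __init__(self, num_procs: int, num_blocks: int) -> None:
        self.memory = [0] * num_blocks
        self.entries = [DirectoryEntry([False] * num_procs)
                        for _ in range(num_blocks)]
        # caches[i] maps block -> (value, dirty_owner_bit); presence of the
        # key stands in for the valid bit.
        self.caches = [dict() for _ in range(num_procs)]

    def install_dirty_for_demo(self, j: int, block: int, value: int) -> None:
        # Setup shortcut for the example only (assumption): places a modified
        # copy in cache j directly. It is NOT the write-miss protocol.
        entry = self.entries[block]
        entry.presence = [k == j for k in range(len(entry.presence))]
        entry.dirty = True
        self.caches[j][block] = (value, True)

    def read_miss(self, i: int, block: int) -> int:
        entry = self.entries[block]
        if entry.dirty:
            # Recall the line from dirty processor j; its state becomes shared.
            j = entry.presence.index(True)   # assumed: single owner when dirty
            value, _ = self.caches[j][block]
            self.caches[j][block] = (value, False)
            self.memory[block] = value       # update memory
            entry.dirty = False              # turn dirty bit OFF
        entry.presence[i] = True             # turn p[i] ON
        value = self.memory[block]           # supply data to processor i
        self.caches[i][block] = (value, False)
        return value
```

For example, after install_dirty_for_demo(2, 5, 99), a call to read_miss(1, 5) recalls the line from processor 2, writes 99 back to memory, clears the dirty bit, and leaves both p[1] and p[2] ON.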



RIT EECC 756 - Scalable Cache Coherent Systems
