Lecture 13 – Snooping Cache and Directory Based Multiprocessors

EECS 252 Graduate Computer Architecture
Lec 13 – Snooping Cache and Directory Based Multiprocessors

David Patterson
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~pattrsn
http://vlsi.cs.berkeley.edu/cs252-s06

Review
• 1 instruction operates on vectors of data
• Vector loads get data from memory into big register files, operate, and then vector store
• E.g., indexed load and store for sparse matrices
• Easy to add vector to a commodity instruction set
  – E.g., morph SIMD into vector
• Vector is a very efficient architecture for vectorizable codes, including multimedia and many scientific codes
• “End” of uniprocessor speedup ⇒ multiprocessors
• Parallelism challenges: % parallelizable, long latency to remote memory
• Centralized vs. distributed memory
  – Small MP vs. lower latency, larger BW for larger MP
• Message passing vs. shared address
  – Uniform access time vs. non-uniform access time

Outline
• Review
• Coherence
• Write Consistency
• Administrivia
• Snooping
• Building Blocks
• Snooping protocols and examples
• Coherence traffic and performance on MP
• Directory-based protocols and examples (if we get this far)
• Conclusion

Challenges of Parallel Processing
1. Application parallelism ⇒ primarily via new algorithms that have better parallel performance
2. Long remote latency impact ⇒ addressed both by the architect and by the programmer
• For example, reduce the frequency of remote accesses either by
  – Caching shared data (HW)
  – Restructuring the data layout to make more accesses local (SW)
• Today’s lecture is on HW to help latency via caches

Symmetric Shared-Memory Architectures
• From multiple boards on a shared bus to multiple processors inside a single chip
• Caches hold both
  – Private data, used by a single processor
  – Shared data, used by multiple processors
• Caching shared data
  ⇒ reduces latency to shared data, memory bandwidth for shared data, and interconnect bandwidth
  ⇒ cache coherence problem

Example Cache Coherence Problem
[Figure: processors P1, P2, and P3, each with a private cache, share a bus with memory and I/O devices. Memory holds u:5. Events: (1) P1 reads u and caches 5, (2) P3 reads u and caches 5, (3) P3 writes u = 7, (4) P1 reads u = ?, (5) P2 reads u = ?]
– Processors see different values for u after event 3
– With write-back caches, the value written back to memory depends on the happenstance of which cache flushes or writes back its value, and when
  » Processes accessing main memory may see a very stale value
– Unacceptable for programming, and it’s frequent!

Example
  P1                          P2
      /* Assume initial values of A and flag are 0 */
  A = 1;                      while (flag == 0);  /* spin idly */
  flag = 1;                   print A;
• Intuition not guaranteed by coherence
• We expect memory to respect the order between accesses to different locations issued by a given process
  – and to preserve the order among accesses to the same location by different processes
• Coherence is not enough!
  – it pertains only to a single location
  – (a runnable C sketch of this pattern appears after the next slide)
[Figure: conceptual picture of processors P1 … Pn sharing a single memory]

Intuitive Memory Model
[Figure: a processor P with L1 and L2 caches, main memory, and disk; address 100 holds 67 in L1, 35 in memory, and 34 on disk]
• Too vague and simplistic; 2 issues
  1. Coherence defines the values returned by a read
  2. Consistency determines when a written value will be returned by a read
• Coherence defines behavior to the same location; consistency defines behavior to other locations
• Reading an address should return the last value written to that address
  – Easy in uniprocessors, except for I/O
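Below is a minimal runnable sketch (not part of the original lecture) of the A/flag example from the “Coherence is not enough!” slide, written in C11. The names producer, consumer, shared_A, and flag are illustrative assumptions, and the code assumes a toolchain that provides <stdatomic.h> and <threads.h>. With release/acquire ordering on the flag, a consumer that observes flag == 1 is also guaranteed to observe A == 1, which is exactly the cross-location ordering that coherence alone does not provide.

/* Hedged sketch of the slide's A/flag idiom using C11 atomics and threads. */
#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

static int shared_A = 0;        /* plain data written before the flag is raised */
static atomic_int flag = 0;     /* synchronization flag (initially 0, as on the slide) */

static int producer(void *arg)  /* plays the role of P1 on the slide */
{
    (void)arg;
    shared_A = 1;                                            /* A = 1    */
    atomic_store_explicit(&flag, 1, memory_order_release);   /* flag = 1 */
    return 0;
}

static int consumer(void *arg)  /* plays the role of P2 on the slide */
{
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                               /* spin idly until the flag is set */
    printf("A = %d\n", shared_A);       /* prints 1 under acquire/release  */
    return 0;
}

int main(void)
{
    thrd_t p, c;
    thrd_create(&p, producer, NULL);
    thrd_create(&c, consumer, NULL);
    thrd_join(p, NULL);
    thrd_join(c, NULL);
    return 0;
}

If the flag store and load were relaxed (or plain non-atomic accesses), the compiler and hardware would be free to reorder them with the accesses to shared_A, and the consumer could print 0: this is the intuition the slide warns is not guaranteed by coherence alone.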
Defining Coherent Memory System
1. Preserve program order: a read by processor P to location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P
2. Coherent view of memory: a read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses
3. Write serialization: 2 writes to the same location by any 2 processors are seen in the same order by all processors
  – If not, a processor could keep value 1 since it saw it as the last write
  – For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1 (a small checking sketch of this rule appears after the Administrivia slide)

Write Consistency
• For now assume
  1. A write does not complete (and allow the next write to occur) until all processors have seen the effect of that write
  2. The processor does not change the order of any write with respect to any other memory access
  ⇒ if a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A
• These restrictions allow the processor to reorder reads, but force it to finish writes in program order

Basic Schemes for Enforcing Coherence
• A program running on multiple processors will normally have copies of the same data in several caches
  – Unlike I/O, where it’s rare
• Rather than trying to avoid sharing in SW, SMPs use a HW protocol to maintain coherent caches
  – Migration and replication are key to the performance of shared data
• Migration – data can be moved to a local cache and used there in a transparent fashion
  – Reduces both the latency to access shared data that is allocated remotely and the bandwidth demand on the shared memory
• Replication – for shared data being read simultaneously, since caches make a copy of the data in the local cache
  – Reduces both latency of access and contention for read-shared data

CS 252 Administrivia
• Monday March 20: Quiz, 5–8 PM, 405 Soda
• Due Friday: problem set and comments on 2 papers
  – Problem set assignment done in pairs
  – Gene Amdahl, “Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities,” AFIPS Conference Proceedings, (30), pp. 483–485, 1967.
  – Lorin Hochstein et al., “Parallel Programmer Productivity: A Case Study of Novice Parallel Programmers,” International Conference for High Performance Computing, Networking and Storage (SC’05), November 2005
• Be sure to comment
  – Amdahl: How long is the paper? How much of it is Amdahl’s Law? What other comments about parallelism besides Amdahl’s Law?
  – Hochstein: What programming styles were investigated? What was the methodology? How would you redesign the experiment they did?
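Below is a small sketch (not from the lecture) illustrating the write-serialization rule on the “Defining Coherent Memory System” slide: if the values 1 and then 2 are written to a location, no processor may read 2 and then later read 1. The global write order, the per-processor observation traces, and the helper check_serialized are made-up illustrations; the helper checks only this single property.

/* Hedged sketch: check per-processor read traces against one global write order. */
#include <stdio.h>

#define NWRITES 2
#define NPROCS  3
#define NOBS    3

/* Global order in which values were written to location X: first 1, then 2. */
static const int write_order[NWRITES] = { 1, 2 };

/* Position of a value in the global write order, or -1 if it was never written. */
static int position(int value)
{
    for (int i = 0; i < NWRITES; i++)
        if (write_order[i] == value)
            return i;
    return -1;
}

/* A trace satisfies write serialization if the writes it observes never move
 * backwards in the global order: once a processor has seen 2, it may not see 1. */
static int check_serialized(const int *trace, int n)
{
    int last = -1;
    for (int i = 0; i < n; i++) {
        int p = position(trace[i]);
        if (p < last)
            return 0;   /* observed an older write after a newer one: violation */
        if (p > last)
            last = p;
    }
    return 1;
}

int main(void)
{
    /* Hypothetical observation traces for three processors; the third reads
     * 2 and then 1, which the write-serialization rule forbids.            */
    const int traces[NPROCS][NOBS] = {
        { 1, 1, 2 },
        { 1, 2, 2 },
        { 2, 1, 2 },
    };
    for (int p = 0; p < NPROCS; p++)
        printf("P%d trace: %s\n", p + 1,
               check_serialized(traces[p], NOBS) ? "ok" : "violates write serialization");
    return 0;
}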

