Cache Coherence in Bus-Based Shared Memory Multiprocessors

- Shared Memory Multiprocessors Variations
- Cache Coherence in Shared Memory Multiprocessors
  - A Coherent Memory System: Intuition
  - Formal Definition of Coherence
- Cache Coherence Approaches
- Bus-Snooping Cache Coherence Protocols
  - Write-invalidate Bus-Snooping Protocol for Write-Through Caches
  - Write-invalidate Bus-Snooping Protocol for Write-Back Caches
    - MSI Write-Back Invalidate Protocol
    - MESI Write-Back Invalidate Protocol
  - Write-update Bus-Snooping Protocol for Write-Back Caches
    - Dragon Write-Back Update Protocol

(PCA Chapter 5)
EECC756 - Shaaban, lec # 10, Spring 2008, 5-6-2008

Shared Memory Multiprocessors

- Direct support in hardware of the shared address space (SAS) parallel programming model: address translation and protection in hardware (hardware SAS).
- Any processor can directly reference any memory location.
  - Communication occurs implicitly as a result of loads and stores.
- Normal uniprocessor mechanisms are used to access data (loads and stores), plus synchronization.
  - The key is extension of the memory hierarchy to support multiple processors (extended memory hierarchy).
- Memory may be physically distributed among processors.
- Caches in the extended memory hierarchy may hold multiple inconsistent copies of the same data, leading to a data consistency or cache coherence problem that has to be addressed by the hardware architecture.

Shared Memory Multiprocessors: Support of Programming Models

[Figure: layers of abstraction, top to bottom: programming models (multiprogramming, shared address space, message passing); compilation or library; communication abstraction (user/system boundary); operating systems support; hardware/software boundary; communication hardware; physical communication medium.]

- Address translation and protection in hardware (hardware SAS).
- Message passing using shared memory buffers: can offer very high performance since no OS involvement is necessary.
- The focus here is on supporting a consistent or coherent shared address space.
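The inconsistent-copies problem described for the extended memory hierarchy can be illustrated with a toy model. The sketch below is an illustration only, not taken from the lecture: the `Cache` class, the `demo` function, the address `0x100`, and the values 5 and 7 are all hypothetical, and the caches are assumed to be write-back with no coherence protocol at all.

```python
# Toy model (illustrative assumption, not the lecture's code):
# two processors with private write-back caches and NO coherence protocol.

class Cache:
    def __init__(self, memory):
        self.memory = memory   # shared main memory: address -> value
        self.lines = {}        # this processor's private copies: address -> value

    def load(self, addr):
        # On a miss, fetch from main memory; on a hit, return the cached
        # copy, which may be stale because nothing ever invalidates it.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def store(self, addr, value):
        # Write-back policy: only the local copy is updated.
        self.lines[addr] = value

    def write_back(self, addr):
        # Copy the local value to main memory (e.g. on eviction).
        self.memory[addr] = self.lines[addr]


def demo():
    memory = {0x100: 5}
    p1, p2 = Cache(memory), Cache(memory)

    first_p1 = p1.load(0x100)       # P1 caches the value 5
    first_p2 = p2.load(0x100)       # P2 caches the value 5
    p1.store(0x100, 7)              # P1 writes 7 into its own cache only
    stale_p2 = p2.load(0x100)       # P2 still reads its old copy
    stale_mem = memory[0x100]       # main memory is stale as well
    p1.write_back(0x100)            # memory now holds 7 ...
    still_stale_p2 = p2.load(0x100) # ... but P2's copy was never invalidated
    return first_p1, first_p2, stale_p2, stale_mem, still_stale_p2, memory[0x100]


if __name__ == "__main__":
    print(demo())
```

In this model P2 keeps reading the old value even after main memory has been updated, which is the kind of inconsistency a coherence mechanism such as bus snooping is meant to prevent.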
Shared Memory Multiprocessors Variations

- Uniform Memory Access (UMA) Multiprocessors:
  - All processors have equal access to all memory addresses.
  - Can be further divided into three types:
    - Bus-based shared memory multiprocessors: Symmetric Memory Multiprocessors (SMPs).
    - Shared cache multiprocessors.
    - Dancehall multiprocessors.
- Non-uniform Memory Access (NUMA) or distributed memory multiprocessors:
  - Shared memory is physically distributed locally among processors (nodes). Access latency to remote memory is higher.
  - Most popular design for building scalable systems (MPPs).
  - Cache coherence is achieved by directory-based methods.

[Figure: shared memory multiprocessor variations: (a) Shared cache (UMA; interleaved first-level cache and interleaved main memory behind a switch; e.g. CMPs); (b) Bus-based shared memory or SMP nodes (UMA; bus or point-to-point interconnects, main memory, I/O devices); (c) Dancehall (UMA; scalable interconnection network, point-to-point or MIN); (d) Distributed memory (NUMA; scalable distributed shared memory).]

Uniform Memory Access (UMA) Multiprocessors

- Bus-based multiprocessors (SMPs):
  - A number of processors (commonly 2-4) in a single node share physical memory via a system bus or point-to-point interconnects (e.g. AMD64 via HyperTransport).
  - Symmetric access to all of main memory from any processor.
  - Commonly called Symmetric Memory Multiprocessors (SMPs).
  - Building blocks for larger parallel systems (MPPs, clusters).
  - Also attractive for high-throughput servers.
  - Bus-snooping mechanisms are used to address the cache coherency problem.
- Shared cache multiprocessor systems:
  - Low-latency sharing and prefetching across processors.
  - Sharing of working sets.
  - No cache coherence problem (and hence no false sharing either).
  - But high bandwidth needs and negative interference (e.g. conflicts).
  - Hit and miss latency increased due to the intervening switch and cache size.
  - Used in the mid 80s to connect a few processors on a board (Encore, Sequent).
  - Used currently in chip multiprocessors (CMPs): 2-4 processors on a single chip, e.g. IBM Power 4 with two processor cores on a chip (shared L2).
- Dancehall:
  - No local memory associated with a node.
  - Not a popular design: all memory is uniformly costly to access over the network for all processors.

Uniform Memory Access Example: Intel Pentium Pro Quad (Circa 1997)

[Figure: P-Pro modules (each with CPU, 256-KB L2, interrupt controller, and bus interface) on a shared FSB (P-Pro bus: 64-bit data, 36-bit address, 66 MHz); two PCI bridges to PCI buses with PCI I/O cards; memory controller and MIU to 1-, 2-, or 4-way interleaved DRAM.]

- All coherence and multiprocessing glue is in the processor module.
- Highly integrated, targeted at high volume.
- Bus-based Symmetric Memory Processors (SMPs): a single Front Side Bus (FSB) is shared among processors. This severely limits scalability to only 2-4 processors.

(Repeated here from lecture 1)

Non-Uniform Memory Access (NUMA) Example: AMD 8-way Opteron Server Node (Circa 2003)

- Dedicated point-to-point interconnects (coherent HyperTransport links) are used to connect processors, alleviating the traditional limitations of FSB-based SMP systems while still providing the cache coherency support needed.
- Each processor has two integrated DDR memory channel controllers: memory bandwidth scales up with the number of processors.
- NUMA architecture, since a processor can access its own memory at a lower latency than remote memory directly connected to other processors in the system.
- Total of 16 processor cores when dual-core Opteron processors are used.

(Repeated here from lecture 1)

Chip Multiprocessor (Shared-Cache) Example: CMP IBM Power 4

- Two processor cores on a chip share the level 2 cache.

Complexities of MIMD Shared Memory Access

- The relative order (interleaving) of instructions in different streams is not fixed.
  - With no synchronization among instruction streams, a large number of instruction interleavings is possible.
- If instructions are reordered within a stream, then an even larger number of instruction interleavings is possible.
- If memory accesses are not atomic (i.e. the effect of an access is not visible to memory and to all processors in the same order), with multiple copies of the same data coexisting (cache-based systems), then different processors observe different interleavings during the same execution.
- The total
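The growth in the number of possible interleavings can be made concrete with a short counting sketch. It is not from the lecture: the function names are hypothetical, and it assumes that, absent synchronization, every merge of the streams that preserves each stream's own program order is possible. For streams of lengths n_1, ..., n_k this gives the multinomial count (n_1 + ... + n_k)! / (n_1! ... n_k!). If instructions inside a stream may also be reordered freely (an idealized assumption that ignores data dependences), any of the (n_1 + ... + n_k)! global orders can occur.

```python
from math import factorial


def count_interleavings(stream_lengths):
    """Count merges of the streams that keep each stream's program order."""
    total = factorial(sum(stream_lengths))
    for n in stream_lengths:
        # Each intermediate quotient is an integer (a product of binomials).
        total //= factorial(n)
    return total


def count_with_reordering(stream_lengths):
    """Count global orders if every stream may also be reordered freely.

    Idealized assumption: no data dependences restrict the reordering.
    """
    return factorial(sum(stream_lengths))


if __name__ == "__main__":
    for lengths in ([2, 2], [3, 3], [4, 4, 4]):
        print(lengths, count_interleavings(lengths), count_with_reordering(lengths))
```

Even two streams of three instructions each admit 20 program-order-preserving interleavings, and 720 orders once reordering within streams is allowed under the assumption above.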


RIT EECC 756 - Study Notes
