CS 152 Computer Architecture and Engineering
Lecture 23: Synchronization
2005-11-17
John Lazzaro (www.cs.berkeley.edu/~lazzaro)
TAs: David Marquardt and Udam Saini
www-inst.eecs.berkeley.edu/~cs152/
CS 152 L23: Synchronization, UC Regents Fall 2005, UCB

Last Time: How Routers Work

2. Forwarding engine determines the next hop for the packet, and returns next-hop data to the line card, together with an updated header.
[Figure from IEEE/ACM Transactions on Networking, vol. 6, no. 3, p. 238]

In earlier lectures, we pretended it was easy to let several CPUs share a memory system. In fact, it is an architectural challenge. Even letting several threads on one machine share memory is tricky.

Recall: Two CPUs sharing memory

"... supports a 1.875-Mbyte on-chip L2 cache. Power4 and Power4+ systems both have 32-Mbyte L3 caches, whereas Power5 systems have a 36-Mbyte L3 cache. The L3 cache operates as a backdoor with separate buses for reads and writes that oper..."
[Figure 2. Power5 chip (FXU = fixed-point execution unit, ISU = instruction sequencing unit, IDU = instruction decode unit, LSU = load/store unit, IFU = instruction fetch unit, FPU = floating-point unit, and MC = memory controller).]

Today: Hardware Thread Support

Producer/Consumer: One thread writes A, one thread reads A.
Locks: Two threads share write access to A.
On Tuesday: Multiprocessor memory system design and synchronization issues.
Tuesday is a simplified overview; graduate-level architecture courses spend weeks on this topic.

How 2 threads share a queue

We begin with an empty queue.
[Figure: words in memory, higher address numbers, with Tail and Head pointers]
Thread 1 (T1) adds data to the tail of the queue ("producer" thread).
Thread 2 (T2)
takes data from the head of the queue ("consumer" thread).

Producer adding x to the queue

[Figure: words in memory, higher address numbers, with Tail and Head pointers. Before: queue is empty. After: queue holds x.]

T1 code (producer):
    ORi  R1, R0, xval     # Load x value into R1
    LW   R2, tail(R0)     # Load tail pointer into R2
    SW   R1, 0(R2)        # Store x into queue
    ADDi R2, R2, 4        # Shift tail by one word
    SW   R2, tail(R0)     # Update tail memory addr

Producer adding y to the queue

[Figure: Before: queue holds x. After: queue holds y, x.]

T1 code (producer):
    ORi  R1, R0, yval     # Load y value into R1
    LW   R2, tail(R0)     # Load tail pointer into R2
    SW   R1, 0(R2)        # Store y into queue
    ADDi R2, R2, 4        # Shift tail by one word
    SW   R2, tail(R0)     # Update tail memory addr

Consumer reading the queue

[Figure: Before: queue holds y, x. After: queue holds y.]

T2 code (consumer):
    LW   R3, head(R0)     # Load head pointer into R3
spin:
    LW   R4, tail(R0)     # Load tail pointer into R4
    BEQ  R4, R3, spin     # If queue empty, wait
    LW   R5, 0(R3)        # Read x from queue into R5
    ADDi R3, R3, 4        # Shift head by one word
    SW   R3, head(R0)     # Update head pointer

What can go wrong?

[Figure: Before: queue holds y, x. After: queue holds y. Higher addresses as before.]

T1 code (producer):
    ORi  R1, R0, x        # Load x value into R1
    LW   R2, tail(R0)     # Load tail pointer into R2
    SW   R1, 0(R2)        # (1) Store x into queue
    ADDi R2, R2, 4        # Shift tail by one word
    SW   R2, tail(R0)     # (2) Update tail pointer

T2 code (consumer):
    LW   R3, head(R0)     # Load head pointer into R3
spin:
    LW   R4, tail(R0)     # (3) Load tail pointer into R4
    BEQ  R4, R3, spin     # If queue empty, wait
    LW   R5, 0(R3)        # (4) Read x from queue into R5
    ADDi R3, R3, 4        # Shift head by one word
    SW   R3, head(R0)     # Update head pointer

What if order is 2, 3, 4, 1? Then x is read before it is written!
The CPU running T1 has no way to know it's bad to delay 1!
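
The 2, 3, 4, 1 failure can be reproduced in a few lines of Python. This is only a sketch under stated assumptions: the addresses, the "stale" placeholder word, and the function name consume_with_store_order are hypothetical, and the consumer is modeled as polling once each time one of the producer's two stores becomes visible.

```python
# Assumed (hypothetical) memory layout; the lecture gives no addresses.
HEAD_PTR, TAIL_PTR, QUEUE_BASE = 0, 4, 100


def consume_with_store_order(store_order, x="x"):
    """Return the word the consumer loads into R5.

    store_order is the order in which the producer's stores, labeled
    (1) and (2) on the slide, become visible to the consumer's CPU.
    """
    # Empty queue: head == tail; the slot holds an old, unrelated word.
    mem = {HEAD_PTR: QUEUE_BASE, TAIL_PTR: QUEUE_BASE, QUEUE_BASE: "stale"}

    r2 = mem[TAIL_PTR]                     # T1: LW R2, tail(R0)
    pending = {
        1: (r2, x),                        # T1: SW R1, 0(R2)
        2: (TAIL_PTR, r2 + 4),             # T1: ADDi, then store to tail
    }
    r3 = mem[HEAD_PTR]                     # T2: LW R3, head(R0)

    for label in store_order:
        addr, value = pending[label]
        mem[addr] = value                  # this store is now visible
        r4 = mem[TAIL_PTR]                 # T2 (3): LW R4, tail(R0)
        if r4 != r3:                       # BEQ falls through: not empty
            return mem[r3]                 # T2 (4): LW R5, 0(R3)
    return None                            # consumer is still spinning


in_order = consume_with_store_order((1, 2))   # store x, then update tail
reordered = consume_with_store_order((2, 1))  # tail update overtakes x
print(in_order, reordered)
```

With the stores visible in program order the consumer reads x; when the tail update becomes visible first, the consumer leaves its spin loop and reads the old word, which is the 2, 3, 4, 1 case.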
Leslie Lamport: Sequential Consistency

Sequential Consistency: As if each thread takes turns executing, and instructions in each thread execute in program order.

T1 code (producer):
    ORi  R1, R0, x        # Load x value into R1
    LW   R2, tail(R0)     # Load queue tail into R2
    SW   R1, 0(R2)        # (1) Store x into queue
    ADDi R2, R2, 4        # Shift tail by one word
    SW   R2, tail(R0)     # (2) Update tail memory addr

T2 code (consumer):
    LW   R3, head(R0)     # Load queue head into R3
spin:
    LW   R4, tail(R0)     # (3) Load queue tail into R4
    BEQ  R4, R3, spin     # If queue empty, wait
    LW   R5, 0(R3)        # (4) Read x from queue into R5
    ADDi R3, R3, 4        # Shift head by one word
    SW   R3, head(R0)     # Update head memory addr

Legal orders: 1, 2, 3, 4 or 1, 3, 2, 4 or 3, 4, 1, 2 ... but not 2, 3, 1, 4!
Sequentially consistent architectures get the right answer, but give up many optimizations.

Efficient alternative: Memory barriers

In the general case, the machine is not sequentially consistent.
When needed, a memory barrier may be added to the program (a fence).
All memory operations before the fence complete, then memory operations after the fence begin.

    ORi  R1, R0, x
    LW   R2, tail(R0)
    SW   R1, 0(R2)        # (1)
    MEMBAR
    ADDi R2, R2, 4
    SW   R2, tail(R0)     # (2)

Ensures 1 completes before 2 takes effect.
MEMBAR is expensive, but you only pay for it when you use it.
Many MEMBAR variations exist for efficiency (versions that only affect loads or stores, certain memory regions, etc.).

Producer/consumer memory fences

[Figure: Before: queue holds y, x. After: queue holds y. Higher addresses as before.]

T1 code (producer):
    ORi  R1, R0, x        # Load x value into R1
    LW   R2, tail(R0)     # Load queue tail into R2
    SW   R1, 0(R2)        # (1) Store x into queue
    MEMBAR
    ADDi R2, R2, 4        # Shift tail by one word
    SW   R2, tail(R0)     # (2) Update tail memory addr

T2 code (consumer):
    LW   R3, head(R0)     # Load queue head into R3
spin:
    LW   R4, tail(R0)     # (3) Load queue tail into R4
    BEQ  R4, R3, spin     # If queue empty, wait
    MEMBAR
    LW   R5, 0(R3)        # (4) Read x from queue into R5
    ADDi R3, R3, 4        # Shift head by one word
    SW   R3, head(R0)     # Update head memory addr

Ensures 1 happens before 2, and 3
happens …
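
The effect of the two fences on the four marked operations can be checked by brute force. The Python sketch below rests on two assumptions that are not in the slides: that an unfenced machine may make the four operations take effect in any global order, and that the consumer's MEMBAR, which sits between the loads marked 3 and 4, orders 3 before 4. The names FENCE_CONSTRAINTS and allowed_orders are hypothetical, and the model ignores the spin loop.

```python
from itertools import permutations

# Ordering constraints as (earlier, later) pairs of marked operations.
# (1, 2) comes from the producer's MEMBAR. (3, 4) is an assumption
# based on where the consumer's MEMBAR sits in the code.
FENCE_CONSTRAINTS = [(1, 2), (3, 4)]


def allowed_orders(constraints, ops=(1, 2, 3, 4)):
    """Return every global order of ops that satisfies all constraints."""
    result = []
    for order in permutations(ops):
        position = {op: i for i, op in enumerate(order)}
        if all(position[a] < position[b] for a, b in constraints):
            result.append(order)
    return result


unfenced = allowed_orders([])               # no fences: nothing is ruled out
fenced = allowed_orders(FENCE_CONSTRAINTS)  # fences restore 1<2 and 3<4

print(len(unfenced), len(fenced))
print((2, 3, 4, 1) in unfenced, (2, 3, 4, 1) in fenced)
```

Under these assumptions the fences leave only the interleavings that keep each thread's marked operations in program order. Those include the three orders listed as legal under sequential consistency (1, 2, 3, 4; 1, 3, 2, 4; 3, 4, 1, 2) and exclude both 2, 3, 1, 4 and the failing 2, 3, 4, 1.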