CS 152 Computer Architecture and Engineering
Lecture 23: Synchronization
2006-11-16
John Lazzaro (www.cs.berkeley.edu/~lazzaro)
TAs: Udam Saini and Jue Sun
www-inst.eecs.berkeley.edu/~cs152/
CS 152 L23: Synchronization. Tuesday, March 4, 14. UC Regents Fall 2006, UCB

Last Time: NVidia 8800, a unified GPU
  128 shader CPUs. The thread processor sets the shader type of each CPU. Streams loop around.
  1.35 GHz shader CPU clock, 575 MHz core clock.
  (Slide from CS 152 L22: Graphics Processors.)

Recall: Two CPUs sharing memory
  In earlier lectures, we pretended it was easy to let several CPUs share a memory system. In fact, it is an architectural challenge. Even letting several threads on one machine share memory is tricky.
  [Figure: clipped excerpt from an article on the Power5 chip. Legible text: "... supports a 1.875-Mbyte on-chip L2 cache. Power4 and Power4+ systems both have 32-Mbyte L3 caches, whereas Power5 systems have a 36-Mbyte L3 cache. The L3 cache operates as a backdoor with separate buses for reads and writes that oper..." Caption: Figure 2. Power5 chip (FXU = fixed-point execution unit, ISU = instruction sequencing unit, IDU = instruction decode unit, LSU = load/store unit, IFU = instruction fetch unit, FPU = floating-point unit, and MC = memory controller).]

Today: Hardware Thread Support
  Producer/Consumer: One thread writes A, one thread reads A.
  Locks: Two threads share write access to A.
  On Tuesday: Multiprocessor memory system design and synchronization issues.
  Tuesday is a simplified overview; graduate-level architecture courses spend weeks on this topic.

How 2 threads share a queue
  We begin with an empty queue.
  [Diagram: words in memory, arrow toward higher address numbers; Tail and Head pointers.]
  Thread 1 (T1) adds data to the tail of the queue: the "Producer" thread.
  Thread 2 (T2) takes data from the head of the queue: the "Consumer" thread.

Producer adding x to the queue
  Before: Tail, Head (queue empty).
  T1 code (producer):
      ORI  R1, R0, xval     # Load x value into R1
      LW   R2, tail(R0)     # Load tail pointer into R2
      SW   R1, 0(R2)        # Store x into queue
      ADDI R2, R2, 4        # Shift tail by one word
      SW   R2, tail(R0)     # Update tail memory addr
  After: Tail, x, Head.

Producer adding y to the queue
  Before: Tail, x, Head.
  T1 code (producer):
      ORI  R1, R0, yval     # Load y value into R1
      LW   R2, tail(R0)     # Load tail pointer into R2
      SW   R1, 0(R2)        # Store y into queue
      ADDI R2, R2, 4        # Shift tail by one word
      SW   R2, tail(R0)     # Update tail memory addr
  After: Tail, y, x, Head.

Consumer reading the queue
  Before: Tail, y, x, Head.
  T2 code (consumer):
            LW   R3, head(R0)   # Load head pointer into R3
      spin: LW   R4, tail(R0)   # Load tail pointer into R4
            BEQ  R4, R3, spin   # If queue empty, wait
            LW   R5, 0(R3)      # Read x from queue into R5
            ADDI R3, R3, 4      # Shift head by one word
            SW   R3, head(R0)   # Update head pointer
  After: Tail, y, Head.

What can go wrong? (single-threaded LW/SW "contract")
  Produce: Tail, Head becomes Tail, x, Head. Consume: Tail, x, Head becomes Tail, Head.
  T1 code (producer):
      ORI  R1, R0, x        # Load x value into R1
      LW   R2, tail(R0)     # Load tail pointer into R2
      SW   R1, 0(R2)        # (1) Store x into queue
      ADDI R2, R2, 4        # Shift tail by one word
      SW   R2, tail(R0)     # (2) Update tail pointer
  T2 code (consumer):
            LW   R3, head(R0)   # Load head pointer into R3
      spin: LW   R4, tail(R0)   # (3) Load tail pointer into R4
            BEQ  R4, R3, spin   # If queue empty, wait
            LW   R5, 0(R3)      # (4) Read x from queue into R5
            ADDI R3, R3, 4      # Shift head by one word
            SW   R3, head(R0)   # Update head pointer
  What if the order is 2, 3, 4, 1? Then x is read before it is written! The CPU running T1 has no way to know it's bad to delay 1.

Leslie Lamport: Sequential Consistency
  Sequential consistency: As if each thread takes turns executing, and instructions in each thread execute in program order.
  T1 code (producer):
      ORI  R1, R0, x        # Load x value into R1
      LW   R2, tail(R0)     # Load queue tail into R2
      SW   R1, 0(R2)        # (1) Store x into queue
      ADDI R2, R2, 4        # Shift tail by one word
      SW   R2, tail(R0)     # (2) Update tail memory addr
  T2 code (consumer):
            LW   R3, head(R0)   # Load queue head into R3
      spin: LW   R4, tail(R0)   # (3) Load queue tail into R4
            BEQ  R4, R3, spin   # If queue empty, wait
            LW   R5, 0(R3)      # (4) Read x from queue into R5
            ADDI R3, R3, 4      # Shift head by one word
            SW   R3, head(R0)   # Update head memory addr
  Sequentially consistent: 1, 2, 3, 4 or 1, 3, 2, 4, but not 2, 3, 1, 4 or 2, 3, 4, 1.
  Sequentially consistent architectures get the right answer, but give up many optimizations.

Efficient alternative: Memory barriers
  In the general case, the machine is not sequentially consistent.
  When needed, a memory barrier (a "fence") may be added to the program. All memory operations before the fence complete, then memory operations after the fence begin.
      ORI  R1, R0, x
      LW   R2, tail(R0)
      SW   R1, 0(R2)        # (1)
      MEMBAR
      ADDI R2, R2, 4
      SW   R2, tail(R0)     # (2)
  Ensures 1 completes before 2 takes effect.
  MEMBAR is expensive, but you only pay for it when you use it.
  Many MEMBAR variations exist for efficiency (versions that only affect loads or stores, certain memory regions, etc.).

Producer/consumer memory fences
  Produce: Tail, Head becomes Tail, x, Head. Consume: Tail, x, Head becomes Tail, Head.
  T2 code (consumer):
            LW   R3, head(R0)   # Load queue head into R3
      spin: LW   R4, tail(R0)   # (3) Load queue tail into R4
            BEQ  R4, R3, spin   # If queue empty, wait
            MEMBAR
            LW   R5, 0(R3)      # (4)
            ADDI R3, R3, 4
            SW   R3, head(R0)
  T1 code (producer):
      ORI  R1, R0, x        # Load x value into R1
      LW   R2, tail(R0)     # Load queue tail into R2
      SW   R1, 0(R2)        # Store x into queue
      …
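The fence placement on the producer/consumer slides can be sketched in C11, purely as an illustration and not as code from the lecture. The names `queue_t`, `queue_push`, `queue_try_pop`, and `QUEUE_WORDS` are hypothetical, and the sketch assumes a fixed-size array with no wraparound and indices in place of the slides' pointers. It also assumes that C11 release and acquire fences are an acceptable stand-in for MEMBAR; they are closer to the cheaper "variations" the slides mention than to a full barrier. The comments (1) to (4) match the numbered memory operations in the slide code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_WORDS 16  /* hypothetical capacity; the slides do not bound the queue */

typedef struct {
    int words[QUEUE_WORDS];  /* "words in memory" that hold the queue data */
    atomic_size_t tail;      /* next free slot; written only by the producer (T1) */
    size_t head;             /* next slot to read; used only by the consumer (T2) */
} queue_t;

static void queue_init(queue_t *q) {
    assert(q != NULL);
    atomic_init(&q->tail, 0);
    q->head = 0;
}

/* T1 (producer): store the value (1), fence, then publish the new tail (2). */
static bool queue_push(queue_t *q, int value) {
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    if (t == QUEUE_WORDS) {
        return false;  /* out of space: this sketch never wraps around */
    }
    q->words[t] = value;                        /* (1) store x into queue */
    atomic_thread_fence(memory_order_release);  /* stands in for MEMBAR */
    atomic_store_explicit(&q->tail, t + 1,
                          memory_order_relaxed); /* (2) update tail */
    return true;
}

/* T2 (consumer): load tail (3), fence, then read the data (4).
   Returns false when the queue is empty; the slide code spins instead. */
static bool queue_try_pop(queue_t *q, int *out) {
    size_t t = atomic_load_explicit(&q->tail,
                                    memory_order_relaxed); /* (3) load tail */
    if (t == q->head) {
        return false;  /* queue empty */
    }
    atomic_thread_fence(memory_order_acquire);  /* stands in for MEMBAR */
    *out = q->words[q->head];                   /* (4) read x from queue */
    q->head += 1;                               /* shift head by one slot */
    return true;
}
```

With one producer thread calling `queue_push` and one consumer thread calling `queue_try_pop` in a loop, the release fence keeps store (1) from being reordered after store (2), and the acquire fence keeps load (4) from being reordered before load (3), which rules out the 2, 3, 4, 1 ordering described in the slides.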