Slide 1TodayMulticore ProcessorMemory ConsistencySequential Consistency ExampleNon-Coherent Cache ScenarioSnoopy CachesSnoopy CachesOut-of-Order Processor StructureHyperthreadingSummary: Creating Parallel MachinesSummation ExampleAccumulating in Single Global Variable: DeclarationsAccumulating in Single Global Variable: OperationThread Function: No SynchronizationUnsynchronized PerformanceThread Function: Semaphore / MutexSemaphore / Mutex PerformanceSeparate AccumulationSeparate Accumulation: OperationThread Function: Memory AccumulationMemory Accumulation PerformanceFalse SharingFalse Sharing PerformanceThread Function: Register AccumulationRegister Accumulation PerformanceA More Interesting ExampleSequential Quicksort VisualizedSequential Quicksort VisualizedSequential Quicksort CodeParallel QuicksortParallel Quicksort VisualizedParallel Quicksort Data StructuresParallel Quicksort InitializationParallel Quicksort: Accessing Task QueueParallel Quicksort: Top-Level FunctionParallel Quicksort: Recursive functionParallel Quicksort: Sorting Task FunctionParallel Quicksort PerformanceParallel Quicksort PerformanceImplementation SubtletiesAmdahl’s LawAmdahl’s Law ExampleAmdahl’s Law & Parallel QuicksortLessons LearnedCarnegie Mellon1Thread-Level Parallelism15-213: Introduction to Computer Systems26th Lecture, Nov. 30, 2010Instructors: Randy Bryant and Dave O’HallaronCarnegie Mellon2TodayParallel Computing HardwareMulticoreMultiple separate processors on single chipHyperthreadingReplicated instruction execution hardware in each processorMaintaining cache consistencyThread Level ParallelismSplitting program into independent tasksExample: Parallel summationSome performance artifactsDivide-and conquer parallelismExample: Parallel quicksortCarnegie Mellon3Multicore ProcessorIntel Nehalem ProcessorE.g., Shark machinesMultiple processors operating with coherent view of memoryRegsL1 d-cacheL1 i-cacheL2 unified cacheCore 0RegsL1 d-cacheL1 i-cacheL2 unified cacheCore n-1…L3 unified cache(shared by all cores)Main memoryCarnegie Mellon4Memory ConsistencyWhat are the possible values printed?Depends on memory consistency modelAbstract model of how hardware handles concurrent accesses Sequential consistencyOverall effect consistent with each individual threadOtherwise, arbitrary interleavingint a = 1;int b = 100;Thread1:Wa: a = 2;Rb: print(b);Thread2:Wb: b = 200;Ra: print(a);WaRbWbRaThread consistencyconstraintsCarnegie Mellon5Sequential Consistency ExampleImpossible outputs100, 1 and 1, 100Would require reaching both Ra and Rb before Wa and WbWaRbWbRaWbRbRaRaRbWbRaWaRbWaRaRbRbRa100, 2200, 22, 2001, 2002, 200200, 2WaRbWbRaThread consistencyconstraintsint a = 1;int b = 100;Thread1:Wa: a = 2;Rb: print(b);Thread2:Wb: b = 200;Ra: print(a);Carnegie Mellon6Non-Coherent Cache ScenarioWrite-back caches, without coordination between themMain Memorya:1 b:100Thread1 Cachea: 2Thread2 Cacheb:200a:1print 1b:100print 100int a = 1;int b = 100;Thread1:Wa: a = 2;Rb: print(b);Thread2:Wb: b = 200;Ra: print(a);Carnegie Mellon7Snoopy CachesTag each cache block with stateInvalid Cannot use valueSharedReadable copyExclusive Writeable copyMain Memorya:1 b:100Thread1 Cache Thread2 Cachea: 2Eb:200Eint a = 1;int b = 100;Thread1:Wa: a = 2;Rb: print(b);Thread2:Wb: b = 200;Ra: print(a);Carnegie Mellon8Snoopy CachesTag each cache block with stateInvalid Cannot use valueSharedReadable copyExclusive Writeable copyMain Memorya:1 b:100Thread1 Cache Thread2 Cachea: 2Eb:200Eprint 200b:200Sb:200Sprint 2a:2Sa: 2Sint a = 1;int b = 100;Thread1:Wa: a = 2;Rb: print(b);Thread2:Wb: b = 200;Ra: print(a);When cache sees request for one of its E-tagged blocksSupply value from cacheSet tag to SCarnegie Mellon9Out-of-Order Processor StructureInstruction control dynamically converts program into stream of operationsOperations mapped onto functional units to execute in parallelFunctional UnitsIntegerArithIntegerArithFPArithLoad /StoreInstruction ControlRegistersInstruction DecoderOp. QueueData CacheInstructionCachePCCarnegie Mellon10HyperthreadingReplicate enough instruction control to process K instruction streamsK copies of all registersShare functional unitsFunctional UnitsIntegerArithIntegerArithFPArithLoad /StoreInstruction ControlReg BInstruction DecoderOp. Queue BData CacheInstructionCacheReg A Op. Queue APC APC BCarnegie Mellon11Summary: Creating Parallel MachinesMulticoreSeparate instruction logic and functional unitsSome shared, some private cachesMust implement cache coherencyHyperthreading Also called “simultaneous multithreading”Separate program stateShared functional units & cachesNo special control needed for coherencyCombiningShark machines: 8 cores, each with 2-way hyperthreadingTheoretical speedup of 16XNever achieved in our benchmarksCarnegie Mellon12Summation ExampleSum numbers 0, …, N-1Should add up to (N-1)*N/2Partition into K rangesN/K values eachAccumulate leftover values seriallyMethod #1: All threads update single global variable1A: No synchronization1B: Synchronize with pthread semaphore1C: Synchronize with pthread mutex“Binary” semaphore. Only values 0 & 1Carnegie Mellon13Accumulating in Single Global Variable: Declarationstypedef unsigned long data_t;/* Single accumulator */volatile data_t global_sum;/* Mutex & semaphore for global sum */sem_t semaphore;pthread_mutex_t mutex;/* Number of elements summed by each thread */size_t nelems_per_thread;/* Keep track of thread IDs */pthread_t tid[MAXTHREADS];/* Identify each thread */int myid[MAXTHREADS];Carnegie Mellon14Accumulating in Single Global Variable: Operation nelems_per_thread = nelems / nthreads; /* Set global value */ global_sum = 0; /* Create threads and wait for them to finish */ for (i = 0; i < nthreads; i++) {myid[i] = i;Pthread_create(&tid[i], NULL, thread_fun, &myid[i]); } for (i = 0; i < nthreads; i++) Pthread_join(tid[i], NULL); result = global_sum; /* Add leftover elements */ for (e = nthreads * nelems_per_thread; e < nelems; e++) result += e;Carnegie Mellon15Thread Function: No Synchronizationvoid *sum_race(void *vargp) { int myid = *((int *)vargp); size_t start = myid * nelems_per_thread;
View Full Document