Carnegie Mellon 1 Thread'Level+Parallelism+15#213:'Introduc0on'to'Computer'Systems'26th'Lecture,'Nov.'30,'2010'Instructors:''Randy'Bryant'and'Dave'O’Hallaron'Carnegie Mellon 2 Today+ Parallel++Compu:ng+Hardware+ Mul0core' Mul0ple'separate'processors'on'single'chip' Hyperthreading' Replicated'instruc0on'execu0on'hardware'in'each'processor' Maintaining'cache'consistency' Thread+Level+Parallelism+ SpliMng'program'into'independent'tasks' Example:'Parallel'summa0on' Some'performance'ar0facts' Divide#and'conquer'parallelism' Example:'Parallel'quicksort'Carnegie Mellon 3 Mul:core+Processor+ Intel+Nehalem+Processor + E.g.,'Shark'machines' Mul0ple'processors'opera0ng'with'coherent'view'of'memory'Regs L1 d-cache L1 i-cache L2 unified cache Core 0 Regs L1 d-cache L1 i-cache L2 unified cache Core n-1 … L3 unified cache (shared by all cores) Main memoryCarnegie Mellon 4 Memory+Consistency+ What+are+the+possible+values+printed?+ Depends'on'memory'consistency'model' Abstract'model'of'how'hardware'handles'concurrent'accesses'' Sequen:al+consistency+ Overall'effect'consistent'with'each'individual'thread' Otherwise,'arbitrary'interleaving'int+a+=+1;+int+b+=+100;+Thread1:+Wa: +a+=+2;+Rb:+ +print(b);+Thread2:+Wb:+b+=+200;+Ra: +print(a);+Wa+Rb+Wb+Ra+Thread+consistency+constraints+Carnegie Mellon 5 Sequen:al+Consistency+Example+ Impossible+outputs+ 100,'1'and'1,'100' Would'require'reaching'both'Ra'and'Rb'before'Wa'and'Wb'Wa+Rb+Wb+Ra+Wb+Rb+Ra+Ra+Rb+Wb+Ra+Wa+Rb+Wa+Ra+Rb+Rb+Ra+100,+2+200,+2+2,+200+1,+200+2,+200+200,+2+Wa+Rb+Wb+Ra+Thread+consistency+constraints+int+a+=+1;+int+b+=+100;+Thread1:+Wa: +a+=+2;+Rb:+ +print(b);+Thread2:+Wb:+b+=+200;+Ra: +print(a);+Carnegie Mellon 6 Non'Coherent+Cache+Scenario+ Write'back+caches,+without+coordina:on+between+them+Main Memory a:1 b:100 Thread1 Cache a: 2 Thread2 Cache b:200 a:1 print+1+b:100 print+100+int+a+=+1;+int+b+=+100;+Thread1:+Wa: +a+=+2;+Rb:+ +print(b);+Thread2:+Wb:+b+=+200;+Ra: +print(a);+Carnegie Mellon 7 Snoopy+Caches+ Tag+each+cache+block+with+state+Invalid 'Cannot'use'value'Shared 'Readable'copy'Exclusive 'Writeable'copy'Main Memory a:1 b:100 Thread1 Cache Thread2 Cache a: 2 E b:200 E int+a+=+1;+int+b+=+100;+Thread1:+Wa: +a+=+2;+Rb:+ +print(b);+Thread2:+Wb:+b+=+200;+Ra: +print(a);+Carnegie Mellon 8 Snoopy+Caches+ Tag+each+cache+block+with+state+Invalid 'Cannot'use'value'Shared 'Readable'copy'Exclusive 'Writeable'copy'Main Memory a:1 b:100 Thread1 Cache Thread2 Cache a: 2 E b:200 E print+200+b:200 S b:200 S print+2+a:2 S a: 2 S int+a+=+1;+int+b+=+100;+Thread1:+Wa: +a+=+2;+Rb:+ +print(b);+Thread2:+Wb:+b+=+200;+Ra: +print(a);+ When'cache'sees'request'for'one'of'its'E#tagged'blocks' Supply'value'from'cache' Set'tag'to'S'Carnegie Mellon 9 Out'of'Order+Processor+Structure+ Instruc:on+control+dynamically+c onverts+program+into+stream+of+opera:ons+ Opera:ons+mapped+onto+func:onal+units+to+execute+in+parallel+Functional Units Integer Arith Integer Arith FP Arith Load / Store Instruction Control Registers Instruction Decoder Op. Queue Data Cache Instruction Cache PCCarnegie Mellon 10 Hyperthreading+ Replicate+enough+instruc:on+control+to+process+K+instruc:on+streams+ K+copies+of+all+registers+ Share+func:onal+units+Functional Units Integer Arith Integer Arith FP Arith Load / Store Instruction Control Reg B Instruction Decoder Op. Queue B Data Cache Instruction Cache Reg A Op. Queue A PC A PC BCarnegie Mellon 11 Summary:+Crea:ng+Parallel+Machines+ Mul:core+ Separate'instruc0on'logic'and'func0onal'units' Some'shared,'some'private'caches' Must'implement'cache'coherency' Hyperthreading++ Also'called'“simultaneous'mul0threading”' Separate'program'state' Shared'func0onal'units'&'caches' No'special'control'needed'for'coherency' Combining+ Shark'machines:'8'cores,'each'with'2#way'hyperthreading' Theore0cal'speedup'of'16X' Never'achieved'in'our'benchmarks'Carnegie Mellon 12 Summa:on+Example+ Sum+numbers+0,+…,+N'1+ Should'add'up'to'(N#1)*N/2' Par::on+into+K+ranges+ ⎣N/K⎦'values'each' Accumulate'lebover'values'serially' Method+#1:+All+threads+update+single+global+variable+ 1A:'No'synchroniza0on' 1B:'Synchronize'with'pthread'semaphore' 1C:'Synchronize'with'pthread'mutex' “Binary”'semaphore.''Only'values'0'&'1'Carnegie Mellon 13 Accumula:ng+in+Single+Global+Variable:+Declara:ons+typedef unsigned long data_t; /* Single accumulator */ volatile data_t global_sum; /* Mutex & semaphore for global sum */ sem_t semaphore; pthread_mutex_t mutex; /* Number of elements summed by each thread */ size_t nelems_per_thread; /* Keep track of thread IDs */ pthread_t tid[MAXTHREADS]; /* Identify each thread */ int myid[MAXTHREADS];Carnegie Mellon 14 Accumula:ng+in+Single+Global+Variable:+Opera:on+ nelems_per_thread = nelems / nthreads; /* Set global value */ global_sum = 0; /* Create threads and wait for them to finish */ for (i = 0; i < nthreads; i++) { myid[i] = i; Pthread_create(&tid[i], NULL, thread_fun, &myid[i]); } for (i = 0; i < nthreads; i++) Pthread_join(tid[i], NULL); result = global_sum; /* Add leftover elements */ for (e = nthreads * nelems_per_thread; e < nelems; e++) result += e;Carnegie Mellon 15 Thread+Func:on:+No+Synchroniza:on+void *sum_race(void *vargp) { int myid = *((int *)vargp); size_t start = myid * nelems_per_thread; size_t end = start + nelems_per_thread; size_t i; for (i = start; i < end; i++) { global_sum += i; } return NULL; }Carnegie Mellon 16 Unsynchronized+Performance+ N+=+230+ Best+speedup+=+2.86X+ Gets+wrong+answer+when+>+1+thread!+Carnegie Mellon 17 Thread+Func:on:+Semaphore+/+Mutex+void *sum_sem(void *vargp) { int myid = *((int *)vargp); size_t start = myid * nelems_per_thread; size_t end = start + nelems_per_thread; size_t i; for (i = start; i < end; i++) { sem_wait(&semaphore); global_sum += i; sem_post(&semaphore); } return NULL; } sem_wait(&semaphore); global_sum += i; sem_post(&semaphore);
View Full Document