3 3 11 CS 61C Great Ideas in Computer Architecture Machine Structures SIMD II Instructors Randy H Katz David A PaFerson hFp inst eecs Berkeley edu cs61c sp11 3 3 11 Spring 2011 Lecture 14 1 3 3 11 Spring 2011 Lecture 14 New School Machine Structures It s a bit more complicated Review So ware Hardware Parallel Requests Assigned to computer e g Search Katz Parallel Threads Assigned to core e g Lookup Ads Smart Phone Warehouse Scale Computer Harness Parallelism Achieve High Performance Computer Parallel Instruc ons 1 instruc on one me e g 5 pipelined instruc ons Parallel Data 1 data item one me e g Add of 4 pairs of words Hardware descrip ons All gates one me 3 3 11 Core Today s Instruc on Unit s Lecture SIMD Single Instruc on Mul ple Data MIMD Mul ple Instruc on Mul ple Data SISD Single Instruc on Single Data unused MISD Mul ple Instruc on Single Data One instruc on fetch that operates on mul ple operands simultaneously 128 64 bit XMM registers Core Func onal Unit s SSE Instruc ons in C Embed the SSE machine instruc ons directly into C programs through use of intrinsics Achieve e ciency beyond that of op mizing compiler A0 B0 A1 B1 A2 B2 A3 B3 Main Memory Logic Gates Spring 2011 Lecture 14 3 Agenda Flynn Taxonomy of Parallel Architectures Intel SSE SIMD Instruc ons Core Memory Cache Input Output 2 3 3 11 Spring 2011 Lecture 14 4 Big Idea Amdahl s Heartbreaking Law Speedup due to enhancement E is Amdahl s Law Administrivia SIMD and Loop Unrolling Technology Break Memory Performance for Caches Review of 1st Half of 61C Exec me w o E Speedup w E Exec me w E Suppose that enhancement E accelerates a frac on F F 1 of the task by a factor S S 1 and the remainder of the task is una ected Execu on Time w E Execu on Time w o E 1 F F S Speedup w E 1 1 F F S 3 3 11 Spring 2011 Lecture 14 5 3 3 11 Fall 2010 Lecture 17 6 1 3 3 11 Big Idea Amdahl s Law Big Idea Amdahl s Law Speedup Speedup 1 1 F F Non speed up part S Example the execu on me of half of the program can be accelerated by a factor of 2 What is the program speed up overall Example the execu on me of half of the program can be accelerated by a factor of 2 What is the program speed up overall 1 0 5 0 5 2 3 3 11 Fall 2010 Lecture 17 7 The non parallel por on limits the performance Fall 2010 Lecture 17 8 9 What if its usable only 15 of the me Speedup w E Amdahl s Law tells us that to achieve linear speedup with 100 processors none of the original computa on can be scalar To get a speedup of 90 from 100 processors the percentage of the original program that could be scalar would have to be 0 1 or less Speedup w E 3 3 11 Fall 2010 Lecture 17 10 Parallel Speed up Example Speedup w E 1 1 F F S Consider an enhancement which runs 20 mes faster but which is only usable 25 of the me Speedup w E 1 75 25 20 1 31 Z0 Z1 Z10 X1 1 X1 10 Y1 1 Y1 10 Y10 1 Y10 10 X10 1 What if its usable only 15 of the me Speedup w E 1 85 15 20 1 17 Non parallel part Amdahl s Law tells us that to achieve linear speedup with 100 processors none of the original computa on can be scalar To get a speedup of 90 from 100 processors the percentage of the original program that could be scalar would have to be 0 1 or less Speedup w E 1 001 999 100 90 99 Fall 2010 Lecture 17 1 33 Consider an enhancement which runs 20 mes faster but which is only usable 25 of the me Speedup w E Example 1 Amdahl s Law 3 3 11 1 0 5 0 25 Speedup w E If the por on of the program that can be parallelized is small then the speedup is limited Fall 2010 Lecture 17 3 3 11 Example 1 Amdahl s Law Big Idea Amdahl s Law 3 3 11 Speed up part 11 X10 10 Par on 10 ways and perform on 10 parallel processing units Parallel part 10 scalar opera ons non parallelizable 100 parallelizable opera ons 110 opera ons 100 110 909 Parallelizable 10 110 0 91 Scalar 3 3 11 Fall 2010 Lecture 17 12 2 3 3 11 Example 2 Amdahl s Law Example 2 Amdahl s Law Speedup w E 1 1 F F S Speedup w E 1 1 F F S Consider summing 10 scalar variables and two 10 by 10 matrices matrix sum on 10 processors Consider summing 10 scalar variables and two 10 by 10 matrices matrix sum on 10 processors Speedup w E Speedup w E 1 091 909 10 1 0 1819 5 5 What if there are 100 processors What if there are 100 processors Speedup w E 1 091 909 100 1 0 10009 10 0 Speedup w E What if the matrices are 33 by 33 or 1019 adds in total on 10 processors increase parallel data by 10x What if the matrices are 100 by 100 or 10 010 adds in total on 10 processors Speedup w E 1 009 991 10 1 0 108 9 2 Speedup w E What if there are 100 processors What if there are 100 processors Speedup w E 1 009 991 100 1 0 019 52 6 Speedup w E 3 3 11 Fall 2010 Lecture 17 13 3 3 11 Fall 2010 Lecture 17 Strong and Weak Scaling Peer Review To get good speedup on a mul processor while keeping the problem size xed is harder than gexng good speedup by increasing the size of the problem Strong scaling when speedup can be achieved on a parallel processor without increasing the size of the problem e g 10x10 Matrix on 10 processors to 100 Weak scaling when speedup is achieved on a parallel processor by increasing the size of the problem propor onally to the increase in the number of processors e g 10x10 Matrix on 10 processors 33x33 Matrix on 100 Suppose a program spends 80 of its me in a square root rou ne How much must you speedup square root to make the program run 5 mes faster Speedup w E 1 1 F F S A red C green 10 20 Just 1 unit with twice the load of others cuts speedup almost in half Fall 2010 Lecture 17 15 Lab 7 posted No Homework no project this week TA Review Su Mar 6 2 5 PM 2050 VLSB Midterm Exam Tu Mar 8 6 9 PM 145 155 Dwinelle Split A Lew in 145 Li Z in …
View Full Document
Unlocking...