3/3/11$1$CS$61C:$Great$Ideas$in$Computer$Architecture$(Machine$Structures)$SIMD%II%Instructors:$Randy$H.$Katz$David$A.$PaFerson$hFp://inst.eecs.Berkeley.edu/~cs61c/sp11$1$Spring$2011$NN$Lecture$#14$3/3/11$3/3/11$ Spring$2011$NN$Lecture$#14$ 2$NewNSchool$Machine$Structures$(It’s$a$bit$more$complicated!)$• Parallel$Requests$Assigned$to$computer$e.g.,$Search$“Katz”$• Parallel$Threads$Assigned$to$core$e.g.,$Lookup,$Ads$• Parallel$Instruc\ons$>1$instruc\on$@$one$\me$e.g.,$5$pipelined$instruc\ons$• Parallel$Data$>1$data$item$@$one$\me$e.g.,$Add$of$4$pairs$of$words$• Hardware$descrip\ons$All$gates$@$one$\me$3/3/11$ Spring$2011$NN$Lecture$#14$ 3$Smart$Phone$Warehouse$Scale$Computer$So'ware%%%%%%%%Hardware%Harness%Parallelism%&%Achieve%High%Performance%Logic$Gates$Core$ Core$…$$$$$$Memory$$$$$$$$$$$$$$$(Cache)$Input/Output$Computer$Main$Memory$Core$$$$$$$$$$Instruc\on$Unit(s)$$$$$$$$Func\onal$Unit(s)$A3+B3$A2+B2$A1+B1$A0+B0$Today ’s $Lecture$Review$• Flynn$Taxonomy$of$Parallel$Architectures$– SIMD:%Single%Instruc>on%Mul>ple%Data%– MIMD:%Mul>ple%Instruc>on%Mul>ple%Data%– SISD:$Single$Instruc\on$Single$Data$(unused)$– MISD:$Mul\ple$Instruc\on$Single$Data$• Intel$SSE$SIMD$Instruc\ons$– One$instruc\on$fetch$that$operates$on$mul\ple$operands$simultaneously$– 128/64$bit$XMM$registers$• SSE$Instruc\ons$in$C$– Embed$the$SSE$machine$instruc\ons$directly$into$C$programs$through$use$of$intrinsics$– Achieve$efficiency$beyond$that$of$op\mizing$compiler$3/3/11$ 4$Spring$2011$NN$Lecture$#14$Agenda$• Amdahl’s$Law$• Administrivia$• SIMD$and$Loop$Unrolling$• Technology$Break$• Memory$Performance$for$Caches$• Review$of$1st$Half$of$61C$3/3/11$ 5$Spring$2011$NN$Lecture$#14$Big$Idea:$Amdahl’s$(Heartbreaking)$Law$• Speedup$due$to$enhancement$E$is$Speedup$w/$E$=$$$$$$NNNNNNNNNNNNNNNNNNNNNN$$$Exec$\me$w/o$E$Exec$\me$w/$E$$• Suppose$that$enhancement$E$accelerates$a$frac\on$F$$$(F$<1)$of$the$task$by$a$factor$S$(S>1)$and$the$remainder$of$the$task$is$unaffected$Execu\on$Time$w/$E$$=$Speedup$w/$E$$=$3/3/11$ 6$Fall$2010$NN$Lecture$#17$Execu\on$Time$w/o$E$$×$$[$(1NF)$+$F/S]$$1$/$[$(1NF)$+$F/S$]$3/3/11$2$Big$Idea:$Amdahl’s$Law$3/3/11$ 7$Fall$2010$NN$Lecture$#17$Speedup$$=$Example:$the$execu\on$\me$of$half$of$the$program$can$be$accelerated$by$a$factor$of$2.$What$is$the$program$speedNup$overall?$Big$Idea:$Amdahl’s$Law$3/3/11$ 8$Fall$2010$NN$Lecture$#17$Speedup$$=$$$$$$$$$$$$$$$$$$$$$$$1$Example:$the$execu\on$\me$of$half$of$the$program$can$be$accelerated$by$a$factor$of$2.$What$is$the$program$speedNup$overall?$$(1$N$F)$$$+$$$F$S$NonNspeedNup$part$SpeedNup$part$1$0.5$+$0.5$2$1$0.5$+$0.25$=$=$1.33$Big$Idea:$Amdahl’s$Law$3/3/11$ Fall$2010$NN$Lecture$#17$ 9$If$the$por\on$of$the$program$that$can$be$parallelized$is$small,$then$the$speedup$is$limited$The$nonNparallel$por\on$limits$the$performance$Example$#1:$Amdahl’s$Law$• Consider$an$enhancement$which$runs$20$\mes$faster$but$which$is$only$usable$25%$of$the$\me.$$ $ $Speedup$w/$E$$=$$$• What$if$its$usable$only$15%$of$the$\me?$$ $ $Speedup$w/$E$$=$$$• Amdahl’s$Law$tells$us$that$to$achieve$linear$speedup$with$100$processors,$none$of$the$original$computa\on$can$be$scalar!$• To$get$a$speedup$of$90$from$100$processors,$the$percentage$of$the$original$program$that$could$be$scalar$would$have$to$be$0.1%$or$less$$ $ $Speedup$w/$E$$=$$3/3/11$ Fall$2010$NN$Lecture$#17$ 10$$$$$$$$$$$$$$$$$$$$$Speedup$w/$E$=$$$$$Example$#1:$Amdahl’s$Law$• Consider$an$enhancement$which$runs$20$\mes$faster$but$which$is$only$usable$25%$of$the$\me$$ $ $Speedup$w/$E$$=$$1/(.75$+$.25/20)$$=$$1.31$• What$if$its$usable$only$15%$of$the$\me?$$ $ $Speedup$w/$E$$=$$1/(.85$+$.15/20)$$=$$1.17$• Amdahl’s$Law$tells$us$that$to$achieve$linear$speedup$with$100$processors,$none$of$the$original$computa\on$can$be$scalar!$• To$get$a$speedup$of$90$from$100$processors,$the$percentage$of$the$original$program$that$could$be$scalar$would$have$to$be$0.1%$or$less$$ $ $Speedup$w/$E$$=$$1/(.001$+$.999/100)$$=$$90.99$3/3/11$ Fall$2010$NN$Lecture$#17$ 11$Speedup$w/$E$=$$$1$/$[$(1NF)$+$F/S$]$Parallel$SpeedNup$Example$• 10$“scalar”$opera\ons$(nonNparallelizable)$• 100$parallelizable$opera\ons$• 110$opera\ons$– 100/110$=$.909$Parallelizable,$10/110$=$0.91$Scalar$3/3/11$ Fall$2010$NN$Lecture$#17$ 12$Z0$+$Z1$+$…$+$Z10$X1,1$X1,10$X10,1$X10,10$Y1,1$Y1,10$Y10,1$Y10,10$+$NonNparallel$part$ Parallel$part$Par\\on$10$ways$$and$perform$on$10$parallel$processing$units$3/3/11$3$Example$#2:$Amdahl’s$Law$• Consider$summing$10$scalar$variables$and$two$10$by$10$matrices$(matrix$sum)$on$10$processors$$ $Speedup$w/$E$$=$• What$if$there$are$100$processors$?$$ $Speedup$w/$E$$=$• What$if$the$matrices$are$100$by$100$(or$10,010$adds$in$total)$on$10$processors?$$ $Speedup$w/$E$$=$• What$if$there$are$100$processors$?$$ $Speedup$w/$E$$=$Speedup$w/$E$=$$$1$/$[$(1NF)$+$F/S]$3/3/11$ 13$Fall$2010$NN$Lecture$#17$Example$#2:$Amdahl’s$Law$• Consider$summing$10$scalar$variables$and$two$10$by$10$matrices$(matrix$sum)$on$10$processors$Speedup$w/$E$$=$$1/(.091$+$.909/10)$$=$$1/0.1819$=$5.5$• What$if$there$are$100$processors$?$Speedup$w/$E$$=$$1/(.091$+$.909/100)$=$1/0.10009$=$10.0$• What$if$the$matrices$are$33$by$33(or$1019$adds$in$total)$on$10$processors?$(increase$parallel$data$by$10x)$Speedup$w/$E$$=$$1/(.009$+$.991/10)$$=$$1/0.108$=$9.2$• What$if$there$are$100$processors$?$Speedup$w/$E$$=$$1/(.009$+$.991/100)$=$1/0.019$=$52.6$Speedup$w/$E$=$$$1$/$[$(1NF)$+$F/S$]$3/3/11$ 14$Fall$2010$NN$Lecture$#17$Strong$and$Weak$Scaling$• To$get$good$speedup$on$a$mul\processor$while$keeping$the$problem$size$fixed$is$harder$than$gexng$good$speedup$by$increasing$the$size$of$the$problem.$– Strong%scaling:$when$speedup$can$be$achieved$on$a$parallel$processor$without$increasing$the$size$of$the$problem$(e.g.,$10x10$Matrix$on$10$processors$to$100)$– Weak%scaling:$when$speedup$is$achieved$on$a$parallel$processor$$by$increasing$the$size$of$the$problem$propor\onally$to$the$increase$in$the$number$of$processors$– (e.g.,$10x10$Matrix$on$10$processors$=>33x33$Matrix$on$100)$• Load$balancing$is$another$important$factor:$every$processor$doing$same$amount$of$work$$$– Just$1$unit$with$twice$the$load$of$others$cuts$speedup$almost$in$half$3/3/11$ Fall$2010$NN$Lecture$#17$ 15$Peer$Review$•
View Full Document