Berkeley COMPSCI 252 - Lecture Notes
(Preview: pages 1-3 of 8)
CS252 Graduate Computer Architecture
Lecture 13: Multiprocessor 3: Measurements, Crosscutting Issues, Examples, Fallacies & Pitfalls
March 2, 2001
Prof. David A. Patterson
Computer Science 252, Spring 2001

Lec 13.2: Review
• Caches contain all information on the state of cached memory blocks
• Snooping and directory protocols are similar
• A bus makes snooping easier because of broadcast (snooping => Uniform Memory Access)
• A directory keeps an extra data structure to track the state of all cache blocks
• Distributing the directory => scalable shared-address multiprocessor => cache coherent, Non-Uniform Memory Access (NUMA)

Lec 13.3: Parallel App: Commercial Workload
• Online transaction processing workload (OLTP) (like TPC-B or -C)
• Decision support system (DSS) (like TPC-D)
• Web index search (AltaVista)

  Benchmark      % Time User Mode   % Time Kernel   % Time I/O (CPU Idle)
  OLTP                 71%               18%                11%
  DSS (range)        82-94%             3-5%               4-13%
  DSS (avg)            87%                4%                 9%
  AltaVista           >98%               <1%                <1%

Lec 13.4: Alpha 4100 SMP
• 4 CPUs
• 300 MHz Alpha 21164
• L1$: 8 KB, direct mapped, write through
• L2$: 96 KB, 3-way set associative
• L3$: 2 MB (off chip), direct mapped
• Memory latency: 80 clock cycles
• Cache-to-cache latency: 125 clock cycles

Lec 13.5: OLTP Performance as vary L3$ size
[Stacked-bar chart, y-axis 0-100: OLTP execution-time breakdown vs. L3 cache size (1, 2, 4, 8 MB); segments: Instruction Execution, L2/L3 Cache Access, Memory Access, PAL Code, Idle.]

Lec 13.6: L3 Miss Breakdown
[Stacked-bar chart, y-axis 0-3.25: L3 misses vs. cache size (1, 2, 4, 8 MB); segments: Instruction, Capacity/Conflict, Cold, False Sharing, True Sharing.]

Lec 13.7: Memory CPI as increase CPUs
[Stacked-bar chart, y-axis 0-3: memory CPI vs. processor count (1, 2, 4, 6, 8); segments: Instruction, Conflict/Capacity, Cold, False Sharing, True Sharing.]

Lec 13.8: Miss Breakdown as vary block size
[Stacked-bar chart, y-axis 0-16: misses vs. block size in bytes (32, 64, 128, 256); segments: Instruction, Capacity/Conflict, Cold, False Sharing, True Sharing.]

Lec 13.9: NUMA Memory Performance for Scientific Apps on SGI Origin 2000
• Show average cycles per memory reference in 4 categories:
  – Cache hit
  – Miss to local memory
  – Remote miss to home
  – 3-network-hop miss to remote cache

Lec 13.10: CS 252 Administrivia
• Quiz #1: Wed March 7, 5:30-8:30, 306 Soda
  – No lecture
• La Val's afterward: free food and drink
• 3 questions
• Bring pencils
• Bring sheet of paper with notes on 2 sides
• Bring calculator (but don't store notes in calculator)

Lec 13.11: SGI Origin 2000
• A pure NUMA
• 2 CPUs per node
• Scales up to 2048 processors
• Designed for scientific computation vs. commercial processing
• Scalable bandwidth is crucial to Origin

Lec 13.12: Parallel App: Scientific/Technical
• FFT Kernel: 1D complex-number FFT
  – 2 matrix transpose phases => all-to-all communication
  – Sequential time for n data points: O(n log n)
  – Example is a 1-million-point data set
• LU Kernel: dense matrix factorization
  – Blocking helps cache miss rate, 16x16 (see the sketch below)
  – Sequential time for n x n matrix: O(n^3)
  – Example is a 512 x 512 matrix
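The LU kernel's 16x16 blocking is easiest to see in code. Below is a minimal sketch in C of a blocked, right-looking LU factorization, not the SPLASH-2 kernel itself: it assumes pivoting is unnecessary (nonzero pivots), row-major storage, and a panel width b standing in for the slide's blocking factor. All names are illustrative.

    #include <stdio.h>

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    /* Blocked right-looking LU factorization, in place, no pivoting.
       L (unit lower) and U overwrite the row-major n x n matrix A.
       The panel width b is the blocking factor; the slide's kernel
       uses 16 so the tiles of the update stay cache-resident. */
    static void lu_blocked(double *A, int n, int b) {
        for (int k = 0; k < n; k += b) {
            int kb = MIN(b, n - k);
            /* 1. Unblocked LU of the tall panel A[k:n, k:k+kb]. */
            for (int p = k; p < k + kb; p++)
                for (int i = p + 1; i < n; i++) {
                    A[i*n + p] /= A[p*n + p];             /* multiplier */
                    for (int j = p + 1; j < k + kb; j++)  /* stay in panel */
                        A[i*n + j] -= A[i*n + p] * A[p*n + j];
                }
            /* 2. Triangular solve U12 = inv(L11) * A12 (L11 is unit lower). */
            for (int p = k; p < k + kb; p++)
                for (int i = p + 1; i < k + kb; i++)
                    for (int j = k + kb; j < n; j++)
                        A[i*n + j] -= A[i*n + p] * A[p*n + j];
            /* 3. A22 -= L21 * U12: the GEMM-like update that dominates the
               O(n^3) work; blocking keeps its operands in cache. */
            for (int i = k + kb; i < n; i++)
                for (int p = k; p < k + kb; p++) {
                    double lip = A[i*n + p];
                    for (int j = k + kb; j < n; j++)
                        A[i*n + j] -= lip * A[p*n + j];
                }
        }
    }

    int main(void) {
        /* Tiny 4 x 4 demo with panel width 2; the slide's run would be
           n = 512, b = 16. */
        double A[16] = { 4, 3, 2, 1,
                         3, 4, 3, 2,
                         2, 3, 4, 3,
                         1, 2, 3, 4 };
        lu_blocked(A, 4, 2);
        for (int i = 0; i < 4; i++)
            printf("%7.3f %7.3f %7.3f %7.3f\n",
                   A[i*4 + 0], A[i*4 + 1], A[i*4 + 2], A[i*4 + 3]);
        return 0;
    }

Grouping b rank-1 updates into one pass over the trailing matrix (step 3) is what cuts the miss rate: each block of A22 is loaded from memory once per panel instead of once per column.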
Lec 13.13: FFT Kernel
[Stacked-bar chart, y-axis 0.0-5.5: average cycles per memory reference vs. processor count (8, 16, 32, 64); segments: Hit, Miss to local memory, Remote miss to home, 3-hop miss to remote cache.]

Lec 13.14: LU Kernel
[Same chart for the LU kernel: average cycles per memory reference vs. processor count (8, 16, 32, 64), same four segments.]

Lec 13.15: Parallel App: Scientific/Technical
• Barnes App: Barnes-Hut n-body algorithm solving a problem in galaxy evolution
  – n-body algorithms rely on forces dropping off with distance; if far enough away, can ignore (e.g., gravity is 1/d^2)
  – Sequential time for n data points: O(n log n)
  – Example is 16,384 bodies
• Ocean App: Gauss-Seidel multigrid technique to solve a set of elliptic partial differential equations
  – Red-black Gauss-Seidel colors the points in the grid so points are consistently updated from the previous values of adjacent neighbors (a sweep sketch appears at the end of this preview)
  – Multigrid solves the finite-difference equations by iteration using hierarchical grids
  – Communication when a boundary is accessed by an adjacent subgrid
  – Sequential time for n x n grid: O(n^2)
  – Input: 130 x 130 grid points, 5 iterations

Lec 13.16: Barnes App
[Same chart for Barnes: average cycles per memory reference vs. processor count (8, 16, 32, 64), same four segments.]

Lec 13.17: Ocean App
[Same chart for Ocean: average cycles per reference vs. processor count (8, 16, 32, 64); segments: Cache hit, Local miss, Remote miss, 3-hop miss to remote cache.]

Lec 13.18: Cross-Cutting Issues: Performance Measurement of Parallel Processors
• Performance: how well they scale as processors are added
• Speedup at a fixed size as well as scaleup of the problem
  – Assume a benchmark of size n on p processors makes sense: how to scale the benchmark to run on m * p processors?
  – Memory-constrained scaling: keep the amount of memory used per processor constant
  – Time-constrained scaling: keep total execution time, assuming perfect speedup, constant
• Example: 1 hour on 10 P, time ~ O(n^3); what about 100 P? (a worked version appears at the end of this preview)
  – Time-constrained scaling: still 1 hour => problem grows to 10^(1/3) * n ~ 2.15n
  – Memory-constrained scaling: size grows to 10n => 10^3 / 10 = 100X the time, or 100 hours! 10X the processors for 100X longer???
  – Need to know the application well to scale it: # iterations, error tolerance

Lec 13.19: Cross-Cutting Issues: Memory System Issues
• Multilevel cache hierarchy + multilevel inclusion (every level of the cache hierarchy is a subset of the next level) can reduce contention between coherence traffic and processor traffic
  – Hard if cache blocks have different sizes
• Also issues in memory consistency models and speculation, nonblocking caches, prefetching

Lec 13.20: Example: Sun Wildfire Prototype
• Connect 2-4 SMPs via optional NUMA technology
  – Use "off-the-shelf" SMPs as the building block
• For example, E6000 with up to 15 processor or I/O boards (2 CPUs/board)
  – Gigaplane bus interconnect, 3.2 GBytes/sec
• Wildfire Interface board (WFI) replaces a CPU board => up to 112 processors (4 x 28)
  – WFI board supports one coherent address space across the 4 SMPs
  – Each WFI has 3 ports connecting up to 3 additional nodes, each with a dual directional 800 MB/sec connection
  – Has a directory cache in the WFI interface: local or clean is OK, otherwise sent to home node
  – Multiple bus transactions

Lec 13.21: Example: Sun Wildfire Prototype (cont.)
• To reduce contention for page, has ...
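Back on the Ocean app (Lec 13.15): the red-black ordering is simple to make concrete. The following is a minimal single-grid sketch in C of a red-black Gauss-Seidel sweep of a 5-point Laplace stencil, using the slide's 130 x 130 grid and 5 iterations; the boundary condition and the equation solved are stand-ins, and the multigrid hierarchy the real app uses is not attempted here.

    #include <stdio.h>

    #define N 130   /* grid dimension, as in the slide's input */

    static double g[N][N];

    /* One red-black Gauss-Seidel sweep of the 5-point Laplace stencil.
       Pass 0 updates the "red" points (i + j even), pass 1 the "black"
       points. Every red point's four neighbors are black and vice versa,
       so all points of one color depend only on the other color's
       previous values: they can be updated in parallel, and neighboring
       subgrids only ever exchange their boundary rows and columns. */
    static void red_black_sweep(void) {
        for (int color = 0; color < 2; color++)
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++)
                    if ((i + j) % 2 == color)
                        g[i][j] = 0.25 * (g[i-1][j] + g[i+1][j] +
                                          g[i][j-1] + g[i][j+1]);
    }

    int main(void) {
        for (int j = 0; j < N; j++)
            g[0][j] = 1.0;               /* fixed top boundary, as a demo */
        for (int it = 0; it < 5; it++)   /* 5 iterations, as on the slide */
            red_black_sweep();
        printf("g[1][%d] after 5 sweeps: %f\n", N / 2, g[1][N / 2]);
        return 0;
    }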


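And the arithmetic in the scaling example (Lec 13.18), spelled out. A tiny sketch assuming, as the slide does, that run time grows as n^3 and memory as n; compile with -lm:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double p_ratio = 10.0;   /* 10 processors -> 100 processors */

        /* Time-constrained: hold time at 1 hour. With 10x the compute,
           n^3 may grow 10x, so n grows by 10^(1/3) ~ 2.15. */
        printf("time-constrained: n scales by %.2f\n", cbrt(p_ratio));

        /* Memory-constrained: hold memory per processor constant, so n
           grows 10x. Work grows 10^3 = 1000x against only 10x the
           compute, so the run takes 1000 / 10 = 100x as long: 100 hours. */
        printf("memory-constrained: time scales by %.0fx\n",
               pow(p_ratio, 3.0) / p_ratio);
        return 0;
    }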
