
Spring 2010
Prof. Hyesoon Kim

• Benchmarking is critical for making design decisions and for measuring performance.
– Performance evaluations drive design decisions.
– Earlier times: analytics-based evaluations.
– From the 90's: heavy reliance on simulation.
• Processor evaluations
– Workload characterization: better understand the workloads.
• Benchmarks
– Real applications and application suites
• E.g., SPEC CPU2000, SPEC CPU2006, TPC-C, TPC-H, EEMBC, MediaBench, PARSEC, SYSmark
– Kernels
• "Representative" parts of real applications
• Easier and quicker to set up and run
• Often not really representative of the entire application
– Toy programs, synthetic benchmarks, etc.
• Not very useful for reporting
• Sometimes used to test/stress specific functions/features
• Raw-rate metrics: GFLOPS, TFLOPS, MIPS (million instructions per second)
• Speedup of arithmetic means != arithmetic mean of speedups.
• Use the geometric mean for normalized execution times:
G.M. = (normalized time_1 × normalized time_2 × ... × normalized time_n)^(1/n)
• Neat property of the geometric mean: it is consistent whatever the reference machine.
• Do not use the arithmetic mean for normalized execution times.
• Often when making comparisons in comp-arch studies:
– The program (or set of programs) is the same for the two CPUs.
– The clock speed is the same for the two CPUs.
• So we can directly compare CPIs, and often we use IPCs.
• Average CPI = (CPI_1 + CPI_2 + ... + CPI_n) / n
• A.M. of IPC = (IPC_1 + IPC_2 + ... + IPC_n) / n
– Not equal to 1 / A.M.(CPI)!!!
– Must use the harmonic mean of IPC to remain proportional to runtime.
• H.M.(x_1, x_2, ..., x_n) = n / (1/x_1 + 1/x_2 + ... + 1/x_n)
• What in the world is this?
– An average of inverse relationships.
• "Average" IPC = 1 / A.M.(CPI)
= n / (CPI_1 + CPI_2 + ... + CPI_n)
= n / (1/IPC_1 + 1/IPC_2 + ... + 1/IPC_n)
= H.M.(IPC)
• Stanford graphics benchmarks
– Simple graphics workload.
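The mean rules above can be checked numerically. A minimal sketch (all timing, CPI, and IPC values here are made up for illustration):

```python
# Sketch with hypothetical numbers: the geometric mean of normalized execution
# times gives the same machine ranking whichever machine is the reference,
# while the arithmetic mean of normalized times does not. Also checks that
# H.M.(IPC) == 1 / A.M.(CPI).
import math

# Hypothetical execution times (seconds) of two programs on two machines.
times_A = [10.0, 100.0]
times_B = [20.0, 25.0]

def normalize(times, ref):
    """Normalize each program's time to the reference machine."""
    return [t / r for t, r in zip(times, ref)]

def amean(xs):
    return sum(xs) / len(xs)

def gmean(xs):
    return math.prod(xs) ** (1 / len(xs))

def hmean(xs):
    return len(xs) / sum(1 / x for x in xs)

# Geometric mean: the ratio G.M.(B) / G.M.(A) is identical for both
# choices of reference machine.
for ref in (times_A, times_B):
    print(gmean(normalize(times_B, ref)) / gmean(normalize(times_A, ref)))

# Arithmetic mean of normalized times: the ranking flips with the reference.
for ref in (times_A, times_B):
    print(amean(normalize(times_B, ref)) / amean(normalize(times_A, ref)))

# Harmonic mean of per-program IPCs equals 1 / A.M.(CPI), since IPC = 1/CPI.
cpis = [0.5, 2.0, 4.0]
ipcs = [1 / c for c in cpis]
print(abs(hmean(ipcs) - 1 / amean(cpis)) < 1e-12)
```

Swapping the reference machine rescales every normalized time, but the rescaling cancels inside the geometric mean's product, which is why that ratio is reference-independent; no such cancellation happens inside a sum, which is why the arithmetic-mean ratio can even reverse the ranking.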
– Academic
• GPU benchmarks: mostly game applications
– 3DMark: http://www.futuremark.com/benchmarks/3dmarkvantage
– Tom's Hardware
• Graphics is still the major performance bottleneck.
• Previous research: emphasis on graphics.
• Several genres of video games
– First-person shooters
• Fast-paced, graphically enhanced
• Focus of this presentation
– Role-playing games
• Lower graphics and slower play
– Board games
• Just plain boring
[Figure: per-frame game pipeline – Event, Physics, Collision Detection, Particle, AI, Rendering (computing), Display]
• Current game design principle:
– Higher frame rates imply better game quality.
• Recent study on frame rates [Claypool et al. MMCN 2006]:
– Very high frame rates are not necessary; very low frame rates impact game quality severely.
[Figure: snapshots of animation over time (Davis et al. Eurographics 2003) – game workload broken into computational, rendering, rasterization, and other workloads]
• Case study
– Workload characterization of 3D games, Roca et al., IISWC 2006 [WOR]
– Uses ATTILA
• Characterized metrics:
– Average primitives per frame
– Average vertex shader instructions
– Vertex cache hit ratio
– System bus bandwidths
– Percentage of clipped, culled, and traversed triangles
– Average triangle sizes
• ATTILA: a GPU execution-driven simulator
– https://attilaac.upc.edu/wiki/index.php/Architecture
– Can simulate OpenGL at this moment.
[Figure: ATTILA architecture – index buffer, vertex cache, vertex request buffer, streamer, primitive assembly, clipping, triangle setup, fragment generation, hierarchical Z (HZ cache, hierarchical Z buffer), Z caches with Z tests, interpolator, color caches with blend units, memory controllers MC0–MC3, register file, texture cache/address/filter, unified shader units]

Unit                      | Size   | Element width
Streamer                  | 48     | 16x4x32 bits
Primitive Assembly        | 8      | 3x16x4x32 bits
Clipping                  | 4      | 3x4x32 bits
Triangle Setup            | 12     | 3x4x32 bits
Fragment Generation       | 16     | 3x4x32 bits
Hierarchical Z            | 64     | (2x16+4x32)x4 bits
Z Tests                   | 64     | (2x16+4x32)x4 bits
Interpolator              | ---    | ---
Color Write               | 64     | (2x16+4x32)x4 bits
Unified Shader (vertex)   | 12+4   | 16x4x32 bits
Unified Shader (fragment) | 240+16 | 10x4x32 bits

Table 2.
Queue sizes and number of threads in the ATTILA reference architecture.

• Execution-driven simulation:
– Correctness; long development time
– Executes the binary
• Trace-driven simulation:
– Easy to develop
– Simulation time can be shortened
– Large trace file sizes
• Analytical models:
– No simulation is required; used to provide insights
– Statistical methods
– CPU: first-order models
– GPU: warp-level parallelism
• Hardware performance counters
– Built-in counters (instruction count, cache misses, branch mispredictions)
– Profilers
• Architecture simulators
– Characterized items: cache misses, branch mispredictions, row-buffer hit ratio
• Lab setting
• Recommended deadline (1/25)
– No penalty until 1/27
• Newsgroup:
– Active participants will get extra credit.
• Lab assignment TAing
– Volunteers
– Graduate students (who have taken the CS6290 course)
– Send email to
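The trace-driven simulation approach described above can be illustrated with a toy model: replay a recorded address trace through a cache and report the miss ratio, with no program execution needed. This is only a sketch; the trace, cache geometry, and function name are made up for illustration.

```python
# Minimal trace-driven simulation sketch (hypothetical trace and parameters):
# feed a recorded stream of memory addresses through a direct-mapped cache
# model and count hits/misses -- one of the "characterized items" above.

def simulate_cache(trace, num_sets=64, block_bytes=64):
    """Direct-mapped cache: one tag per set; returns (hits, misses)."""
    tags = [None] * num_sets
    hits = misses = 0
    for addr in trace:
        block = addr // block_bytes      # drop the block-offset bits
        index = block % num_sets         # set this block maps to
        tag = block // num_sets          # remaining high bits
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag            # fill the set on a miss
    return hits, misses

# Made-up trace: sweep a 16 KB array twice, 4 bytes per access. The array is
# larger than the 4 KB cache, so the second sweep misses on every new block.
trace = [0x1000 + 4 * i for i in range(4096)] * 2
hits, misses = simulate_cache(trace)
print(f"miss ratio = {misses / len(trace):.3f}")
```

Because the simulator consumes only the trace, it is easy to develop and can be run (and re-run with different cache parameters) without the original binary, at the cost of storing a potentially very large trace file.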



GT CS 4803 - LECTURE NOTES
