CS6810, School of Computing, University of Utah

Quantitative Analysis

Today's topics:
• failure analysis
• performance analysis
• some basic quantitative principles
• caution - potholes: it's easy to lie with numbers

Some Issues So Far
• And it's only the 2nd class
• You'll note my preference: conceptual stuff in the lectures, practical stuff in the homeworks
  » give me feedback when this approach isn't good enough
• Text isn't in the bookstore - major screw-up
  » due to a late teaching-assignment change
  » order it online: it'll be faster and cheaper
• Homework #1 will be on the web later today - make sure you start early
  » holiday weekend ahead: maybe you'd like to enjoy it

Reliability
• Reliability is a key concern in some segments
  mission-critical embedded systems
  » e.g. nuclear power plants, automotive, aero & space, ...
  when high availability is needed
  » either due to monetary loss or contract (SLAs and SLOs)
• Weakest-link theory
• Useful acronyms (note these are averages, and "user mileage may vary")
  » MTTF: mean time to failure
  » MTTR: mean time to repair
  » MTBF (B = between) = MTTF + MTTR
  » availability = MTTF / MTBF
  » simple for a module; more complex for a larger system

Failure Mechanisms
• 2 types
  hard: permanent failure
  transient: temporary failure
  » due to environmental issues
    • alpha particles, heat, cross-talk, noise, vibration, ...
• Device-specific (a small set of examples)
  ICs
  » transistors can fail due to excess heat and current
    • extremely reliable in general
  » wires fail due to excess current: "metal migration"
  Disks (check out the recent Google paper on this)
  » MHDs: oxide deterioration, head saturation, coil-motor accuracy
  » SSDs: block-erase oxide thinning
  DRAMs (check out the recent Google paper on this too!)
» IC’s but alpha particles disrupt stored chargePage 2 5 CS6810 School of Computing University of Utah Improving Reliability • 2 strategies build more reliable devices » more costly & a very slippery slope use more of them redundancy • Redundancy shows up in lots of costumes extra bits – CRC & ECC codes » even more exotic: Turbo, Viterbi, etc. extra gates and wires » seldom used today redundant blocks » 2: compare and signal error if they don’t agree » some odd number: vote and take majority, flag anyway redundant everything » retry elsewhere if something fails hybrid » e.g. NAND Flash – ECC on block, quarantine block before things get nasty 6 CS6810 School of Computing University of Utah Performance • 2 aspects throughput: rate of completion of multiple jobs, processes, or threads single thread performance or execution time making one better usually degrades the other • Comparing: performance = 1/execution_time similar game for throughput comparisons 7 CS6810 School of Computing University of Utah Measuring Performance • Tricky in today’s multiprocessing world alias factors » elapsed time (stopwatch) is load dependent » context switch • process is swapped out part of the time it’s supposedly running » page faults • only fair if your workload is the only one running » I/O delays • processing may be dwarfed by slow I/O response time » OS overheads • fair if OS service is important part of your workload • unfair if service to other workloads are observed • Fortunately tools exist to help break out time into different bins » still some cruft gets swept under the rug 8 CS6810 School of Computing University of Utah Tools • Unix time command otb> time » 0.898u 0.311s 2:39.79 0.7% 0+0k 0+0io 9pf+02 meaning » u = seconds of user process execution time » s = seconds of system execution time (OS) » 2:39.79 minutes of elapsed time • includes page faults, I/O overhead, etc. (a.k.a. 
external overheads)
  » k = KB of text + data used
  » io = amount of I/O sent
  » pf = major plus minor page faults
    • major: the page was on disk
    • minor: TLB miss, but the page is in main memory (DRAM)
  Beware: OS "system time" is undervalued
  » call and return linkages are usually charged to user time
• Higher fidelity: use on-chip counters via a tool like Intel's VTune

Lots of Performance Analysis Tools
• The key is to learn what they're good at
  some are good at
  » tracking certain HW events: cache misses, TLB misses, IPC
  » coarse-grained phase changes
    • aggregate finer details into a larger "average"
• The point: use the right tool for the job
  seems obvious, but often users don't get it
• Some things are very hard
  each tool has a "probe effect"
  » often hard to determine the overhead
    • partially because it may be inconsistent

Evaluating Machines
• Which programs do you choose?
  real programs
  » ideal but problematic
    • you can't just read about them
    • it's a lot of work
    • what you care about may be diverse and change over time
  kernels
  » computationally intensive pieces of your programs
    • same problems as above, PLUS
      - you have to profile your code to find the right stuff
      - intuition about where the time goes is suspect
    • use existing kernels, e.g. Livermore Loops and Linpack
      - small loops over big data sets
      - good chance they don't represent your computational needs
      - not real programs anyway; they just stress the CPU
• What would you do? (without looking at the next slide!)

Benchmarks
• Industry-standard reporting mechanism
  burden
  » need to understand what the benchmark measures
    • int, float, cache, main memory, interconnect, ...
  » enormous diversity in today's benchmarks
• Common benchmark suites
  SPEC:
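The user/system/elapsed breakdown that the Unix time command reports (Tools slide) can also be collected from inside a program. A minimal sketch in Python, assuming a Unix system (the `measure` helper and the toy workload are illustrative, not from the slides):

```python
import time
import resource  # Unix-only: per-process CPU-time accounting

def measure(workload):
    """Run workload() and report user, system, and elapsed (wall) time,
    roughly the u / s / elapsed fields of the Unix time command."""
    r0 = resource.getrusage(resource.RUSAGE_SELF)
    w0 = time.perf_counter()
    workload()
    w1 = time.perf_counter()
    r1 = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "user":    r1.ru_utime - r0.ru_utime,  # CPU seconds in user mode
        "system":  r1.ru_stime - r0.ru_stime,  # CPU seconds in the OS
        "elapsed": w1 - w0,                    # wall clock: load dependent
    }

# CPU-bound toy workload; on a loaded machine, elapsed can far exceed
# user + system, which is exactly the "stopwatch is load dependent" point.
stats = measure(lambda: sum(i * i for i in range(1_000_000)))
```

Note that, as the slides warn, OS "system time" is undervalued here too: work the kernel does on the process's behalf in interrupt context is not charged to either bucket.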
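The reliability arithmetic from the Reliability slide (MTBF = MTTF + MTTR, availability = MTTF/MTBF) can be sketched in a few lines; the module numbers below are made up for illustration:

```python
def availability(mttf_hours, mttr_hours):
    """availability = MTTF / MTBF, where MTBF = MTTF + MTTR."""
    mtbf = mttf_hours + mttr_hours
    return mttf_hours / mtbf

# Hypothetical module: fails on average every 10,000 hours and
# takes 10 hours to repair.
a = availability(10_000, 10)
print(f"{a:.6f}")  # 10000 / 10010 -> 0.999001
```

Note these are averages over a fleet ("user mileage may vary"), and that composing module availabilities into system availability is the harder problem the slide flags.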
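The "some odd number: vote and take the majority, flag anyway" form of block redundancy (Improving Reliability slide) can be sketched as a triple-modular-redundancy voter; the replica functions here are hypothetical stand-ins for redundant hardware blocks:

```python
from collections import Counter

def tmr_vote(replicas, *args):
    """Run an odd number of redundant replicas, return the majority
    answer, and flag a fault whenever they are not unanimous."""
    results = [replica(*args) for replica in replicas]
    winner, votes = Counter(results).most_common(1)[0]
    fault_detected = votes < len(results)  # "flag anyway", per the slide
    return winner, fault_detected

# Two healthy replicas and one hit by a (hypothetical) transient bit flip.
good = lambda x: x + 1
flipped = lambda x: (x + 1) ^ 0x4  # single-bit upset in the result
value, fault = tmr_vote([good, good, flipped], 41)
# Majority voting masks the transient error but still reports it.
```

The two-copy variant on the same slide can only detect a disagreement, not mask it; that is why masking schemes use an odd replica count.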