Latency as a Performability Metric: Experimental Results
Pete ([email protected])

Outline
1. Motivation and background
• Performability overview
• Project summary
2. Test setup
• PRESS web server
• Mendosus fault injection system
3. Experimental results & analysis
• How to represent latency
• Questions for future research

Performability overview
• Goal of ROC project: develop metrics to evaluate new recovery techniques
• Performability – a class of metrics describing how a system performs in the presence of faults
  – First used in the fault-tolerant computing field¹
  – Now being applied to online services
¹ J. F. Meyer, "Performability Evaluation: Where It Is and What Lies Ahead," 1994

Example: microbenchmark
[Figure: RAID disk failure microbenchmark]

Project motivation
• Rutgers study: performability analysis of a web server, using throughput
• Other studies (esp. from the HP Labs Storage group) also use response time as a metric
• Assertion: latency and data quality are better than throughput for describing the user experience
• How best to represent latency in performability reports?

Project overview
• Goals:
1. Replicate the PRESS/Mendosus study with response-time measurements
2. Discuss how to incorporate latency into performability statistics
• Contributions:
1. Provide a latency-based analysis of a web server's performability (currently rare)
2.
Further the development of more comprehensive dependability benchmarks

Experiment components
• The Mendosus fault injection system
  – From Rutgers (Rich Martin)
  – Goal: low-overhead emulation of a cluster of workstations, with injection of likely faults
• The PRESS web server
  – Cluster-based, uses cooperative caching; designed by Carreira et al. (Rutgers)
  – Perf-PRESS: basic version
  – HA-PRESS: incorporates heartbeats and a master node for automated cluster management
• Client simulators
  – Submit a set number of requests/sec, based on real traces

Mendosus design
[Diagram: a Global Controller (Java) reads fault, LAN-emulation and application config files and drives a user-level daemon (Java) on each workstation (real or VM); workstations run apps over a modified NIC driver, SCSI module and proc module, connected by an emulated LAN]

Experimental setup
[Figure: experimental setup]

Fault types

  Category    | Fault                 | Possible root cause
  ------------|-----------------------|--------------------------------------------------
  Node        | Node crash            | Operator error, OS bug, hardware component failure, power outage
  Node        | Node freeze           | OS or kernel module bug
  Application | App crash             | Application bug or resource unavailability
  Application | App hang              | Application bug or resource contention with other processes
  Network     | Link down or flaky    | Broken, damaged or misattached cable
  Network     | Switch down or flaky  | Damaged or misconfigured switch, power outage

Test case timeline
– Warm-up time: 30-60 seconds
– Time to repair: up to 90 seconds

Simplifying assumptions
• Operator repairs any non-transient failure after 90 seconds
• Web page size is constant
• Faults are independent
• Each client request is independent of all others (no sessions!)
  – Request arrival times are determined by a Poisson process (not self-similar)
• Simulated clients abandon a connection attempt after 2 secs and give up on a page load after 8 secs

Sample result: app crash
[Figures: # of requests (success / aborted at 2 s / timed out at 8 s) and avg. response time vs. time elapsed over 180 s, for Perf-PRESS and HA-PRESS]
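The client model in the simplifying assumptions (Poisson arrivals, 2-second connection abandonment, 8-second page-load give-up) can be sketched as below. This is an illustrative reconstruction, not the study's actual simulator; the function names and the split into connect delay vs. load time are my assumptions.

```python
import random

def classify(connect_delay, load_time):
    """Apply the simulated clients' give-up rules (assumed split of
    response time into connect delay + page-load time): abandon the
    connection attempt after 2 s, give up on the page load after 8 s."""
    if connect_delay > 2.0:
        return "aborted"      # abandoned connection attempt
    if connect_delay + load_time > 8.0:
        return "timed_out"    # gave up on page load
    return "success"

def poisson_arrivals(rate_per_sec, duration_secs, rng=random):
    """Yield request arrival times from a Poisson process:
    independent requests with exponentially distributed
    inter-arrival gaps (not self-similar traffic)."""
    t = rng.expovariate(rate_per_sec)
    while t < duration_secs:
        yield t
        t += rng.expovariate(rate_per_sec)
```

Counting `classify` outcomes per time bucket over a simulated run would reproduce the success / aborted (2s) / timed out (8s) breakdown shown in the sample-result charts.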
Sample result: node hang
[Figures: # of requests (success / aborted at 2 s / timed out at 8 s) and avg. response time vs. time elapsed over 410 s, for Perf-PRESS and HA-PRESS]

Representing latency
• Total seconds of wait time
  – Not good for comparing cases with different workloads
• Average (mean) wait time per request
  – OK, but requires that the expected (normal) response time be given separately
• Variance of wait time
  – Not very intuitive to describe.
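A minimal sketch of the three candidate representations above, computed from a list of per-request wait times with the standard-library statistics module (the function name is mine):

```python
import statistics

def latency_summaries(wait_times_secs):
    """The three candidate latency representations: total wait
    (workload-dependent), mean wait per request (needs a separate
    'normal' baseline to interpret), and variance (one-sided under
    a read-only workload, since waits only grow under faults)."""
    return {
        "total": sum(wait_times_secs),
        "mean": statistics.fmean(wait_times_secs),
        "variance": statistics.pvariance(wait_times_secs),
    }
```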
  – Also, a read-only workload means that all variance is toward longer wait times anyway

Representing latency (2)
• Consider "goodput"-based availability:
    total responses served / total requests
• Idea: a latency-based "punctuality":
    ideal total latency / actual total latency
• Like goodput, its maximum value is 1
• "Ideal" total latency: average latency for non-fault cases × total # of requests (shouldn't be 0)

Representing latency (3)
• Aggregate punctuality ignores brief, severe spikes in wait time (bad for user experience)
  – Can capture these in a separate statistic (e.g., 1% of 100k responses took >8 sec)

Availability and punctuality
[Charts: throughput-based availability (0-1) and latency-based "punctuality" (0-0.12) of Perf-PRESS vs. HA-PRESS under five fault scenarios: app hang, app crash, node crash, node freeze, link down]

Other metrics
• Data quality, latency and throughput are interrelated
  – Is a 5-second wait for a response "worse" than waiting 1 second to get a "try back later"?
• To combine DQ, latency and throughput, can use a "demerit" system (proposed by Keynote)¹
  – These can be very
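The goodput-based availability and latency-based punctuality indices defined under "Representing latency (2)", plus the separate spike statistic from "Representing latency (3)", could be computed along these lines (a sketch under the slides' definitions; function and parameter names are mine, not the study's):

```python
def availability(responses_served, total_requests):
    """Goodput-based availability: total responses served / total requests."""
    return responses_served / total_requests

def punctuality(wait_times_secs, normal_mean_latency_secs):
    """Latency-based punctuality: ideal total latency / actual total latency.
    'Ideal' total latency = average latency for non-fault cases x number of
    requests (must be > 0). Like goodput availability, the maximum is 1."""
    ideal = normal_mean_latency_secs * len(wait_times_secs)
    return ideal / sum(wait_times_secs)

def tail_fraction(wait_times_secs, threshold_secs=8.0):
    """Separate spike statistic: fraction of responses slower than a
    threshold (e.g. '1% of 100k responses took > 8 sec')."""
    slow = sum(1 for w in wait_times_secs if w > threshold_secs)
    return slow / len(wait_times_secs)
```

For example, a run that serves 900 of 1000 requests has availability 0.9; a run whose total wait is twice the ideal has punctuality 0.5, regardless of workload size.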

