CMU CS 15740 - keeton_isca98


Abstract

Commercial applications are an important, yet often overlooked, workload with significantly different characteristics from technical workloads. The potential impact of these differences is that computers optimized for technical workloads may not provide good performance for commercial applications, and these applications may not fully exploit advances in processor design. To evaluate these issues, we use hardware counters to measure architectural features of a four-processor Pentium Pro-based server running a TPC-C-like workload on an Informix database. We examine the effectiveness of out-of-order execution, branch prediction, speculative execution, superscalar issue and retire, caching and multiprocessor scaling. We find that out-of-order execution, superscalar issue and retire, and branch prediction are not as effective for database workloads as they are for technical workloads, such as SPEC. We find that caches are effective at reducing processor traffic to memory; even larger caches would be helpful to satisfy more data requests. Multiprocessor scaling of this workload is good, but even modest bus utilization degrades application memory latency, limiting database throughput.

1 Introduction

Commercial applications are an important class of applications with a large installed base. According to Dataquest, commercial applications, such as transaction processing and decision support database service, file service, media and email service, print service, and custom applications, were the dominant applications run on server machines in 1995 and are projected to be the dominant server applications in 2000 [25]. Commercial applications comprised about 85% of the 1995 server market, and are projected to continue this dominance as the server market grows 15 percent annually.
Database workloads alone motivate the sale of vast quantities of symmetric multiprocessing machines, and hold the dominant fraction of the massively parallel computing market [18]: databases motivated 32% of the server volume in 1995, and will motivate 39% of the 2000 server volume [25]. Despite the widespread usage of commercial applications, they are often ignored in preference to technical benchmarks, such as SPEC or LINPACK, in computer architecture performance studies. This bias is due largely to the lack of available representative multi-user traces of commercial applications, the proprietary nature of database performance information and source code, and the difficulty of properly configuring a system to run typical database benchmarks.

Commercial and technical applications have significantly different execution characteristics [15]. Commercial applications generally have a large number (e.g., 100s to 1000s) of concurrent users. As a result, they typically have high context switch rates and multiprogramming levels. They spend a substantial portion of their execution in the operating system. Commercial applications perform many I/O operations, in a random access pattern, with data spread over a wide portion of a disk. As a result, much of their execution time is spent waiting for I/O completions. Commercial applications perform data manipulation on strings or integers, in comparison with the extensive floating point activity in technical workloads. Unlike the small instruction working sets and tight loops of technical applications, commercial applications execute fewer loop instructions, and often use non-looping branch instructions. Because of their branching behavior and data access patterns, commercial applications have been less able to effectively use the memory system of traditional workstation and server architectures.
Performance Characterization of a Quad Pentium Pro SMP Using OLTP Workloads

Kimberly Keeton*, David A. Patterson*, Yong Qiang He+, Roger C. Raphael+, and Walter E. Baker+

*Computer Science Division, University of California at Berkeley, 387 Soda Hall #1776, Berkeley, CA 94720-1776, {kkeeton,patterson}@cs.berkeley.edu
+Informix Software, Inc., 4100 Bohannon Drive, Menlo Park, CA 94025, {johnq,rogerr,web}@informix.com

To appear in Proceedings of the 25th International Symposium on Computer Architecture, Barcelona, Spain, June 1998.

The potential implication of these differences is profound: computers optimized for technical workloads may not provide good performance for commercial applications, and these applications may not exploit advances in processors at the same rate as SPEC. This problem is exacerbated by the fact that I/O and memory system performance improvement rates lag far behind processor performance improvements. As a result, it is important for computer architects to consider a wide range of applications when designing and evaluating architectures, especially those intended to be used in SMPs.

In this paper, we use hardware counters to measure architectural features of a four-processor Pentium Pro-based server running a commercial database executing a TPC-C-like workload. We vary several hardware and firmware configuration parameters, such as L2 cache size, main memory bandwidth, the number of processors and the number of outstanding bus transactions, to evaluate hardware design trade-offs. We examine the efficiency of caching, out-of-order execution, branch prediction, speculative execution, superscalar issue and retire and multiprocessor scaling.

We find that overall (e.g., database and operating system) CPI is roughly five times higher than the theoretical minimum CPI for the architecture, and much higher than the CPI of SPEC. Resource and instruction-related stalls comprise the majority of these cycles.
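To make the CPI comparison above concrete, the following minimal sketch shows how a CPI figure and its ratio to a theoretical minimum could be derived from two counter readings. The counter values are hypothetical, not measurements from this study. The minimum CPI of 1/3 is an assumption based on the Pentium Pro's ability to retire up to three micro-operations per cycle; the text above does not state the value it uses.

```python
# Assumption: the theoretical minimum CPI is taken as 1/3 (up to three
# retirements per cycle on the Pentium Pro). The text does not give this value.
ASSUMED_MIN_CPI = 1.0 / 3.0


def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction from two hardware counter readings."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def ratio_to_minimum(measured_cpi: float, min_cpi: float = ASSUMED_MIN_CPI) -> float:
    """How many times higher the measured CPI is than the minimum CPI."""
    return measured_cpi / min_cpi


if __name__ == "__main__":
    # Hypothetical counter readings chosen only for illustration.
    cycles, instructions = 1_650_000, 1_000_000
    measured = cpi(cycles, instructions)
    print(f"CPI = {measured:.2f}, ratio to assumed minimum = "
          f"{ratio_to_minimum(measured):.2f}")
```

Under these assumed numbers the ratio comes out close to five, which is the order of magnitude the text reports; real counter readings would replace the illustrative ones.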
While out-of-order execution is somewhat effective at hiding memory hierarchy latency and other stalls, it is less effective for database workloads than for SPEC. The branch prediction algorithms and hardware support do not work nearly as well for database workloads. Superscalar issue and retire is only marginally helpful for this workload.

Not surprisingly, we found that caches are effective at reducing the processor traffic to memory. Our data support the rule of thumb that doubling the L2 cache size gives about half the benefit seen from the previous doubling. While larger caches are effective, this benefit is not without consequences. Coherence traffic, in the form of cache misses to dirty data in other processors' caches, increases as caches get bigger, and as the number of processors increases. We find that the exclusive state of the four-state MESI cache coherence protocol is under-utilized, and could likely be
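The L2 cache rule of thumb stated above (each doubling gives about half the benefit of the previous doubling) can be illustrated with a short sketch. The starting benefit of 20 units is a hypothetical number, and "benefit" is left abstract here because the text above does not say which metric it is measured in.

```python
def doubling_benefits(first_benefit: float, doublings: int) -> list[float]:
    """Benefit of each successive cache-size doubling under the rule of thumb.

    Each doubling is assumed to yield exactly half the benefit of the
    previous one, which is an idealization of "about half".
    """
    return [first_benefit / (2 ** k) for k in range(doublings)]


if __name__ == "__main__":
    # Hypothetical: the first doubling yields 20 units of benefit.
    steps = doubling_benefits(20.0, 3)
    print(steps)       # per-doubling benefit
    print(sum(steps))  # cumulative benefit after three doublings
```

Under this idealization the cumulative benefit forms a geometric series that can never exceed twice the benefit of the first doubling, which is one way to read the diminishing returns the rule describes.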

