Abstract

Commercial applications are an important, yet often overlooked, workload with significantly different characteristics from technical workloads. The potential impact of these differences is that computers optimized for technical workloads may not provide good performance for commercial applications, and these applications may not fully exploit advances in processor design. To evaluate these issues, we use hardware counters to measure architectural features of a four-processor Pentium Pro-based server running a TPC-C-like workload on an Informix database. We examine the effectiveness of out-of-order execution, branch prediction, speculative execution, superscalar issue and retire, caching, and multiprocessor scaling. We find that out-of-order execution, superscalar issue and retire, and branch prediction are not as effective for database workloads as they are for technical workloads, such as SPEC. We find that caches are effective at reducing processor traffic to memory; even larger caches would be helpful to satisfy more data requests. Multiprocessor scaling of this workload is good, but even modest bus utilization degrades application memory latency, limiting database throughput.

1 Introduction

Commercial applications are an important class of applications with a large installed base. According to Dataquest, commercial applications, such as transaction processing and decision support database service, file service, media and email service, print service, and custom applications, were the dominant applications run on server machines in 1995 and are projected to be the dominant server applications in 2000 [25]. Commercial applications comprised about 85% of the 1995 server market, and are projected to continue this dominance as the server market grows 15% annually.
Database workloads alone motivate the sale of vast quantities of symmetric multiprocessing machines, and hold the dominant fraction of the massively parallel computing market [18]: databases motivated 32% of the server volume in 1995, and will motivate 39% of the 2000 server volume [25]. Despite the widespread usage of commercial applications, they are often ignored in preference to technical benchmarks, such as SPEC or LINPACK, in computer architecture performance studies. This bias is due largely to the lack of available representative multi-user traces of commercial applications, the proprietary nature of database performance information and source code, and the difficulty of properly configuring a system to run typical database benchmarks.

Commercial and technical applications have significantly different execution characteristics [15]. Commercial applications generally have a large number (e.g., 100s to 1000s) of concurrent users. As a result, they typically have high context switch rates and multiprogramming levels. They spend a substantial portion of their execution in the operating system. Commercial applications perform many I/O operations, in a random access pattern, with data spread over a wide portion of a disk. As a result, much of their execution time is spent waiting for I/O completions. Commercial applications perform data manipulation on strings or integers, in comparison with the extensive floating point activity in technical workloads. Unlike the small instruction working sets and tight loops of technical applications, commercial applications execute fewer loop instructions, and often use non-looping branch instructions. Because of their branching behavior and data access patterns, commercial applications have been less able to effectively use the memory system of traditional workstation and server architectures.
Performance Characterization of a Quad Pentium Pro SMP Using OLTP Workloads
Kimberly Keeton*, David A. Patterson*, Yong Qiang He+, Roger C. Raphael+, and Walter E. Baker+
*Computer Science Division, University of California at Berkeley, 387 Soda Hall #1776, Berkeley, CA 94720-1776, {kkeeton,patterson}@cs.berkeley.edu
+Informix Software, Inc., 4100 Bohannon Drive, Menlo Park, CA 94025, {johnq,rogerr,web}@informix.com
To appear in Proceedings of the 25th International Symposium on Computer Architecture, Barcelona, Spain, June 1998.

The potential implication of these differences is profound: computers optimized for technical workloads may not provide good performance for commercial applications, and these applications may not exploit advances in processors at the same rate as SPEC. This problem is exacerbated by the fact that I/O and memory system performance improvement rates lag far behind processor performance improvements. As a result, it is important for computer architects to consider a wide range of applications when designing and evaluating architectures, especially those intended to be used in SMPs.

In this paper, we use hardware counters to measure architectural features of a four-processor Pentium Pro-based server running a commercial database executing a TPC-C-like workload. We vary several hardware and firmware configuration parameters, such as L2 cache size, main memory bandwidth, the number of processors, and the number of outstanding bus transactions, to evaluate hardware design trade-offs. We examine the efficiency of caching, out-of-order execution, branch prediction, speculative execution, superscalar issue and retire, and multiprocessor scaling.

We find that overall (i.e., database and operating system) CPI is roughly five times higher than the theoretical minimum CPI for the architecture, and much higher than the CPI of SPEC. Resource and instruction-related stalls comprise the majority of these cycles.
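The CPI comparison above can be illustrated with a small calculation. The following is a minimal sketch, not the authors' measurement procedure: it assumes a peak retire rate of three instructions per cycle (a figure this section does not state), and the counter readings are hypothetical placeholders chosen only so that the ratio comes out near the "roughly five times" reported in the text. The names `cpi`, `slowdown_vs_ideal`, and `PEAK_RETIRE_WIDTH` are illustrative.

```python
# Hypothetical sketch: CPI from two hardware counter readings.
# None of the numbers below come from the paper.

PEAK_RETIRE_WIDTH = 3  # assumption: instructions retired per cycle at peak


def cpi(cycles: int, instructions_retired: int) -> float:
    """Return cycles per instruction for one measurement interval."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return cycles / instructions_retired


def slowdown_vs_ideal(measured_cpi: float, peak_width: int = PEAK_RETIRE_WIDTH) -> float:
    """Return how many times the measured CPI exceeds the ideal CPI (1 / peak_width)."""
    return measured_cpi * peak_width


# Placeholder counter values chosen only to make the arithmetic easy to follow.
sample_cycles = 5_000_000
sample_instructions = 3_000_000

measured = cpi(sample_cycles, sample_instructions)
ratio = slowdown_vs_ideal(measured)
print(f"measured CPI = {measured:.2f}, "
      f"ideal CPI = {1 / PEAK_RETIRE_WIDTH:.2f}, ratio = {ratio:.1f}x")
```

Expressing the result as a ratio to the ideal CPI, rather than as a raw CPI, makes it directly comparable across machines with different issue and retire widths.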
While out-of-order execution is somewhat effective at hiding memory hierarchy latency and other stalls, it is less effective for database workloads than for SPEC. The branch prediction algorithms and hardware support do not work nearly as well for database workloads. Superscalar issue and retire is only marginally helpful for this workload.

Not surprisingly, we found that caches are effective at reducing the processor traffic to memory. Our data support the rule of thumb that doubling the L2 cache size gives about half the benefit seen from the previous doubling. While larger caches are effective, this benefit is not without consequences. Coherence traffic, in the form of cache misses to dirty data in other processors' caches, increases as caches get bigger, and as the number of processors increases. We find that the exclusive state of the four-state MESI cache coherence protocol is under-utilized, and could likely be