To appear in Proceedings of the 25th International Symposium on Computer Architecture, Barcelona, Spain, June 1998.

Performance Characterization of a Quad Pentium Pro SMP Using OLTP Workloads

Kimberly Keeton, David A. Patterson, Yong Qiang He, Roger C. Raphael, and Walter E. Baker

Computer Science Division, University of California at Berkeley, 387 Soda Hall #1776, Berkeley, CA 94720-1776
{kkeeton, patterson}@cs.berkeley.edu

Informix Software, Inc., 4100 Bohannon Drive, Menlo Park, CA 94025
{johnq, rogerr, web}@informix.com

Abstract

Commercial applications are an important, yet often overlooked, workload with significantly different characteristics from technical workloads. The potential impact of these differences is that computers optimized for technical workloads may not provide good performance for commercial applications, and these applications may not fully exploit advances in processor design. To evaluate these issues, we use hardware counters to measure architectural features of a four-processor Pentium Pro-based server running a TPC-C-like workload on an Informix database. We examine the effectiveness of out-of-order execution, branch prediction, speculative execution, superscalar issue and retire, caching, and multiprocessor scaling. We find that out-of-order execution, superscalar issue and retire, and branch prediction are not as effective for database workloads as they are for technical workloads, such as SPEC. We find that caches are effective at reducing processor traffic to memory; even larger caches would be helpful to satisfy more data requests. Multiprocessor scaling of this workload is good, but even modest bus utilization degrades application memory latency, limiting database throughput.

1 Introduction

Commercial applications are an important class of applications with a large installed base. According to Dataquest, commercial applications, such as transaction processing and decision support database service, file service, media and email service, print service, and custom applications, were the dominant applications run on server machines in 1995, and are projected to be the dominant server applications in 2000 [25]. Commercial applications comprised about 85% of the 1995 server market, and are projected to continue this dominance as the server market grows 15 percent annually. Database workloads alone motivate the sale of vast quantities of symmetric multiprocessing machines and hold the dominant fraction of the massively parallel computing market [18]: databases motivated 32% of the server volume in 1995 and will motivate 39% of the 2000 server volume [25].

Despite the widespread usage of commercial applications, they are often ignored in preference to technical benchmarks, such as SPEC or LINPACK, in computer architecture performance studies. This bias is due largely to the lack of available representative multi-user traces of commercial applications, the proprietary nature of database performance information and source code, and the difficulty of properly configuring a system to run typical database benchmarks.

Commercial and technical applications have significantly different execution characteristics [15]. Commercial applications generally have a large number (e.g., 100s to 1000s) of concurrent users. As a result, they typically have high context switch rates and multiprogramming levels. They spend a substantial portion of their execution in the operating system. Commercial applications perform many I/O operations in a random access pattern, with data spread over a wide portion of a disk. As a result, much of their execution time is spent waiting for I/O completions. Commercial applications perform data manipulation on strings or integers, in comparison with the extensive floating-point activity in technical workloads. Unlike the small instruction working sets and tight loops of technical applications, commercial applications execute fewer loop instructions and often use non-looping branch instructions. Because of their branching behavior and data access patterns, commercial applications have been less able to effectively use the memory system of traditional workstation and server architectures.

The potential implication of these differences is profound: computers optimized for technical workloads may not provide good performance for commercial applications, and these applications may not exploit advances in processors at the same rate as SPEC. This problem is exacerbated by the fact that I/O and memory system performance improvement rates lag far behind processor performance improvements. As a result, it is important for computer architects to consider a wide range of applications when designing and evaluating architectures, especially those intended to be used in SMPs.

In this paper, we use hardware counters to measure architectural features of a four-processor Pentium Pro-based server running a commercial database executing a TPC-C-like workload. We vary several hardware and firmware configuration parameters, such as L2 cache size, main memory bandwidth, the number of processors, and the number of outstanding bus transactions, to evaluate hardware design trade-offs. We examine the efficiency of caching, out-of-order execution, branch prediction, speculative execution, superscalar issue and retire, and multiprocessor scaling.

We find that overall (e.g., database and operating system) CPI is roughly five times higher than the theoretical minimum CPI for the architecture, and much higher than the CPI of SPEC. Resource and instruction-related stalls comprise the majority of these cycles. While out-of-order execution is somewhat effective at hiding memory hierarchy latency and other stalls, it is less effective for database workloads than for SPEC. The branch prediction algorithms and hardware support do not work nearly as well for database workloads. Superscalar issue and retire is only marginally helpful for this workload.

Not surprisingly, we found that caches are effective at reducing the processor traffic to memory. Our data support the rule of thumb that doubling the L2 cache size gives about half the benefit seen from the previous doubling. While larger caches are effective, this benefit is not without consequences. Coherence traffic, in the form of cache misses to dirty data in other processors' caches, increases as caches get bigger and as the number of processors increases. We find that the exclusive state of the four-state MESI cache coherence protocol is under-utilized, and could likely be omitted in favor of a simpler three-state protocol. Finally, multiprocessor scaling of this