
CARNEGIE MELLON
Department of Electrical and Computer Engineering

Effectiveness and Limitations of Embedded Counter Based Performance Analysis

Carey K. Kloss
1997
Advisor: Prof. Shen

Carey Kloss, John Paul Shen

Abstract

This paper presents an experimental study on the performance of the Intel Pentium Pro microprocessor using embedded performance counters. The counters enable detailed run-time analysis of branching and memory subsystem performance, and are accessed through a custom-designed tool. The study uses Windows NT and realistic benchmarks including BAPco's Sysmark32 suite, the Ziff-Davis Winstone97 PC Benchmarks, and selected SPEC95 integer benchmarks. The results show that the Pentium Pro memory subsystem and branch prediction perform well, but the UOP per cycle (UPC) and IPC numbers are low. The Pentium Pro counters cannot record enough information about the core of the CPU to explain the low UPC, but we conjecture that data hazards between instructions and UOPs cause the poor performance. A wish list of counter events that would enable a more complete analysis is provided.

1.0 Introduction

As microprocessors continue to develop from application-specific machines to enormous general-purpose systems, comprehensive performance analysis is becoming increasingly important. Post-silicon performance analysis is crucial to the design of future generation microprocessors. Existing silicon allows the architect to run vast numbers of instructions through real systems, providing true run-time execution of application software.
However, post-silicon analysis has always been limited by the lack of visibility into the architecture provided by the pins.

In the past, architects used logic analyzers to gather traces of pin behavior between the microprocessor and memory or between modules in multiple-chip microprocessors. This method seemed to provide insight into the behavior of systems, but left more questions than answers [Clark85, Clark88]. It also misled the architect during analysis of systems, because the traces gave no visibility into the causes of the observed behavior. More recent research addresses the visibility problem by using detailed microarchitecture simulators to reach into the behavior of microprocessors [Diep95]. Unfortunately, these simulators are severely limited by the number of instructions that can be simulated in a given amount of time, and it has also been shown that these microarchitecture models are not as accurate as the designer may suspect [Black96]. None of these methods can provide a complete analysis of post-silicon microprocessors.

The current generation of microprocessors provides event counters built directly into the hardware. These counters allow the architect to count events inside the microarchitecture, providing great insight into the dynamic behavior of a software application on the hardware. [Cvet94, Chert96] used such on-chip event counters to successfully measure operating system impacts on hardware and memory subsystem behavior.

This paper discusses the use of these counters on an Intel Pentium Pro system. We propose and walk through a detailed analysis methodology based on the event counters, and present count results that reveal the bottlenecks of the Pentium Pro beyond the memory subsystem. Section 2.0 starts the discussion with a description of the hardware platform used in this analysis. It includes a description of the Pentium Pro, its counters, and the tool we developed to gain access to these counters.
Section 3.0 outlines the different benchmarks used in this analysis and the motivations for their selection. Section 4.0 presents the results of our counter analysis, and Section 5.0 discusses an ideal set of counters that would allow a more complete microarchitectural analysis. Section 6.0 concludes with a discussion of the effectiveness of this type of microarchitectural analysis, and a discussion of the bottlenecks we found in the Pentium Pro microprocessor.

2.0 Hardware Platform

This work utilizes the Intel Pentium Pro microprocessor and its built-in event counters. A brief description of the Pentium Pro is provided in Section 2.1. For a more detailed description refer to [Colwell96], the Intel web site [Intel96a] or the Pentium Pro user manual [Intel96b]. The hardware test platform is a 200 MHz Pentium Pro PC with 64MB of non-interleaved 60ns fast page mode RAM, a 1.6GB EIDE disk drive, and an ATI WinTurbo graphics card with 2MB of VRAM. I developed a custom Windows NT application to control counter access, and it is discussed in Section 2.3. Section 2.2 outlines the counter events that are used in the analysis in Section 4.0, and Section 2.4 discusses some caveats found in the data collection process.

2.1 The Pentium Pro

The Pentium Pro microprocessor utilizes a superpipelined superscalar microarchitecture. It sandwiches an out-of-order execution core between in-order fetch and in-order completion stages. Figure 1 provides a high-level illustration of the microarchitecture.

[Figure 1: Pentium Pro Microarchitecture. Block diagram not recoverable from the extracted text; legible labels include L2, DCU, Bus Interface, IFU (Inst. Fetch) with L1 I Cache, Inst. Decode Unit (Alignment, Insts -> Uops, Register Alloc, BTB), SAGU, LAGU, INT & FP, and ROB & RRF (In-Order Completion).]

The Pentium Pro has several units that sequence instructions through the microarchitecture:

• Instruction Fetch Unit (IFU): The IFU includes an 8KB Level 1 instruction cache and next fetch address generation hardware.
  It fetches one cache line at a time and sends it to the instruction decode unit.

• Instruction Decode Unit (IDU): The instruction decode unit handles all branch prediction, resource allocation (rename registers, reservation station slots, and reorder buffer entries), and decodes the x86 instructions into RISC-like micro-ops (UOPs). The IDU is capable of decoding three simple x86 instructions simultaneously or one
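
The UOP-per-cycle (UPC) and IPC figures mentioned in the abstract are ratios of counter deltas taken over the same interval. The following is a minimal sketch of that arithmetic, not the tool described in this paper: the `CounterSample` layout, the function names, and the sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """Counter deltas over one measurement interval (hypothetical layout)."""
    cycles: int                # clock cycles elapsed in the interval
    instructions_retired: int  # x86 instructions retired in the interval
    uops_retired: int          # micro-ops (UOPs) retired in the interval


def _per_cycle(count: int, cycles: int) -> float:
    """Divide an event count by the cycle count, rejecting empty intervals."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return count / cycles


def ipc(sample: CounterSample) -> float:
    """x86 instructions retired per cycle."""
    return _per_cycle(sample.instructions_retired, sample.cycles)


def upc(sample: CounterSample) -> float:
    """UOPs retired per cycle."""
    return _per_cycle(sample.uops_retired, sample.cycles)


def uops_per_instruction(sample: CounterSample) -> float:
    """Average number of UOPs each retired x86 instruction decoded into."""
    if sample.instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return sample.uops_retired / sample.instructions_retired


if __name__ == "__main__":
    # Illustrative numbers only; these are not measurements from the paper.
    sample = CounterSample(cycles=1_000_000,
                           instructions_retired=800_000,
                           uops_retired=1_200_000)
    print(f"IPC = {ipc(sample):.2f}")
    print(f"UPC = {upc(sample):.2f}")
    print(f"UOPs per instruction = {uops_per_instruction(sample):.2f}")
```

Keeping the three ratios separate makes it easier to see whether a low IPC comes from few UOPs retiring per cycle or from instructions that expand into many UOPs.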

