
Effectiveness and Limitations of Embedded Counter Based Performance Analysis




Carnegie Mellon, Department of Electrical and Computer Engineering
Carey K. Kloss, John Paul Shen
1997
Advisor: Prof. Shen

Abstract

This paper presents an experimental study on the performance of the Intel Pentium Pro microprocessor using embedded performance counters. The counters enable detailed run-time analysis of branching and memory subsystem performance and are accessed through a custom-designed tool. The study uses Windows NT and realistic benchmarks, including BAPCo's Sysmark32 suite, the Ziff-Davis Winstone 97 PC Benchmarks, and selected SPEC95 integer benchmarks. The results show that the Pentium Pro memory subsystem and branch prediction are performing well, but the UOP per cycle (UPC) and IPC numbers are low. The Pentium Pro counters are not able to record enough information about the core of the CPU to analyze the reasons behind the low UPC, but we make a conjecture that the data hazards present between instructions and UOPs are the cause of the poor performance. A wish list of counter events that would enable a more complete analysis is provided.

1.0 Introduction

As microprocessors continue to develop from application-specific machines to enormous general-purpose systems, comprehensive performance analysis is becoming increasingly more important. Post-silicon performance analysis is crucial to the design of future-generation microprocessors. Existing silicon allows the architect to run vast numbers of instructions through real systems, providing true run-time execution of application software. However, post-silicon analysis has always been limited by the lack of visibility into the architecture provided by the pins. In the past, architects used logic analyzers to gather traces of pin behavior between the microprocessor and memory, or between modules in multiple-chip
microprocessors. This method seemed to provide insight into the behavior of systems but left more questions than answers [Clark85, Clark88]. It also misled the architect during analysis of systems due to a lack of visibility into the explanations of the observed behavior. More recent research addresses the visibility problem by using detailed microarchitecture simulators to reach into the behavior of microprocessors [Diep95]. Unfortunately, these simulators are severely limited by the number of instructions that can be simulated in a given amount of time, and it has also been shown that these microarchitecture models are not as accurate as the designer may suspect [Black96]. None of these methods can provide a complete analysis of post-silicon microprocessors.

The current generation of microprocessors provides event counters built directly into the hardware. These counters allow the architect to count events inside the microarchitecture, providing great insight into the dynamic behavior of a software application on the hardware [Cvet94]. [Chen96] used such on-chip event counters to successfully measure operating system impacts on hardware and memory subsystem behavior.

This paper discusses the use of these counters on an Intel Pentium Pro system. We propose and walk through a detailed analysis methodology based on the event counters and present count results that reveal the bottlenecks of the Pentium Pro beyond the memory subsystem.

Section 2.0 starts the discussion with a description of the hardware platform used in this analysis. It includes a description of the Pentium Pro, its counters, and the tool we developed to gain access to these counters. Section 3.0 outlines the different benchmarks used in this analysis and the motivations for their selection. Section 4.0 presents the results of our counter analysis, and Section 5.0 discusses an ideal set of counters that would allow a more complete microarchitectural analysis. Section 6.0 concludes with a discussion of the
effectiveness of this type of microarchitectural analysis and a discussion of the bottlenecks we found in the Pentium Pro microprocessor.

2.0 Hardware Platform

This work utilizes the Intel Pentium Pro microprocessor and its built-in event counters. A brief description of the Pentium Pro is provided in Section 2.1. For a more detailed description, refer to [Colwell96], the Intel website [Intel96a], or the Pentium Pro user manual [Intel96b]. The hardware test platform is a 200 MHz Pentium Pro PC with 64 MB of non-interleaved 60 ns fast page mode RAM, a 1.6 GB EIDE disk drive, and an ATI WinTurbo graphics card with 2 MB of VRAM. I developed a custom Windows NT application to control counter access, and it is discussed in Section 2.3. Section 2.2 outlines the counter events that are used in the analysis in Section 4.0, and Section 2.4 discusses some caveats found in the data collection process.

2.1 The Pentium Pro

The Pentium Pro microprocessor utilizes a superpipelined superscalar microarchitecture. It sandwiches an out-of-order execution core between in-order fetch and in-order completion stages. Figure 1 provides a high-level illustration of the microarchitecture.

Figure 1: Pentium Pro Microarchitecture (block diagram not reproduced; recoverable labels include DCU, L2, Bus Interface, IFU, Inst Decode Unit, Register Alloc, BTB, ROB, RRF, and In-Order Completion)

The Pentium Pro has several units that sequence instructions through the microarchitecture:

Instruction Fetch Unit (IFU): The IFU includes an 8 KB Level 1 instruction cache and next-fetch-address generation hardware. It fetches one cache line at a time and sends it to the instruction decode unit.

Instruction Decode Unit (IDU): The instruction decode unit handles all branch prediction and resource allocation (rename registers, reservation station slots, and reorder buffer entries), and decodes the x86 instructions into RISC-like micro-ops (UOPs). The IDU is capable of decoding three simple x86 instructions simultaneously, or one
complex instruction and two simple ones. If a simple decoder receives a complex instruction, there is a stall. Branch prediction is performed by a 512-entry two-level adaptive branch target buffer (BTB). If a branch is mispredicted, there is a penalty of 9 to 15 cycles while the CPU is flushed. If a branch is taken and predicted correctly, there is a one-cycle penalty.

Reservation Station (RS): The centralized reservation station can store 20 UOPs and has 5 read and 5 write ports that support multiple functional
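
The branch penalties quoted above (9 to 15 cycles per misprediction, one cycle per correctly predicted taken branch) can be turned into a rough bound on cycles lost to branching. The sketch below is only an illustration of that arithmetic: the function names and the example counts are hypothetical, not taken from the paper or its counter tool, and it assumes the penalties simply add up with no overlap with other stalls. A second helper shows how a per-cycle rate such as IPC or UPC would follow from two raw counts.

```python
def branch_penalty_bounds(mispredicted, taken_correct,
                          min_flush=9, max_flush=15, taken_penalty=1):
    """Return (low, high) bounds on cycles lost to branches.

    Hypothetical helper: the inputs stand for raw event counts of
    mispredicted branches and correctly predicted taken branches.
    The default penalties are the figures quoted in the text.
    """
    if mispredicted < 0 or taken_correct < 0:
        raise ValueError("event counts must be non-negative")
    # Every correctly predicted taken branch costs a fixed penalty.
    base = taken_correct * taken_penalty
    # Each misprediction costs somewhere between min_flush and max_flush cycles.
    low = base + mispredicted * min_flush
    high = base + mispredicted * max_flush
    return low, high


def per_cycle(events, cycles):
    """Return events per clock cycle (for example IPC or UPC)."""
    if cycles <= 0:
        raise ValueError("cycle count must be positive")
    return events / cycles


if __name__ == "__main__":
    # Made-up counts, purely to show the calling convention.
    low, high = branch_penalty_bounds(mispredicted=1000, taken_correct=5000)
    print(f"cycles lost to branches: between {low} and {high}")
    print(f"rate: {per_cycle(150, 100):.2f} events per cycle")
```

Dividing either bound by the total cycle count would give the fraction of run time attributable to branch penalties under these simplifying assumptions.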

