Stanford EE 392C - Lecture 11 - On-line Profiling Techniques


EE392C: Advanced Topics in Computer Architecture
Lecture #11
Polymorphic Processors
Stanford University
Handout Date ???

On-line Profiling Techniques
Lecture #11: Tuesday, 6 May 2003
Lecturer: Shivnath Babu, David Bloom, Rohit Gupta
Scribe: John Whaley, Jayanth Gummaraju

1 Introduction

On-line profiling refers to a technique for collecting run-time information about a program on the fly in order to decrease its execution time. Information such as basic block frequencies, branch behavior, and memory access patterns is collected during run time. This information is used by a virtual machine to optimize the program on the fly.

The amount of hardware and software effort involved in using profile information can vary substantially depending on the implementation. At one end, the profile information can be used exclusively by a dynamic compiler to perform all the optimizations. At the other end, a dedicated coprocessor can use the profile information to reduce the execution time of the program. In this report, we discuss two papers that use profile information extensively. First, we discuss TEST [1], a Tracer for Extracting Speculative Threads in Hydra. TEST uses abundant hardware support to exploit the profile information. Second, we discuss Relational Profiling [2], which uses queries (assembly-like instructions) and largely software support to optimize programs.

The rest of the report is organized as follows. Section 2 gives a brief summary of TEST and Section 3 presents a brief summary of Relational Profiling. Finally, Section 4 discusses several issues about on-line profiling that were raised during the class.

2 TEST: A Tracer for Extracting Speculative Threads

TEST (Tracer for Extracting Speculative Threads) provides a hardware mechanism for analyzing sequential programs with the goal of locating regions with potential for thread-level speculation (TLS).
This paper presents TEST and shows how it can be used with Hydra, a CMP with built-in TLS support, in Jrpm (the Java Runtime Parallelizing Machine) to provide on-line profile data that marks candidate regions of code for dynamic recompilation into speculative threads.

The current Jrpm system uses TEST to identify loop-level parallelism. The two main analyses it performs are load dependency analysis and speculative state overflow analysis. The load dependency analysis determines dependency arcs between loop iterations by comparing timestamps on stores and loads to determine whether a given STL (speculative thread loop) has dependencies on earlier threads. The speculative state overflow analysis determines whether a given STL would fit in the speculation hardware. Using the results of these two analyses, speculative threads are chosen based on greatest expected speedup and least likelihood of overflowing the speculation hardware.

The hardware implementation of TEST consists of three main components. First, the dynamic compiler must insert annotation instructions into the code, which allow important events to be communicated to the hardware banks. The second component is the hardware comparator banks, which contain the hardware to perform the timestamp comparisons that calculate the critical arcs, perform the state overflow analysis, and store the results in counters. One comparator bank is used to trace one STL, and an array of comparator banks allows multiple STLs to be traced concurrently. Finally, the store buffers that hold writes during speculative execution are used during profiling to hold the timestamp values needed for the analysis.

This paper found that the actual speedup achieved with the STLs chosen by TEST closely matched the predicted speedup.
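
The load dependency analysis summarized above can be illustrated in software. The following is only a minimal sketch under stated assumptions: TEST performs these comparisons in hardware comparator banks, whereas here the trace format, the Event record, and the function name find_dependency_arcs are all hypothetical and introduced purely for illustration. The sketch remembers the most recent store to each address and reports an arc whenever a later loop iteration loads from that address.

```python
from collections import namedtuple

# Hypothetical trace record (not from the paper): a timestamp, the loop
# iteration that issued the access, the operation, and the address.
Event = namedtuple("Event", "time iteration op addr")


def find_dependency_arcs(trace):
    """Return (store_event, load_event) pairs in which a load reads an
    address whose most recent store came from an earlier loop iteration."""
    last_store = {}  # addr -> most recent store Event seen so far
    arcs = []
    for ev in trace:
        if ev.op == "store":
            last_store[ev.addr] = ev
        elif ev.op == "load":
            st = last_store.get(ev.addr)
            # Only stores from an earlier iteration form an inter-iteration arc.
            if st is not None and st.iteration < ev.iteration:
                arcs.append((st, ev))
    return arcs


trace = [
    Event(0, 0, "load", 0x10),
    Event(1, 0, "store", 0x10),
    Event(2, 0, "store", 0x20),
    Event(3, 1, "load", 0x10),   # reads the value stored by iteration 0
    Event(4, 1, "store", 0x10),
    Event(5, 1, "load", 0x30),   # address never stored: no arc
]
arcs = find_dependency_arcs(trace)
for st, ld in arcs:
    print(f"arc: store@t={st.time} (iter {st.iteration}) -> "
          f"load@t={ld.time} (iter {ld.iteration})")
```

In this toy trace the only arc is the store at time 1 feeding the load at time 3. How TEST turns such arcs into a speedup estimate is not covered by this sketch.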
The relative speedup is most important (rather than the absolute speedup) when choosing threads to execute speculatively, and TEST did a good job of this in the benchmarks that were run on it. The accuracy and predictability of TEST show promising results for the use of on-line profiling for extracting TLS.

3 Relational Profiling: Enabling Thread Level Parallelism in Virtual Machines

This paper discusses hardware techniques for profiling based on a co-designed virtual machine. The authors propose a relational profiling architecture (RPA), consisting of an assembly language and the required hardware, and a corresponding relational profiling model (RPM).

The RPM consists of two basic queries: instruction-based queries, where all events related to a certain instruction are recorded, and event-based queries, where all instructions related to a certain event are recorded. In addition, the authors propose to support hybrid queries. Each query, defined in the RPA assembly language, contains four pieces of information: the records of information to be collected, the rate of collection, the selection criteria applied to records, and the action to be taken. The type of information collected can be either architectural (PC, thread ID, operand values) or implementation-level (fetch/dispatch/issue rates, latency, branch outcome). Actions communicate the information to the VM (e.g. through messages).

The hardware implementation includes a Profile Control Table, set by the underlying VM, that stores the PC of the query instruction and the information that is to be collected. The information collected from the processor pipeline is passed on to the Query Engine, which is itself a 4-stage pipeline that performs the comparisons and actions specified by the query.
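
The four parts of a query (records, rate, selection criteria, action) can be pictured with a small software model. This is a sketch only, not the RPA assembly language or the Query Engine hardware: the Query class, its field names, the dictionary event format, and the choice to apply the rate before the selection test are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Record = Dict[str, int]


@dataclass
class Query:
    """Hypothetical software stand-in for one profiling query."""
    fields: Tuple[str, ...]           # which record fields to collect
    rate: int                         # consider every `rate`-th observed event
    select: Callable[[Record], bool]  # selection criterion applied to the event
    action: Callable[[Record], None]  # what to do with a selected record
    seen: int = 0                     # number of events observed so far

    def observe(self, event: Record) -> None:
        self.seen += 1
        # Assumption: sampling by rate happens before the selection test.
        if self.seen % self.rate != 0:
            return
        if self.select(event):
            # Forward only the requested fields, e.g. as a message to the VM.
            self.action({name: event[name] for name in self.fields})


messages = []  # stands in for messages delivered to the VM
query = Query(
    fields=("pc", "latency"),
    rate=2,
    select=lambda e: e["latency"] > 10,
    action=messages.append,
)
events = [
    {"pc": 0x400, "thread": 0, "latency": 50},  # 1st event: skipped by rate
    {"pc": 0x400, "thread": 0, "latency": 3},   # 2nd: sampled, fails selection
    {"pc": 0x404, "thread": 0, "latency": 7},   # 3rd: skipped by rate
    {"pc": 0x404, "thread": 1, "latency": 20},  # 4th: sampled and selected
]
for e in events:
    query.observe(e)
print(messages)
```

Here only the fourth event reaches the action, and only its pc and latency fields are forwarded. Keeping the selection and action as plain callables mirrors the point made in the notes that extra functionality can be added at the VM level.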
The limit on the number of instructions that can be profiled simultaneously is set by two factors: the number of interconnection networks (between the profiling pipeline and the 'main' superscalar pipeline) and the size of the profile buffers that store collected data.

The paper goes on to evaluate the ideal number of each of these elements. It concludes that 4 interconnect networks and 8 buffers meet the demand of most applications; with this many resources, profiling causes <2% stalls.

The strength of this paper is that it proposes an architecture that is easily extensible to more functions, simply by adding more functionality at the VM level. The drawback is that the allowed query sets limit the types of information that can be collected. Further, there is no dynamic mechanism to increase or decrease the frequency of profiles during the execution of a program (the authors suggest this as a future extension). Another possibility would be to consider using generic MIPS-style cores in place of the profiling

