Stanford EE 392C - Lecture 11 - On-line Profiling Techniques - D2906184


EE392C: Advanced Topics in Computer Architecture
Lecture #11
Polymorphic Processors
Stanford University
Handout Date ???

On-line Profiling Techniques
Lecture #11: Tuesday, 6 May 2003
Lecturer: Shivnath Babu, David Bloom, Rohit Gupta
Scribe: John Whaley, Jayanth Gummaraju

1 Introduction

On-line profiling refers to a technique for collecting run-time information about a program on the fly in order to decrease the execution time of the program. Information such as basic block frequencies, branch behavior, memory access patterns, and so on, is collected during run time. This information is used by a virtual machine to optimize the program on the fly.

The amount of hardware and software effort involved in using profile information can vary substantially depending on the implementation. On one end, the profile information can be used exclusively by a dynamic compiler to perform all the optimizations. On the other end, a dedicated coprocessor can use the profile information to reduce the execution time of the program. In this report, we discuss two papers that use profile information extensively. First, we discuss TEST [1], a Tracer for Extracting Speculative Threads in Hydra. TEST uses abundant hardware support to exploit the profile information. Second, we discuss Relational Profiling [2], which uses queries (assembly-like instructions) and relies largely on software support to optimize programs.

The rest of the report is organized as follows. Section 2 gives a brief summary of TEST and Section 3 presents a brief summary of Relational Profiling. Finally, Section 4 covers several issues about on-line profiling that were discussed during the class.

2 TEST: A Tracer for Extracting Speculative Threads

TEST (Tracer for Extracting Speculative Threads) provides a hardware mechanism for analyzing sequential programs with the goal of locating regions with potential for thread-level speculation (TLS).
This paper presents TEST and shows how it can be used with Hydra, a CMP with built-in TLS support, in Jrpm (the Java Runtime Parallelizing Machine) to provide on-line profile data that marks candidate regions of code for dynamic recompilation into speculative threads.

The current Jrpm system uses TEST to identify loop-level parallelism. The two main analyses it performs are load dependency analysis and speculative state overflow analysis. The load dependency analysis determines dependency arcs between loop iterations by comparing timestamps on stores and loads to determine whether a given STL (speculative thread loop) has dependencies on earlier threads. The speculative state overflow analysis determines whether a given STL would fit in the speculation hardware. Using the results of these two analyses, speculative threads are chosen based on greatest expected speedup and least likelihood of overflowing the speculation hardware.

The hardware implementation of TEST consists of three main components. First, the dynamic compiler must insert annotation instructions into the code, which allow important events to be communicated to the hardware banks. The second component is the hardware comparator banks, which contain the hardware that performs the timestamp comparisons for the critical arc and state overflow analyses and stores the results in counters. One comparator bank is used to trace one STL, and an array of comparator banks allows multiple STLs to be traced concurrently. Finally, the store buffers that hold writes during speculative execution are used during profiling to hold the timestamp values needed for analysis.

The paper found that the actual speedup achieved with the STLs chosen by TEST closely matched the predicted speedup. The relative speedup (rather than the absolute speedup) is what matters most when choosing threads to execute speculatively, and TEST did a good job of this in the benchmarks that were run on it.
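
The timestamp comparison behind the load dependency analysis can be illustrated in software. The following is a minimal Python sketch under simplifying assumptions, not the TEST hardware algorithm: the class name LoadDependencyTracer, its methods, and the integer timestamps are all hypothetical. It records the time of the last store to each address and flags a load as an inter-iteration dependency when that store happened before the current iteration began.

```python
class LoadDependencyTracer:
    """Hypothetical software stand-in for one comparator bank tracing one loop."""

    def __init__(self):
        self.store_time = {}      # address -> timestamp of the most recent store
        self.iteration_start = 0  # timestamp at which the current iteration began
        self.arcs = []            # (address, store timestamp, load timestamp)

    def start_iteration(self, now):
        # Assumed to be triggered by an annotation at the top of the loop body.
        self.iteration_start = now

    def store(self, addr, now):
        self.store_time[addr] = now

    def load(self, addr, now):
        stored_at = self.store_time.get(addr)
        # A store older than the start of this iteration came from an
        # earlier iteration, so this load depends on an earlier thread.
        if stored_at is not None and stored_at < self.iteration_start:
            self.arcs.append((addr, stored_at, now))


tracer = LoadDependencyTracer()
tracer.start_iteration(0)
tracer.store(0x10, 3)
tracer.load(0x10, 5)      # same iteration: no arc recorded
tracer.start_iteration(10)
tracer.load(0x10, 12)     # store at time 3 precedes this iteration: arc recorded
print(tracer.arcs)
```

In this toy model, a loop whose iterations record many arcs, or arcs whose loads occur early in the iteration, would be a poor candidate for speculative execution; how TEST weighs the arcs to estimate speedup is not modeled here.
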
The accuracy and predictability of TEST show promising results for the use of on-line profiling for extracting TLS.

3 Relational Profiling: Enabling Thread Level Parallelism in Virtual Machines

This paper discusses hardware techniques for profiling based on a co-designed virtual machine. The authors propose a relational profiling architecture (RPA), consisting of an assembly language and the required hardware, and a corresponding relational profiling model (RPM).

The RPM consists of two basic queries: instruction-based queries, where all events related to a certain instruction are recorded, and event-based queries, where all instructions related to a certain event are recorded. In addition, the authors propose to support hybrid queries. Each query, defined in the RPA assembly language, contains four pieces of information: the records of information to be collected, the rate of collection, the selection criteria applied to records, and the action to be taken. The type of information collected can be either architectural (PC, thread ID, operand values) or implementation-specific (fetch/dispatch/issue rates, latency, branch outcome). Actions communicate the information to the VM (e.g., through messages).

The hardware implementation includes a Profile Control Table, set by the underlying VM, that stores the PC of the query instruction and the information that is to be collected. The information collected from the processor pipeline is passed on to the Query Engine, which is itself a 4-stage pipeline that performs the comparisons and actions specified by the query.
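
The four parts of a query (records, rate, selection criteria, action) and the table lookup by PC can be modeled in software. The following is a minimal Python sketch of an instruction-based query, not the RPA assembly language, whose syntax is not reproduced in these notes. ProfileQuery, QueryEngine, control_table, and the event dictionary layout are hypothetical names, and applying the selection criterion before the sampling rate is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProfileQuery:
    pc: int                          # instruction the query is attached to
    fields: tuple                    # records: which event fields to collect
    sample_every: int                # rate: act on one of every N selected records
    select: Callable[[dict], bool]   # selection criterion applied to each record
    action: Callable[[dict], None]   # action: e.g. send a message to the VM
    seen: int = 0                    # selected records observed so far


class QueryEngine:
    """Hypothetical software model of the table lookup plus query evaluation."""

    def __init__(self):
        self.control_table = {}      # pc -> ProfileQuery, installed by the VM

    def install(self, query):
        self.control_table[query.pc] = query

    def observe(self, event):
        query = self.control_table.get(event["pc"])
        if query is None:
            return                   # instruction is not being profiled
        record = {name: event[name] for name in query.fields}
        if not query.select(record):
            return
        query.seen += 1
        if query.seen % query.sample_every == 0:
            query.action(record)


messages = []                        # stands in for messages delivered to the VM
engine = QueryEngine()
engine.install(ProfileQuery(pc=0x400, fields=("pc", "latency"), sample_every=2,
                            select=lambda r: r["latency"] > 10,
                            action=messages.append))
for latency in (5, 20, 30, 40, 50):
    engine.observe({"pc": 0x400, "latency": latency, "thread_id": 0})
print(messages)
```

An event-based query would key the table on an event type rather than a PC; that variant is omitted to keep the sketch short.
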
The limit on the number of instructions that can be profiled simultaneously is set by two factors: the number of interconnection networks (between the profiling pipeline and the 'main' superscalar pipeline) and the size of the profile buffers that store collected data.

The paper goes on to evaluate the ideal number of each of these elements. It concludes that 4 interconnect networks and 8 buffers meet the demand of most applications; with this many resources, profiling causes <2% stalls.

The strength of this paper is that it proposes an architecture that is easily extensible to more functions, simply by adding more functionality at the VM level. One drawback of this paper is that the allowed query sets limit the types of information that can be collected. Further, there is no dynamic mechanism to increase or decrease the frequency of profiles during the execution of a program (the authors suggest this as a future extension). Another possibility would be to consider using generic MIPS-style cores in place of the profiling
