Rice ELEC 525 - Process Switches and Branch Prediction Accuracy - D462455

Home> Schools> Rice University> Dissertation/Thesis Submission (ELEC) > ELEC 525> Process Switches and Branch Prediction Accuracy

Rice ELEC 525 - Process Switches and Branch Prediction Accuracy

School name Rice University

Course Elec 525- Advanced Microprocessor Architecture

Pages 9

Download Save

Unformatted text preview:

- 1 - Process Switches and Branch Prediction Accuracy David Chen, Bennett Lau, Jeffrey Shafer Department of Electrical and Computer Engineering Rice University Abstract There were several hypotheses that motivated this research project. First, we proposed that running multiple processes pollutes the history in branch predictors due to aliasing. This pollution decreases accuracy. Second, we argued that branch prediction accuracy can be improved over existing techniques by removing this aliasing. Specifically, the history table can be partitioned to allow each process, or a subset of the most frequently executed processes, to have their own history. This project successfully shows that aliasing in the branch predictor history does occur due to multiple processes executing, and quantifies the frequency of that problem. The accuracy of existing predictors can be improved, and adding separate history tables is an effective method of removing the aliasing problem. Specifically, a prediction improvement of 0.5 – 3% can be obtained through this technique, and this improvement is always in addition to any prediction algorithm that suffers from aliasing. The actual accuracy penalty due to aliasing is much higher than this improvement would seem to suggest. The relative infrequency of process switches, however, and the short time of approximately 2000 branches to “repair” the destructive aliasing in the branch history table after a process switch, both combine to significantly limit the potential for improvement via this technique. When the hardware overhead of multiple history tables is taken into account, this technique is not recommended for general-purpose applications 1. Introduction The branch predictor accuracy is a significant component of overall performance in modern deeply pipelined superscalar machines. The penalty of a mispredicted branch can be quite high [6]. Most existing research into branch predictors, however, examines their performance in single-process programs only. As computation moves increasingly to multi-threaded applications, it is unclear if the existing research is applicable. Do multiple threads compete over the branch predictor and interfere with each other? How severe is this effect? Does a method exist to alleviate this problem? 2. Existing Work Previous research into branch predictors defines some useful terminology. First, aliasing, as described by [9], is central to this research project. Analogous to memory caches, aliasing in branch prediction occurs when different branches are assigned to the same history counter. These instruction streams can be from the same process or different processes altogether. Aliasing is not a permanent effect. If one branch executes for a thousand times, and then another branch aliases to its same spot in the history table and also executes a thousand times, the overall effect on prediction accuracy will be very limited, as it will only take a few cycles (depending on the depth of the history) for the new branch to “train” the predictor with the new pattern. This aliasing can either be constructive, destructive, or neutral, depending if the new pattern alters the final prediction of branch taken or branch not taken. [9] showed that destructive aliasing is significantly more common than constructive aliasing, particularly if the instruction stream is large or the branch history table is small. However, this analysis was not performed for independent processes. Rather, it was done on different segments of the instruction stream from the same process, so it is unclear if the behavior will be similar. Related to aliasing, [2] defined training overhead. This overhead refers to the fact that history counters must be “primed” to deliver the correct prediction by first observing a few repetitions of the branch. This concept is relevant because each process switch incurs a new training overhead for the branch predictor. Branch predictor configurations for simultaneous multithreaded processors were explored in [5]. In addition to a traditional shared predictor, they examined providing separate history registers or branch history tables for each thread. They concluded that providing only a separate history register for each thread increased the prediction accuracy by eliminating harmful aliasing between execution threads, and minimized the increase in hardware requirements that completely independent predictors would require. Further, this configuration with- 2 - independent predictors increased performance even when the threads were executing the same code. Unfortunately, while the branch prediction accuracy was increased, the overall system performance was only marginally affected, because the thread-level parallelism inherent in the SMT design was effective in hiding the negative effects of branch misprediction. [2] found that including kernel references increased the branch predictor history aliasing. This effect could alter the conclusions of past studies of prediction accuracy, such that algorithms with short branch histories would now performance better than algorithms with long histories, so that the effect of aliasing would be more rapidly overcome through training. 3. Multiple Process Instruction Traces Because most existing literature on branch predictors only used single-process instruction traces, this study was conducted with multi-process instruction traces. To obtain the desired instruction traces of multi-process applications and the host operating system, the Simics simulator from Virtuatech was used. Simics is a detailed functional simulator for multiple processor architectures, including x86, PowerPC, Alpha, and Itanium. In addition to simulating the microprocessor, Simics also emulates a core subset of peripheral devices, such as the BIOS, video card, network card, and disk drive. Thus, it is possible to install and run unmodified operating systems such as Linux or Windows on Simics. Because the entire system is simulated, and thus the simulation rate can be dynamically altered, Simics enables non-intrusive profiling of the instruction stream and memory access stream without concern that the monitoring overhead will distort the data capture [3]. Simics includes a standard module that can capture an instruction stream during program execution. This stream mixes both user-level and kernel-level code, as well as multiple processes. For this project, the goal was to extend

View Full Document


School:
Email:
New Password:
Confirm Password:

Rice ELEC 525 - Process Switches and Branch Prediction Accuracy

Sign up for free to view:

Please select your school