UCF CGS 3269 - Microprocessors


How A Pentium II Microprocessor Works

In older microprocessor designs, the processor chip worked single-mindedly. It read an instruction from memory, carried it out step by step, and then advanced to the next instruction. Each step required at least one clock cycle, so executing an entire instruction took many clock cycles. Pipelining overlaps these steps so that several instructions are “in execution” during the same clock cycle, each at a different stage. In the paragraph and example below, we’ll illustrate how pipelining works.

Before examining CISC and RISC architectures in detail, we’ll first examine how Pentium II and III processors work and then look at some of the techniques they utilize in isolation.

In order to follow the next twelve numbered paragraphs better, look at the picture of the Pentium II microprocessor in Appendix G. The Pentium III processor operates in essentially the same fashion; the primary difference is the addition of a second FPU dedicated to the MMX commands and the Streaming SIMD Extensions (SSE).

1. The Pentium II microprocessor contains approximately 7.5 x 10^6 transistors in the CPU and approximately 15 x 10^6 transistors in the separate L2 cache. This L2 cache contains 512 KB of storage built from off-the-shelf components. The L2 cache is not part of the CPU but is built onto the same circuit board as the CPU. This circuit board (called a Multi-Chip Module or MCM) then plugs directly into the motherboard of the computer. While this design is cheaper to manufacture, the trade-off is that data flows between the L2 cache and the CPU at only half the speed of the CPU. Thus if the CPU is clocked at 400 MHz, data travels between the L2 cache and the CPU at only 200 MHz. This speed disadvantage is somewhat compensated for by doubling the L1 cache from 8 KB to 16 KB in each of the I-cache and D-cache. This larger cache cuts roughly in half the time it takes to access memory, and provides faster access to the most recently used data and instructions.

2. Since the CPU is limited to moving data in and out of the CPU at the speed of the main data bus (the front side bus in Intel nomenclature), the Pentium II extends the design philosophy begun with the Pentium Pro: the L1 and L2 caches are designed to alleviate the effects of the bus bottleneck by minimizing the number of instances in which a clock cycle passes without the processor being able to complete an operation (blocked by the slowness of the bus). The Pentium II extends this philosophy primarily by doubling the size of the L1 cache.

3. Information enters the CPU through the Bus Interface Unit (BIU). The BIU duplicates the information, sending one copy to the pair of L1 caches and one copy to the L2 cache. If the incoming information is an instruction, it is sent to the I-cache; if it is data for an instruction, it is sent to the D-cache.

4. While the Fetch/Decode Unit is pulling instructions from the I-cache, another component called the Branch Target Buffer (BTB) determines whether a particular instruction has been used before by comparing the incoming code with a record maintained in the separate look-aside buffer. The BTB is looking in particular for instructions that involve branching (where the program being executed can take different possible paths). If the BTB finds a branch instruction, it predicts which path the program will take based upon what the program has done at similar branches. Intel's branch predictor units maintain a successful prediction rate of better than 90%.

5. The fetch portion of the Fetch/Decode Unit continues to pull instructions (16 bytes at a time) from the cache in the order predicted by the BTB. Then three decoders working in parallel break up the more complex instructions into uops (micro-operations) that the Dispatch/Execution Unit can process faster than the more complex instructions resident in the I-cache. Note that the three decoders are not identical units: two are called restricted decoders and can only decode CISC instructions that each translate into a single uop; the other unit is called a general decoder (or complex decoder) and can handle CISC instructions that translate into four or fewer uops. All CISC instructions which reference memory must be decoded by the general decoder. If a CISC instruction involves more than four uops, it is sent to a special microcode instruction sequencer (MIS unit), which is not shown on our diagram. Programs that make many memory references tend to frustrate the multi-part decoder scheme of the Pentium II processor. Operating at maximum speed with code optimized for them, the three decoders can generate 6 uops/clock cycle (one from each restricted decoder and four from the general decoder); the average for all code is about three uops/clock cycle. All uops in this architecture are 118 bits long.

6. The decode unit sends all uops to the ReOrder Buffer (ROB) (also called the Instruction Pool). This is a circular buffer, with a head and a tail, that contains the uops in the order in which the BTB predicted that they would be needed. The ROB can store up to 40 entries, each 254 bits long. Each entry in the ROB contains the 118-bit uop plus two operands, the result, and processor information that the uop might affect (status bits). The ROB can prepare up to three uops/clock cycle for processing. All register renaming is handled by the ROB. There are 40 such registers in the Pentium II (not shown on the diagram).

7. As the decode unit passes uops to the ROB, it also sends them to a special unit called the Reservation Station (RS) (not shown on the diagram, but think of it as inside the fetch/decode unit). The RS serves two purposes: (1) it is the conduit that passes uops to a suitable execution unit as one becomes available, and (2) it acts as another buffer, storing up to 20 uops and their data. This buffering effect prevents slowdowns in the decoders from starving the execution units and also prevents the decoders from stalling when the execution units are fully engaged. The RS connects through five ports to six execution stations that actually carry out the manipulations.

8. The Dispatch/Execute Unit checks each uop in the ROB to see if it has all
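
The pipelining idea from the introduction can be made concrete with a little arithmetic. The sketch below is only an illustration: the four stage names (IF, ID, EX, WB) and the assumption that every stage takes exactly one clock cycle with no stalls are hypothetical and are not taken from the Pentium II, whose real pipeline is deeper and more irregular.

```python
def cycles_unpipelined(num_instructions: int, stages: int) -> int:
    """Each instruction runs all of its stages before the next one starts."""
    return num_instructions * stages


def cycles_pipelined(num_instructions: int, stages: int) -> int:
    """Once the pipeline is full, one instruction finishes every cycle."""
    if num_instructions == 0:
        return 0
    return stages + (num_instructions - 1)


def pipeline_diagram(num_instructions: int, stage_names: list[str]) -> list[str]:
    """Return one text row per instruction showing its stage in each cycle."""
    total = cycles_pipelined(num_instructions, len(stage_names))
    rows = []
    for i in range(num_instructions):
        cells = []
        for cycle in range(total):
            stage = cycle - i  # instruction i enters the pipeline at cycle i
            in_flight = 0 <= stage < len(stage_names)
            cells.append(stage_names[stage] if in_flight else "--")
        rows.append(f"I{i}: " + " ".join(cells))
    return rows


if __name__ == "__main__":
    stages = ["IF", "ID", "EX", "WB"]  # hypothetical 4-stage pipeline
    print("unpipelined cycles:", cycles_unpipelined(5, len(stages)))
    print("pipelined cycles:  ", cycles_pipelined(5, len(stages)))
    print("\n".join(pipeline_diagram(5, stages)))
```

Reading the diagram column by column shows the point of the technique: in the middle cycles every stage is busy with a different instruction, so the chip completes roughly one instruction per cycle even though each individual instruction still takes several cycles.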

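Paragraph 4 says only that the BTB predicts a branch from what the program did at similar branches; it does not describe the mechanism. As a rough illustration, the sketch below uses a common textbook scheme, a 2-bit saturating counter per branch address. This is an assumption swapped in for illustration, not Intel's actual predictor, and the class name, the starting counter value, and the loop trace are all hypothetical.

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counters: 0-1 predict not taken, 2-3 taken."""

    def __init__(self) -> None:
        self.counters: dict[int, int] = {}  # branch address -> counter in 0..3

    def predict(self, address: int) -> bool:
        # Unseen branches start at 1 ("weakly not taken"), an arbitrary choice.
        return self.counters.get(address, 1) >= 2

    def update(self, address: int, taken: bool) -> None:
        """Move the counter one step toward the observed outcome."""
        counter = self.counters.get(address, 1)
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
        self.counters[address] = counter


def accuracy(predictor: TwoBitPredictor, trace: list[tuple[int, bool]]) -> float:
    """Fraction of branches in the trace that were predicted correctly."""
    if not trace:
        return 0.0
    correct = 0
    for address, taken in trace:
        if predictor.predict(address) == taken:
            correct += 1
        predictor.update(address, taken)  # learn from the real outcome
    return correct / len(trace)


if __name__ == "__main__":
    # Hypothetical trace: a loop branch at 0x400 that is taken nine times
    # and then falls through, repeated ten times.
    trace = [(0x400, i % 10 != 9) for i in range(100)]
    print(f"prediction accuracy: {accuracy(TwoBitPredictor(), trace):.2f}")
```

Because the counter needs two wrong guesses in a row to change its prediction, a single loop exit does not make it forget that the branch is usually taken, which is why even this simple scheme does well on loop-heavy code.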

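The circular buffer with a head and a tail described in paragraph 6 can be sketched as follows. The 40-entry size comes from the text. Everything else is an assumption made for illustration: the class and method names are hypothetical, entries hold only a uop name and a "done" flag rather than the 254-bit layout, and the rule that finished uops leave the buffer oldest first, at most three per cycle, is not stated in the passage above.

```python
from dataclasses import dataclass
from typing import Optional

ROB_SIZE = 40  # entry count taken from the text


@dataclass
class RobEntry:
    uop: str
    done: bool = False  # set once an execution unit has produced the result


class ReorderBuffer:
    """Fixed-size circular buffer: allocate at the tail, retire from the head."""

    def __init__(self, size: int = ROB_SIZE) -> None:
        self.slots: list[Optional[RobEntry]] = [None] * size
        self.head = 0  # oldest uop still in flight
        self.tail = 0  # next free slot
        self.count = 0

    def allocate(self, uop: str) -> Optional[int]:
        """Append a uop in predicted order; return its slot, or None if full."""
        if self.count == len(self.slots):
            return None
        index = self.tail
        self.slots[index] = RobEntry(uop)
        self.tail = (self.tail + 1) % len(self.slots)  # wrap around
        self.count += 1
        return index

    def complete(self, index: int) -> None:
        """Mark a uop as executed; this may happen out of order."""
        entry = self.slots[index]
        if entry is None:
            raise ValueError(f"slot {index} is empty")
        entry.done = True

    def retire(self, limit: int = 3) -> list[str]:
        """Assumption: remove up to `limit` finished uops, oldest first."""
        retired = []
        while self.count and len(retired) < limit:
            entry = self.slots[self.head]
            if entry is None or not entry.done:
                break  # the oldest uop is unfinished, so younger ones wait
            retired.append(entry.uop)
            self.slots[self.head] = None
            self.head = (self.head + 1) % len(self.slots)
            self.count -= 1
        return retired


if __name__ == "__main__":
    rob = ReorderBuffer(size=4)
    load, add, store = (rob.allocate(n) for n in ("load", "add", "store"))
    rob.complete(add)    # "add" finishes before the older "load"
    print(rob.retire())  # nothing leaves yet: "load" is still pending
    rob.complete(load)
    print(rob.retire())  # "load" and "add" leave together, in order
```

The head and tail indices wrap around with the modulo operator, so the same fixed block of slots is reused indefinitely without moving any entries.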