Alpha 21264 Microarchitecture
Kenneth Conley
6.893
9/14/00

21264 Overview
• 64-bit RISC processor
• 500-1000 MHz
• 7-stage pipeline
• 15 million transistors
• 2.2 V, 60 W
• 310 mm² (0.35 micron)
• Target apps: Internet servers, data warehousing, digital video, speech recognition

21264 Fetch Unit
• 4 instructions/cycle, speculative
• Prediction:
  – Line/way predictor for each icache line (2-way, 64K)
  – 3 branch prediction mechanisms
    • Local: 2-level, 10-bit history pattern predictor (e.g. 10101010)
    • Global: history of last 12 branches, 4096-entry, 2-bit saturating counters
    • Chooser: chooses between local/global
  – Prediction tables: 3.6 KB
  – Targets: 6 KB
  – 90-100% accurate on most benchmarks

21264 Dispatch and Execution
• 4 integer execution units (2 clusters)
  – Each maintains a copy of the 80-entry register file
  – Single-cycle latency for basic integer ops
  – Integer population count/leading zero count
  – Fully pipelined multiplier
  – Motion Video Instructions (MVI)
• 2 FP execution units (1 cluster):
  – Upper: Multiply
  – Lower: Add, IEEE Divide, SQRT
  – 72-entry RF

21264 Memory System
• Two 64-bit data buses for icache/dcache
• 32 in-flight loads, 32 in-flight stores
• Dcache increased to 64K (2-way), double-pumped
• L2 Cache:
  – Moved off-chip (increased latency by 6)
  – 4 GB/s sustained bandwidth
• Speculative issue of consumers of loads for 3-cycle integer load hit latency
• 1.3 GB/s sustained bandwidth on McCalpin Stream

Out-of-order execution
• User-visible registers: 32 int/32 float
• Renaming registers: 41 int/41 float
• Renaming map data saved for precise exception handling
• 80-instruction in-flight window, in-order retirement
• Loads can speculatively bypass stores
  – Store wait bits for mis-speculation

21264 Prediction Mechanisms

21264 Execution Units
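
The renaming bullets on the out-of-order execution slide can also be sketched in software. The register counts (32 user-visible, 41 renaming) are taken from the slide; everything else is an assumption for illustration, including the free list, the per-instruction log of old mappings used to restore the map, and the hypothetical names `RenameMap`, `rename`, `retire_oldest`, and `squash_youngest`. The slide only says that map data is saved for precise exceptions, not how.

```python
from collections import deque


class RenameMap:
    """Toy integer register renamer with rollback (mechanism assumed)."""

    ARCH_REGS = 32     # from the slide: user-visible integer registers
    RENAME_REGS = 41   # from the slide: integer renaming registers

    def __init__(self):
        # Architectural register i starts out mapped to physical register i.
        self.table = list(range(self.ARCH_REGS))
        self.free = deque(
            range(self.ARCH_REGS, self.ARCH_REGS + self.RENAME_REGS))
        # One (arch_dest, old_phys, new_phys) record per in-flight
        # instruction, kept in program order.
        self.in_flight = deque()

    def rename(self, dest, srcs):
        """Return (physical sources, physical destination) for one instruction."""
        phys_srcs = [self.table[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register: rename must stall")
        new_phys = self.free.popleft()
        self.in_flight.append((dest, self.table[dest], new_phys))
        self.table[dest] = new_phys
        return phys_srcs, new_phys

    def retire_oldest(self):
        """In-order retirement: the destination's previous mapping is recycled."""
        _, old_phys, _ = self.in_flight.popleft()
        self.free.append(old_phys)

    def squash_youngest(self, count):
        """Undo the newest `count` renames, e.g. after an exception."""
        for _ in range(count):
            dest, old_phys, new_phys = self.in_flight.pop()
            self.table[dest] = old_phys
            self.free.appendleft(new_phys)
```

Restoring the saved mappings from youngest to oldest returns the map to its state at the faulting instruction, which is what makes the exception precise in this sketch.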