Berkeley COMPSCI 152 - Lecture 13 – Cache I

CS 152 Computer Architecture and Engineering
Lecture 13 – Cache I
John Lazzaro (www.cs.berkeley.edu/~lazzaro)
TAs: Udam Saini and Jue Sun
2006-10-12 | www-inst.eecs.berkeley.edu/~cs152/
UC Regents Fall 2006 © UCB
[Title slide art: "A cosmic ray hits a DRAM cell ..."]

Today: Caches and the Memory System
- Memory hierarchy: the technology motivation for caching.
- Locality: why caching works.
- Cache design: a final-project component.
[Figure: the five classic computer components - Control, Datapath, Memory, Input, Output.]

1977: DRAM faster than microprocessors
The Apple ][ (1977), built by Steve Wozniak and Steve Jobs. CPU cycle time: 1000 ns. DRAM access time: 400 ns.

Since then: technology scaling ...
A circuit that is H nanometers long in 250 nm technology (introduced in 2000) is 0.7 x H nm long in 180 nm technology (introduced in 2003). Each dimension is 30% smaller, so area is about 50% smaller. Logic circuits use the smaller C's, lower Vdd, and higher kn and kp to speed up clock rates.

DRAM scaled for more bits, not more MHz
Assume Ccell = 1 fF. A bit line may have 2000 nFET drains on it; assume a bit-line capacitance of 100 fF, i.e. 100 * Ccell. A cell holds charge Q = Ccell * (Vdd - Vth). When we dump this charge onto the bit line, what voltage do we see?

  dV = [Ccell * (Vdd - Vth)] / [100 * Ccell] = (Vdd - Vth) / 100 ≈ tens of millivolts!

In practice, the array is scaled to get a 60 mV signal.

1980-2003: CPU speed outpaced DRAM ...
[Plot: performance (1/latency) on a log scale from 10 to 10000, vs. year from 1980 to 2005; the CPU curve ends at the 2005 "power wall".]
CPU performance grew 60% per year (2x in 1.5 years); DRAM grew 9% per year (2x in 10 years). The gap grew 50% per year.
Q. How do architects address this gap?
A. Put smaller, faster "cache" memories between the CPU and DRAM. Create a "memory hierarchy".
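The charge-sharing result and the growth-rate claim above are easy to check numerically. Below is a minimal Python sketch of the slides' arithmetic; the Vdd and Vth values are illustrative assumptions of mine, since the slide leaves them symbolic.

```python
# Quick numeric check of the two slide calculations above.
# Vdd and Vth are illustrative assumptions; the slide leaves them symbolic.

C_CELL = 1e-15            # DRAM cell capacitance: 1 fF (from the slide)
C_BITLINE = 100 * C_CELL  # bit-line capacitance: 100 fF = 100 * Ccell (from the slide)
VDD, VTH = 2.5, 0.5       # assumed supply and threshold voltages, in volts

# Charge sharing: the cell's charge Q = Ccell * (Vdd - Vth) is dumped
# onto the 100x-larger bit-line capacitance.
q_cell = C_CELL * (VDD - VTH)
dv = q_cell / C_BITLINE   # = (Vdd - Vth) / 100
print(f"bit-line swing: {dv * 1e3:.0f} mV")  # 20 mV: tens of millivolts

# CPU vs. DRAM growth rates from the slide: 60%/yr vs. 9%/yr.
cpu_growth, dram_growth = 1.60, 1.09
gap = cpu_growth / dram_growth - 1.0
print(f"gap grows {gap:.0%} per year")  # ~47%/yr, i.e. ~50% in round numbers
```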
Caches: variable-latency memory ports
[Figure: the CPU's address and data ports connect to a small, fast upper-level memory backed by a large, slow lower-level memory; blocks X and Y migrate between the two levels.]
Data in the upper level is returned to the processor with lower latency; data found only in the lower level is returned with higher latency.

Cache replaces data and instruction memory
[Figure: the five-stage pipeline - IF (Fetch), ID (Decode), EX (ALU), MEM, WB - with its instruction memory and data memory highlighted.]
Replace the pipeline's instruction memory and data memory with an Instruction Cache and a Data Cache of DRAM main memory.

Recall: Intel ARM XScale CPU (PocketPC)
32 KB Instruction Cache, 32 KB Data Cache, 180 nm process (introduced 2003).
Excerpt (IEEE Journal of Solid-State Circuits, vol. 36, no. 11, November 2001; Fig. 1 of the paper shows a process SEM cross section):

The process Vt was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low-voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the Vt-versus-L dependence, and source-to-body bias is used to electrically limit transistor leakage in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high-performance and low-voltage operation.

III. ARCHITECTURE
The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation, and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance in a pipeline deeper than earlier high-performance ARM designs [2], [3].

A. Pipeline Organization
To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in Thumb mode an extra pipe stage is inserted after the last fetch stage to convert Thumb instructions into ARM instructions. Since Thumb-mode instructions [11] are 16 b, two instructions are fetched in parallel while executing Thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2 (microprocessor pipeline organization), where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.

The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that its result be shifted is infrequent.

Decoupled instruction fetch. A two-instruction-deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.

Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.

One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re...

2005 Memory Hierarchy: Apple iMac G5
iMac G5, 1.6 GHz, $1299.00

  Level             Reg   L1 Inst   L1 Data   L2     DRAM   Disk
  Size              1K    64K       32K       512K   256M   80G
  Latency (cycles)  1     3         3         11     160    1e7

Let programs address a memory space that scales to the ...
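One way to read the table: fold its latencies into an average memory access time (AMAT). The sketch below uses the slide's cycle counts, but the miss rates are hypothetical assumptions of mine, not figures from the lecture.

```python
# Average memory access time (AMAT) through the iMac G5 hierarchy above.
# Latencies come from the slide's table; the miss rates are illustrative
# assumptions, not figures from the lecture.

L1_LATENCY = 3       # cycles (table)
L2_LATENCY = 11      # cycles (table)
DRAM_LATENCY = 160   # cycles (table)

L1_MISS_RATE = 0.05  # assumed: 5% of accesses miss in L1
L2_MISS_RATE = 0.20  # assumed: 20% of those also miss in L2

# Classic two-level AMAT model: every access pays the L1 latency; misses
# fall through and additionally pay the latency of the level below.
amat = L1_LATENCY + L1_MISS_RATE * (L2_LATENCY + L2_MISS_RATE * DRAM_LATENCY)
print(f"AMAT = {amat:.2f} cycles")  # 5.15 cycles, vs. 160 to reach DRAM every time
```

Under these assumptions an average access costs about 5 cycles instead of 160: the hierarchy lets a program address a large memory space at close to cache speed.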

