Memory Access Latency

Goal:
To measure via software, and to verify with a digital logic analyzer, the access latency of a memory transaction request.

Basic Idea:
Memory access latency from the processor to DRAM, then back to the processor, takes on the order of ~100 ns for a processor that utilizes an off-chip memory controller.

We can measure memory access latency with a pre-processor attached to a logic analyzer. The pre-processor is an adaptor that sits on the processor bus interface and observes the traffic that passes through the processor bus.

Unfortunately, we don't have a pre-processor/adaptor at our disposal. Without one, we can't track all of the address/data signals and see exactly which transaction took how long. However, even without the pre-processor we can still measure the memory access latency. All we have to do is run the right software and probe the right signals on the processor bus.

[Figure 1: Progression of a Memory Read Transaction Request Through the Memory System. Part A: searching on-chip for data (processor core, DTLB, L1 cache, L2 cache, BIU (Bus Interface Unit); steps A1-A4). Part B: going off-chip for data (system controller handling I/O-to-memory traffic, memory request scheduling, physical-to-memory address mapping, and a read data buffer, then the DRAM core; steps B1-B8).]

The Right Software:
I have written a simple pointer chasing benchmark. The spirit of the code arises out of the popular STREAM and lmbench benchmarks, respectively. The benchmark simply initializes a large (>32 MB) array of data, A[], in such a manner that the data of array element A[i] contains the address of array element A[j], where A[j] is located in a different cache block.
In essence, the array is nothing more than a large linked list, where each element of the linked list is the address of the next element. In this manner, each linked list traversal costs a level of memory indirection from one cache block to another cache block. We can then make use of a coarse-grained system timer (or even the fine-grained RDTSC), divide by the number of linked list traversals, and arrive at a figure for the time spent per traversal.

Timeline of a Typical Memory Request:
In the diagram below, we describe the chain of events, starting from the initiation of a load instruction to its completion in a processor pipeline, that occur while obtaining data from the memory subsystem. This chain of events applies to processors employing either in-order or out-of-order instruction execution, utilizing the Intel Pentium Pro(tm) style bus protocol.

    start_timer();
    next_address = (int **) *next_address;   /* do a million of these */
    stop_timer();

Figure 2: Pseudo-code for Linked List Traversal

[Figure 3: Execution of a load instruction in a modern processor. The stages of instruction execution (Fetch, Decode, Exec, Mem, WB) lead into the cache hierarchy: virtual-to-physical address translation (DTLB access), then the L1 D-cache access; if that misses, the L2 cache; if that misses, the request is sent to the BIU. Under the Pentium Pro style bus protocol, the BIU arbitrates for ownership of the bus, the request is sent to all CPUs and the chipset, error checks ensure a good request, the other processors check the address for cache coherence, and the memory controller obtains the data from the DRAM array (row address setup, column address setup, data acquisition) before the data returns from the memory controller to the BIU.]

The diagram illustrates an abstract processor model. Some processor implementations may contain additional load-store buffers or write caches, utilize virtually indexed caches, or access the DTLB in parallel with cache accesses.
However, although such architecture-specific details may change the number of memory transaction requests on the bus, or affect the latency of load requests that stay on chip, they do not affect the conclusions drawn on the average latency of load requests that must go off chip and utilize the processor bus. We therefore ignore them for the sake of simplification.

In this example, a load request first goes through the DTLB, where the address of the load instruction is translated from a virtual address to a physical address. The load request then proceeds to the L1 data cache; if the address of the load request is matched in the L1 cache tag array, the load request may be satisfied at this point, and no further action is needed. However, if the load request misses the L1 cache, the request then proceeds to the L2 cache, and the same search process follows. Finally, if the requested data cannot be found in the processor caches, the load request is passed to the Bus Interface Unit (BIU) to fetch the data from off-chip memory.

The Pentium Pro Bus Protocol
In this section, we give a brief overview of the Intel Pentium Pro bus protocol. The Pentium Pro bus protocol was first utilized on the Pentium Pro processor, and runs at a frequency of 66 MHz. The same bus protocol, with minor changes to the signalling specifications, was later deployed by Intel on the Pentium II and Pentium III series of processors at higher operating frequencies of 100 and 133 MHz. All transactions initiated on the Pentium Pro bus proceed through a succession of six phases from inception to completion: the Arbitration Phase, followed by the Request Phase, the Error Phase, the Snoop Phase, the Response Phase, then finally the Data Phase, in which the data may be sent to or returned from the system controller.
Some transactions consist of an acknowledgment message to a prior transaction request and do not require the use of the signals in the data phase. However, all of the transactions used in our simulations require full utilization of the signals in the data phase.

In the illustration shown as Figure 1, once the load request misses the processor caches, the request is sent to the Bus Interface Unit (BIU). At this point, the BIU enters the Arbitration Phase by asserting the appropriate signals onto the processor bus, and attempts to gain ownership of the processor bus. If the arbitration for the processor bus is successful, the transaction will then be able to enter into the



UMD ENEE 759H - Memory Access Latency
