

Grouped Prefetching: Maximizing Resource Utilization
Weston Harper, Justin Mintzer, Adrian Valenzuela
Rice University – ELEC 525 Final Report

Abstract

Prefetching is a common method of preventing memory stalls, but it depends on prefetch requests returning in time relative to their consumption by an operation. Additionally, adding prefetches increases contention on the memory system buses, which is counterproductive to ensuring the timely service of those prefetches. By issuing prefetches as a single grouped access across the memory buses, bus contention can be reduced. However, this improvement in bus contention over normal prefetching is only relevant for applications that place sufficient stress on the memory system to saturate the buses. That may become an increasingly common occurrence as processors continue to speed up faster than memory systems.

1 INTRODUCTION

As the response time of memory relative to processor speed continues to degrade, more focus is being placed on avoiding the increasingly large memory stalls. One group of methods developed to address this issue is the prefetching of instructions or data into memory before they are actually called for by an operation. However, the ability of a prefetch to convert a miss into a hit depends on whether the prefetch is serviced before the information is required by an operation. Many factors affect how quickly a prefetch can be serviced, including the hit latency of structures in the memory hierarchy, memory bus speed, memory bus contention, and contention in the memory hierarchy with other memory requests that require servicing. Adding many prefetches to the memory service stream can also have a negative impact on the memory system's ability to service normal memory requests, since those requests are now forced to contend with the prefetches. The role of bus speed and contention in servicing prefetches suggests that better use of the memory hierarchy buses could improve the impact prefetches have on performance.

Motivation

First it is useful to define some attributes that describe how well a prefetch system works. The coverage factor of a prefetcher is the fraction of cache misses that are changed to hits by prefetching. An unnecessary prefetch is one that does not need to be issued because the requested line is either already in the cache or a prefetch has already been issued for that line. A useless prefetch is one that brings in a line that is evicted from the cache without ever being used, either because the prefetched line goes unused or because it is brought in too early and evicted before it can be used. A useful prefetch is one that brings in a line that results in a cache hit before the line is replaced. These terms help describe how the prefetch distance, the time between issuing a prefetch request and the use of the prefetched line, affects the coverage factor. A prefetch must be issued early enough that its line is ready before a cache miss is registered, but not so early that the line is evicted before being used and becomes a useless prefetch. Thus a critical factor in determining the effectiveness of a prefetcher is its timeliness [1].
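To make these definitions concrete, the following small C sketch tallies the three prefetch categories and computes the coverage factor from them. It is only an illustration of the bookkeeping; the struct and field names are hypothetical and are not taken from the paper's simulator.

    /* Illustrative accounting for the prefetch metrics defined above.
       Struct and field names are hypothetical, not from the paper. */
    #include <stdio.h>

    struct prefetch_stats {
        unsigned long baseline_misses;        /* misses the cache would take with no prefetching */
        unsigned long useful_prefetches;      /* prefetched lines hit before being replaced      */
        unsigned long useless_prefetches;     /* prefetched lines evicted without being used     */
        unsigned long unnecessary_prefetches; /* line already cached or already being prefetched */
    };

    /* Coverage factor: fraction of baseline misses converted to hits.
       Each useful prefetch turns exactly one would-be miss into a hit. */
    static double coverage_factor(const struct prefetch_stats *s)
    {
        if (s->baseline_misses == 0)
            return 0.0;
        return (double)s->useful_prefetches / (double)s->baseline_misses;
    }

    int main(void)
    {
        struct prefetch_stats s = { 1000, 400, 150, 75 };         /* made-up counts  */
        printf("coverage factor = %.2f\n", coverage_factor(&s));  /* prints 0.40     */
        return 0;
    }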
Assuming that it is more difficult to ensure a prefetch arrives early enough to hide a miss than it is to prevent a prefetch from arriving too early and becoming useless, we can examine the memory hierarchy buses as a factor in the timeliness of a prefetch. Under this assumption, it is desirable to minimize the number of memory requests competing for bus access so that queuing delays are decreased. It is also desirable to maximize the number of memory requests that can be sent to the next level of memory at a time so that the lower-level memory structure can begin processing them sooner. The question remains whether there exist multiple contending memory accesses during prefetching that would benefit from reduced contention. Figure 1 shows that with a next-n-line prefetcher there is an increased number of references contending for access to main memory.

Figure 1: Average number of outstanding memory requests

Hypothesis

A major component of communication delay for prefetch units is bus contention. By adding the ability for memory to handle group requests from prefetch units, the average bus contention delay per memory access can be significantly reduced. In group prefetching, multiple prefetch requests are calculated simultaneously and sent to the bus in a single packet instead of as multiple bus accesses. Fortunately, many existing prefetchers operate by generating a list of lines predicted to be needed in response to a single cache miss, and thus already provide a set of requests to group into a single bus access.

The rest of the paper is organized as follows. Section 2 describes the architecture of our proposed group prefetching method. Section 3 gives the experimental methodology we used, including the additions to SimpleScalar and the benchmarks we ran. Section 4 presents the results of the simulations we ran, and Section 5 concludes the paper.

2 ARCHITECTURE

Next-n-Line Prefetcher

A basic next-n-line prefetcher [2] is used to generate prefetch requests. On a miss in the cache, logic attached to the prefetcher checks whether the next N lines after the missed line are already in the attached cache and then issues requests for those lines that are missing.

Group Prefetch Communicator

In a normal next-n-line prefetcher this process is carried out by issuing individual sequential requests across the bus to the lower-level memory structure to service each prefetch. In our proposed architecture, an additional mechanism is added to the prefetch unit. This prefetch communicator collects the results of the check on which of the next N lines actually require prefetching and stores the results as a bit vector. Instead of sending each prefetch request sequentially, the prefetch communicator sends the base address of the first necessary prefetch, a bit flag indicating that a group prefetch is being made, and the sparse vector that indicates, by offset from the base address, the locations of the other prefetches that require servicing by the lower-level memory structure. When the request reaches the lower levels of memory, the memory
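A minimal C sketch of the grouped request just described is given below. It is only an illustration of the mechanism under stated assumptions: the prefetch degree, line size, request layout, cache-lookup stub, and bus hook are all assumed for the example rather than taken from the paper.

    /* Sketch of a group-prefetch request: one bus transaction carrying a
       base address, a group flag, and a bit vector of which of the next N
       lines still need to be fetched. All names here are illustrative. */
    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define PREFETCH_DEGREE 8      /* N in the next-n-line prefetcher   */
    #define LINE_SIZE       64     /* assumed cache line size in bytes  */

    struct group_prefetch_req {
        uint64_t base_addr;        /* first line that needs fetching            */
        bool     is_group;         /* distinguishes this from a normal request  */
        uint32_t offset_vec;       /* bit i set => line (base + i) is wanted    */
    };

    /* Stub cache lookup for the sketch: pretend even-numbered lines are
       already cached, so they would be unnecessary prefetches.           */
    static bool cache_contains(uint64_t line_addr)
    {
        return ((line_addr / LINE_SIZE) % 2) == 0;
    }

    /* Stub bus hook: a real model would queue one bus transaction here.  */
    static void bus_send(const struct group_prefetch_req *req)
    {
        printf("group prefetch: base=0x%llx offsets=0x%x\n",
               (unsigned long long)req->base_addr, req->offset_vec);
    }

    /* On a miss at miss_addr, check the next N lines and, if any are
       absent, issue one grouped request instead of N sequential ones.    */
    static void issue_group_prefetch(uint64_t miss_addr)
    {
        struct group_prefetch_req req = { 0, true, 0 };
        bool have_base = false;

        for (int i = 1; i <= PREFETCH_DEGREE; i++) {
            uint64_t line = miss_addr + (uint64_t)i * LINE_SIZE;
            if (cache_contains(line))
                continue;                  /* skip unnecessary prefetches        */
            if (!have_base) {
                req.base_addr = line;      /* first missing line anchors the group */
                have_base = true;
            }
            /* record this line's offset (in lines) from the base address */
            req.offset_vec |= 1u << ((line - req.base_addr) / LINE_SIZE);
        }

        if (have_base)
            bus_send(&req);                /* single bus access for the whole group */
    }

    int main(void)
    {
        issue_group_prefetch(0x10000);     /* simulate a demand miss at this address */
        return 0;
    }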

