
Scalable Hardware Memory Disambiguation for High ILP Processors

Simha Sethumadhavan, Rajagopalan Desikan†, Doug Burger, Charles R. Moore, Stephen W. Keckler
Computer Architecture and Technology Laboratory
Department of Computer Sciences
†Department of Electrical and Computer Engineering
The University of Texas at Austin
[email protected] - www.cs.utexas.edu/users/cart

Abstract

This paper describes several methods for improving the scalability of memory disambiguation hardware for future high ILP processors. As the number of in-flight instructions grows with issue width and pipeline depth, the load/store queues (LSQ) threaten to become a bottleneck in both power and latency. By employing lightweight approximate hashing in hardware with structures called Bloom filters, many improvements to the LSQ are possible.

We propose two types of filtering schemes using Bloom filters: search filtering, which uses hashing to reduce both the number of lookups to the LSQ and the number of entries that must be searched, and state filtering, in which the number of entries kept in the LSQs is reduced by coupling address predictors and Bloom filters, permitting smaller queues. We evaluate these techniques for LSQs indexed by both instruction age and the instruction's effective address, and for both centralized and physically partitioned LSQs. We show that search filtering avoids up to 98% of the associative LSQ searches, providing significant power savings and keeping LSQ searches to under one high-frequency clock cycle. We also show that with state filtering, the load queue can be eliminated altogether with only minor reductions in performance for small instruction window machines.

1. Introduction

Computer architects have been improving the performance of processors by implementing deeper pipelines [28], wider issue, and larger out-of-order issue windows.
These trends produce machines in which more than one hundred instructions may be in flight [16]. Recently, several researchers have proposed techniques for scaling issue windows to sizes of hundreds or thousands of instructions [8, 19, 22]. In all these actual and proposed machines, hardware must perform dynamic memory disambiguation to guarantee that a memory ordering violation does not occur.

For any system capable of out-of-order memory issue, the memory ordering requirements are threefold. First, the hardware must check each issued load to determine if an earlier (program order) in-flight store was issued to the same physical address, and if so, use the value produced by the store. Second, each issued store must check to see if a later (program order) load to the same physical address was previously issued, and if so, take corrective action. Third, the hardware should ensure that loads and stores reach the memory system in the order specified by the memory consistency model. In many processors, the hardware that implements these requirements is called the load/store queue (LSQ).

One disadvantage of current LSQ implementations is that the detection of memory ordering violations requires frequent searches of considerable state. In a naive LSQ implementation, every in-flight memory instruction is stored in the LSQ. Thus, as the number of instructions in flight increases, so does the number of entries that must be searched in the LSQ to guarantee correct memory ordering. Both the access latency and the power requirements of LSQ searches scale super-linearly with increases in the amount of state, as the LSQ is typically implemented using a CAM structure [2]. As we show in the next section, simply reducing the size of traditional LSQ designs for future machines causes an unacceptable drop in performance, whereas not doing so incurs unacceptable LSQ access latencies and power consumption.
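The first two ordering requirements above can be illustrated with a small software model. This is a behavioral sketch only, not the paper's hardware design; the class and method names (LoadStoreQueue, issue_load, issue_store) are invented for illustration, and the model ignores sizes, widths, and the consistency-model requirement.

```python
# Behavioral sketch of the two LSQ ordering checks described above.
# All names here are illustrative assumptions, not from the paper.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MemOp:
    age: int                     # program-order sequence number
    is_store: bool
    addr: int
    value: Optional[int] = None  # store data (loads carry no value here)

class LoadStoreQueue:
    def __init__(self):
        self.entries = []  # every in-flight memory op is allocated here

    def issue_load(self, load: MemOp) -> Optional[int]:
        """Check 1: find the youngest earlier (program-order) in-flight
        store to the same address; if one exists, forward its value."""
        match = None
        for op in self.entries:
            if op.is_store and op.addr == load.addr and op.age < load.age:
                if match is None or op.age > match.age:
                    match = op
        self.entries.append(load)
        return match.value if match else None  # None => read from memory

    def issue_store(self, store: MemOp) -> list:
        """Check 2: find later (program-order) loads to the same address
        that already issued; each is an ordering violation needing
        corrective action (e.g. a pipeline flush and re-execution)."""
        violations = [op for op in self.entries
                      if not op.is_store and op.addr == store.addr
                      and op.age > store.age]
        self.entries.append(store)
        return violations
```

Note that every check is an associative scan over all in-flight entries; this is the CAM-style search whose latency and power the paper's filtering schemes aim to avoid.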
These traditional structures thus have the potential to be a significant bottleneck for future systems.

The technique evaluated in this paper to mitigate these LSQ scalability limits is approximate hardware hashing. We implement low-overhead hash tables with Bloom filters [3], a structure in which a load or a store address is hashed to a single bit. If the bit is already set, there is a likely, but not a certain, address match with another load or store. If the bit is unset, there cannot be an address match with another load or store. We use Bloom filters to evaluate the following LSQ improvements:

- Search filtering: Each load and store indexes into a location in the Bloom filter (BF) upon execution. If the indexed bit is set in the BF, a possible match has occurred, and the LSQ must be searched. If the indexed bit is clear, the bit is then set in the BF, but the LSQ need not be searched. However, all memory operations must still be allocated in the LSQ. This scheme reduces LSQ searches by 73-98% depending on the machine configuration.

- Partitioned search filtering: Multiple BFs each guard a different bank of a banked LSQ. When a load or store is executed, all the BFs are indexed in parallel. LSQ searches occur only in the banks where the indexed bit in the BF is set. This policy enables a banked CAM structure which reduces both the number of LSQ searches and the number of banks that must be searched. This scheme reduces the number of entries that must be searched by 86%.

- Load state filtering: A predictor examines each load upon execution and predicts if a store to the same address is likely to be encountered during the lifetime of the load. If so, the load is stored in the memory ordering queues. If the prediction is otherwise, the load address is hashed in a load BF and is not kept in any memory ordering queues.

Proceedings of the 36th International Symposium on Microarchitecture (MICRO-36'03) 0-7695-2043-X/03 $17.00 © 2003 IEEE
When stores execute, they check the load BF, and if a match occurs, a dependence violation may have occurred and the machine must perform recovery. With this scheme, the load queue can be completely eliminated, at a cost of 3% in performance for small instruction window machines. For large window machines, however, our results show that the currently used BF's hash functions cause too many unnecessary flushes and are not an effective solution.

With these schemes, we show that the area required for LSQs can be reduced marginally and, more importantly, that the power and latency for maintaining sequential memory semantics can be significantly reduced. This, in turn, alleviates a significant scalability bottleneck to higher-performance architectures requiring large LSQs.
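The search-filtering scheme described above can be sketched as follows. This is an illustrative software model under assumptions of our own: the filter size, the single modulo hash on the cache-line address, and the class name SearchFilter are all invented for the example, and the conservative clearing of bits when operations commit is omitted.

```python
# Illustrative sketch of search filtering: a Bloom filter (BF) in front
# of the LSQ screens out associative searches that cannot match.
# Sizes and the hash function are assumptions, not the paper's design.
class SearchFilter:
    def __init__(self, num_bits: int = 1024):
        self.num_bits = num_bits
        self.bits = [False] * num_bits
        self.lsq_searches = 0   # full associative searches actually done
        self.total_ops = 0

    def _index(self, addr: int) -> int:
        # Assumed hash: modulo of the 64-byte cache-line address.
        return (addr >> 6) % self.num_bits

    def execute(self, addr: int) -> bool:
        """Called when a load or store executes. Returns True if the
        LSQ must be associatively searched for this address."""
        self.total_ops += 1
        i = self._index(addr)
        if self.bits[i]:
            # Bit already set: a possible (not certain) match is in
            # flight, so the LSQ search cannot be skipped.
            self.lsq_searches += 1
            return True
        # Bit clear: no in-flight op hashed here, so no match is
        # possible. Set the bit and skip the search entirely.
        self.bits[i] = True
        return False
```

Because a set bit only means a *possible* match, skipped searches are always safe, while the occasional false positive merely costs an unnecessary search. Partitioned search filtering follows the same pattern with one such filter per LSQ bank, indexed in parallel, so that only the banks whose bit is set are searched.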

