Penn CIS 400 - An HDL implementation of an advanced cache coherence protocol

“Screwdriver:” An HDL implementation of an advanced cache coherence protocol

Peter Hornyack ([email protected])
Advisor: Dr. Milo M. K. Martin

Abstract

As the performance limitations of single-core processors are reached, multiprocessing, or linking together multiple processor cores to compute in parallel, is becoming the favored method of increasing computer performance. One of the main difficulties with multiprocessing is cache coherence, or making sure that different processors accessing the same memory do not interfere with each other. “Screwdriver” is a cache coherence protocol inspired by recently proposed protocols that attempt to advance past the limitations of traditional shared-bus and directory-based coherence protocols. Screwdriver incorporates many of the features of these advanced protocols to create a design that avoids race conditions and is easily scalable. Screwdriver was implemented using the Verilog hardware description language (HDL) to allow for realistic simulation as well as physical instantiation on an FPGA. The use of HDL code also allowed the protocol to be evaluated for multiprocessor systems of up to 16 processor cores. The Screwdriver protocol implementation was found to be scalable to some degree, but was limited by the bandwidth of its single memory controller as the number of processors in the system increased.

1. Introduction

As the performance limitations of single-core processors (uniprocessors) are being reached, computer chip manufacturers are increasingly looking to multiprocessing for increased computing performance. Multiprocessors can be constructed by linking together uniprocessor chips with high-speed connections, or by placing multiple processor cores on the same chip. A single core requires less than one million transistors, so mainstream chips containing over a billion transistors now frequently contain two or more processor cores.
Chips containing more than one processor core are known as chip-multiprocessors (CMPs).

One difficulty with implementing multiprocessors is cache coherence. Maximum performance is gained from multiprocessors when each processor core has its own local cache memory in addition to shared main memory. In such a configuration, multiple processors operate in parallel in the same address space, but when a processor executes a store, it writes data only to its local cache. This may leave other processors with out-of-date data in their local caches. To correct this situation, a cache coherence protocol is used to ensure that all processors are aware of changes to the data in their caches.

A field-programmable gate array, or FPGA, is a chip containing user-programmable logic. A design can first be created and simulated in software, then downloaded onto the FPGA to create a physical working system. This process can be repeated many times, allowing extraordinary design flexibility. FPGAs are therefore commonly used for prototyping and in applications requiring a low volume of task-specific processors.

2. Related Work

2.1 The cache coherence problem

The cache coherence problem and proposed solutions to it have existed as long as multiprocessors themselves. Researchers quickly realized that the full potential of multiprocessors could only be realized if each processor core had its own local memory. Configurations in which multiple processors accessed one common memory or memory hierarchy had limited scalability, with the shared memory bus between the processors and the memory becoming inundated with requests and turning into a system bottleneck [5]. While the use of local caches decreased the number of requests to shared memory, it introduced the problem of cache coherence.

The simplest kind of protocol for cache coherence is a broadcast-invalidate protocol, in which the address of a written cache block is broadcast to all processors.
If a processor has that block in its cache, it invalidates the block so it cannot be used until it is updated with the current, correct data [5]. This kind of protocol is also known as a “snooping” protocol, because each processor must “listen” at all times for addresses that are being written to. The most basic broadcast-invalidate protocol is the modified-invalid, or MI, protocol, in which a cache block can be in one of two states: modified (up-to-date) or invalid (out-of-date). Only one processor can have a copy of a block in the M state at any time, ensuring that no processor uses an out-of-date cache block. A more efficient protocol is the modified-shared-invalid, or MSI, protocol. The MSI protocol’s modified and invalid states are the same as in the MI protocol, but an additional “shared” state allows multiple caches to have read-only access to a cache block at the same time [4].

2.2 Traditional cache coherence protocols

A “coherence message” is a processor’s request to either read or write a block that is not currently in its cache (or a block that is in its cache, but for which it does not have the proper permissions). In a multiprocessor system that uses cache coherence, other processors must receive these messages and invalidate or otherwise change the state of their own cache blocks as necessary. Various forms of broadcast-invalidate protocols transmit their coherence messages through the multiprocessor system in different ways. Perhaps the most obvious way is a full interconnect network, with wires directly between each pair of processors (and main memory) for broadcasting invalidate addresses. The problem with a full interconnect network is that it creates race conditions [3].
For example, two processors may broadcast coherence messages one after the other, but due to interconnect delays and contention for resources (e.g., main memory), the later message could reach another processor before the earlier message and lead to incorrect data use.

Two primary methods have been proposed to solve the problem of coherence message races with a broadcast-invalidate protocol. The first is to broadcast coherence messages on a shared bus, rather than an interconnect network, with the shared bus providing “total ordering” for coherence messages; that is, the shared bus ensures that all processors and memory see the same coherence messages in the same order.
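The MI and MSI protocols described in Section 2.1 are state machines per cache block. As a rough illustration only, the following Python sketch models the MSI transitions for one block; the paper's actual implementation is in Verilog, and the event names here (`load`, `store`, `bus_read`, `bus_write`) are illustrative assumptions, not the protocol's real signal names.

```python
# Hedged sketch of the MSI state machine (Python, not the paper's Verilog).
# A block is Modified, Shared, or Invalid; transitions are driven by this
# core's accesses and by coherence messages snooped from other cores.

M, S, I = "Modified", "Shared", "Invalid"

def msi_next_state(state, event):
    """Return the next MSI state for a cache block.

    Local events: 'load' (this core reads), 'store' (this core writes).
    Snooped events: 'bus_read' / 'bus_write' (another core's coherence
    message for this block, observed by this cache).
    """
    transitions = {
        (I, "load"):      S,  # read miss: fetch block with read-only access
        (I, "store"):     M,  # write miss: fetch block with write permission
        (S, "load"):      S,  # read hit: multiple sharers are allowed
        (S, "store"):     M,  # upgrade: other sharers must invalidate
        (S, "bus_write"): I,  # another core is writing: drop our copy
        (M, "load"):      M,  # read/write hits in M need no messages
        (M, "store"):     M,
        (M, "bus_read"):  S,  # another core reads: demote to shared
        (M, "bus_write"): I,  # another core writes: invalidate our copy
    }
    return transitions.get((state, event), state)

# Two cores sharing one block: both read it, then core P1 writes it.
state_p0 = msi_next_state(I, "load")              # P0: Invalid -> Shared
state_p1 = msi_next_state(I, "load")              # P1: Invalid -> Shared
state_p1 = msi_next_state(state_p1, "store")      # P1: Shared -> Modified
state_p0 = msi_next_state(state_p0, "bus_write")  # P0 snoops P1's write
print(state_p0, state_p1)                         # Invalid Modified
```

Note how the MI protocol falls out as the special case with no `Shared` state: every read miss would go straight to `Modified`, so only one core could ever hold a readable copy at a time.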
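The message race on a full interconnect, and the fix provided by a totally ordered shared bus, can be illustrated with a small sketch. This is an assumed toy model in Python (the message format and helper below are invented for illustration): each observer applies the update messages it receives in arrival order, and the last one applied wins.

```python
# Illustrative sketch (not from the paper) of coherence message races.
# On a full interconnect, delays can reorder two cores' messages differently
# at different observers, so the observers disagree on the final value.

def apply_messages(messages):
    """Apply a sequence of (writer, value) update messages to one observer's
    copy of a block; the last message applied wins."""
    value = None
    for _writer, v in messages:
        value = v
    return value

msg_a = ("P0", 1)  # P0 writes 1 to a block
msg_b = ("P1", 2)  # P1 writes 2 to the same block shortly after

# Full interconnect: per-link delays reorder the messages per observer.
observer_x = apply_messages([msg_a, msg_b])  # sees P0's write last? no: P1's
observer_y = apply_messages([msg_b, msg_a])  # sees P0's write last
print(observer_x == observer_y)              # False: caches are incoherent

# Shared bus with total ordering: the bus serializes the broadcasts, so
# every observer applies the same sequence and agrees on the final value.
bus_order = [msg_a, msg_b]
observer_x = apply_messages(bus_order)
observer_y = apply_messages(bus_order)
print(observer_x == observer_y)              # True: all observers see 2
```

The total ordering does not dictate *which* write wins, only that every processor and the memory agree on the winner, which is exactly the property the shared-bus approach provides.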

