Computer Architecture and Engineering CS152 Quiz 5 April 29 2010 Professor Krste Asanovic Name ANSWER KEY This is a closed book closed notes exam 80 Minutes 10 Pages Notes Not all questions are of equal difficulty so look over the entire exam and budget your time carefully Please carefully state any assumptions you make Please write your name on every page in the quiz You must not discuss a quiz s contents with students who have not yet taken the quiz If you have inadvertently been exposed to the quiz prior to taking it you must tell the instructor or TA You will receive no credit for selecting multiple choice answers without giving explanations if the instructions ask you to explain your choice Writing name on each sheet 1 Point Question 1 18 Points Question 2 15 Points Question 3 20 Points Question 4 26 Points TOTAL 80 Points NAME Problem Q5 1 Sequential Consistency 18 points In this problem we consider the implementation of sequential consistency SC in a cache coherent multiprocessor that uses a snoopy bus A processor can intervene in a bus transaction by asserting the retry signal causing the processor that initiated the request to attempt it again later The bus supports only one outstanding cache miss at a time for the entire system Each processor is equipped with a non blocking data cache that supports hit under miss while the cache is processing a miss accesses that hit in the cache can still proceed The cache does not support miss under miss so it will block in the event of a second miss A processor sends loads and stores to its cache in program order Problem Q5 1 A 10 points Without further modifications the hit under miss scheme can lead to violations of sequential consistency Carefully explain why Additionally provide a short pseudocode sequence for two processors that could result in a sequentially inconsistent execution Indicate whether each data memory access in your example is a cache hit or miss Hit under miss enables younger memory accesses that hit in the cache to be made visible before an older memory access that misses in the cache Another processor could see these accesses out of program order which violates SC Assume that variables X and Y start out as 0 Because the write to Y may proceed before the write to X the following sequence of accesses could result in r1 1 r2 0 which is clearly a violation of SC Processor 0 Hit Miss Processor 1 Hit Miss X 1 M r1 Y M Y 1 H r2 X H NAME Problem Q5 1 B 8 points Fortunately it is possible to make the hit under miss scheme appear to be sequentially consistent by leveraging the snoopy bus Describe how to implement SC while maintaining hit under miss Reordering only violates SC if other processors can observe it The only way for e g P1 to observe P0 s accesses is via a bus transaction So if P1 initiates a bus transaction while P0 is servicing a miss that has been hit under P0 can tell P1 to retry until the miss has been resolved This way the miss and all subsequent hits are made visible at the same time NAME Problem Q5 2 Relaxed Memory Models 15 points The following code implements a seqlock which is a reader writer lock that supports a single writer and multiple readers The writer never has to wait to update the data protected by the lock but readers may have to wait if the writer is busy We use a seqlock to protect a variable that holds the current time The lock is necessary because the variable is 64 bits and thus cannot be read or written atomically on a 32 bit system The seqlock is implemented using a sequence number seqno which is initially zero The writer begins by incrementing seqno It then writes the new time value which is split into the 32 bit values time lo and time hi Finally it increments seqno again Thus if and only if seqno is odd the writer is currently updating the counter The reader begins by waiting until seqno is even It then reads time lo and time hi Finally it reads seqno again If seqno didn t change from the first read then the read was successful otherwise the read is retried This code is correct on a sequentially consistent system but on a system with a fully relaxed memory model it may not be Insert the minimum number of memory fences to make the code correct on a system with a relaxed memory model To insert a fence write the needed fence MembarLL MembarLS MembarSL MembarSS in between the lines of code below Writer Reader LOAD Rseqno seqno ADD Rseqno Rseqno 1 STORE seqno Rseqno MembarSS STORE time lo Rtime lo Loop LOAD Rseqno before seqno IF Rseqno before 1 goto Loop MembarLL LOAD Rtime lo time lo LOAD Rtime hi time hi MembarLL STORE time hi Rtime hi MembarSS LOAD Rseqno after seqno ADD IF Rseqno before Rseqno after goto Loop Rseqno Rseqno 1 STORE seqno Rseqno NAME Problem Q5 3 Directory Protocols 20 points Alice P Hacker is designing a large scale cache coherent multiprocessor Recalling that directories are more scalable than snoopy buses Alice decides to implement a full map directory to maintain coherence Problem Q5 3 A 6 points Alice s first directory design maintains a bit vector for each cache line sized block of main memory For a given memory block each bit in the bit vector represents whether or not one processor in the machine has that line cached For a 256 processor system with 64B cache lines and 8GB of main memory how large would Alice s directory be Directory size of cache lines of processors bits 8GB 64B 256 bits 32 Gbits 4GB 4GB Problem Q5 3 B 6 points Alice realizes that her initial design might be impractical to implement She instead considers a directory organization that supports up to four sharers of a line at a time If a fifth processor requests a copy of the line one of the original four sharers is invalidated For each memory line this approach requires four processor IDs and four valid bits Now how large is Alice s directory Directory size of cache lines 4 1 log2 of processors bits 8GB 64B 36 bits 9 0 5 Gbits 576MB NAME 576MB Problem Q5 3 C 8 points Alice is concerned about the performance of the directory that supports only four sharers at a time As a final design alternative she considers also adding a globally shared bit to each line Now more than four processors can share the line but if a fifth processor requests a copy of a cache line the globally shared bit is set If some processor attempts to write a globally shared line the directory no longer knows precisely which processors have a copy so it sends invalidations to all processors in the system Consider a program
View Full Document
Unlocking...