Lecture 13: DRAM Innovations

• Today: energy efficiency, row buffer management, scheduling

Latency and Power Wall

• Power wall: 25-40% of datacenter power can be attributed to the DRAM system
• Latency and power can both be improved by employing smaller arrays; incurs a penalty in density and cost
• Latency and power can both be improved by increasing the row buffer hit rate; requires intelligent mapping of data to rows, clever scheduling of requests, etc.
• Power can be reduced by minimizing overfetch – either read fewer chips or read parts of a row; incurs penalties in area or bandwidth

Overfetch

• Overfetch is caused by multiple factors:
  – Each array is large (fewer peripherals → more density)
  – Involving more chips per access → more data transfer → pin bandwidth
  – More overfetch → more prefetch; helps apps with locality
  – Involving more chips per access → less data loss when a chip fails → lower overhead for reliability

Re-Designing Arrays (Udipi et al., ISCA’10)

Selective Bitline Activation

• Two papers in 2010: Udipi et al., ISCA’10, and Cooper-Balis and Jacob, IEEE Micro
• Additional logic per array so that only the relevant bitlines are read out
• Essentially results in finer-grain partitioning of the DRAM arrays

Rank Subsetting

• Instead of using all chips in a rank to read out 64-bit words every cycle, form smaller parallel ranks
• Increases data transfer time; reduces the size of the row buffer
• But, lower energy per row read and compatible with modern DRAM chips
• Increases the number of banks and hence promotes parallelism (reduces queuing delays)
• Mini-Rank, MICRO’08; MC-DIMM, SC’09

Row Buffer Management

• Open Page policy: maximizes row buffer hits, minimizes energy
• Close Page policy: helps performance when there is limited locality
• Hybrid policies: can close a row buffer after it has served its utility; lots of ways to predict utility: time, accesses, locality counters for a bank, etc.

Micro-Pages (Sudan et al., ASPLOS’10)

• Organize data across banks to maximize locality in a row buffer
• Key observation: most 
locality is restricted to a small portion of an OS page
• Such hot micro-pages are identified with hardware counters and co-located on the same row
• Requires hardware indirection to a page’s new location
• Works well only if most activity is confined to a few micro-pages

Scheduling Policies

• The memory controller must manage several timing constraints and issue a command when all resources are available
• It must also maximize row buffer hit rates, fairness, and throughput
• Reads are typically given priority over writes; the write buffer must be drained when it is close to full; changing the direction of the bus requires a 5-10 ns delay
• Basic policies: FCFS, First-Ready-FCFS (prioritize row buffer hits)

STFM (Mutlu and Moscibroda, MICRO’07)

• When multiple threads run together, threads with row buffer hits are prioritized by FR-FCFS
• Each thread has a slowdown: S = Talone / Tshared, where T is the number of cycles the ROB is stalled waiting for memory
• Unfairness is estimated as Smax / Smin
• If unfairness is higher than a threshold, thread priorities override other priorities (Stall Time Fair Memory scheduling)
• Estimation of Talone requires some book-keeping: does an access delay critical requests from other threads?

PAR-BS (Mutlu and Moscibroda, ISCA’08)

• A batch of requests (per bank) is formed: each thread can only contribute R requests to this batch; batch requests have priority over non-batch requests
• Within a batch, priority is first given to row buffer hits, then to threads with a higher “rank”, then to older requests
• Rank is computed based on the thread’s memory intensity; low-intensity threads are given higher priority; this policy improves batch completion time
• By using rank, requests from a thread are serviced in parallel; hence, parallelism-aware batch scheduling

TCM (Kim et al., MICRO 2010)

• Organize threads into latency-sensitive and bw-sensitive clusters based on memory intensity; the former gets higher priority
• Within the bw-sensitive cluster, priority is based on rank
• Rank is 
determined based on the “niceness” of a thread, and the rank is periodically shuffled with insertion shuffling or random shuffling (the former is used if there is a big gap in niceness)
• Threads with low row buffer hit rates and high bank-level parallelism are considered “nice” to others
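To make the basic scheduling policies concrete, here is a minimal sketch (not from the lecture) of First-Ready-FCFS: among pending requests to a bank, a request that hits the currently open row is served first; if there are no hits, the oldest request is served. The `Request` type and `frfcfs_pick` function are illustrative names, and real controllers must also honor the timing constraints mentioned above.

```python
from collections import namedtuple

# A pending memory request: arrival order and the DRAM row it targets.
Request = namedtuple("Request", ["arrival", "row"])

def frfcfs_pick(queue, open_row):
    """First-Ready-FCFS: prefer requests that hit the open row buffer;
    among the candidates, serve the oldest (lowest arrival time)."""
    hits = [r for r in queue if r.row == open_row]
    candidates = hits if hits else queue
    return min(candidates, key=lambda r: r.arrival)

# Example: row 7 is open, so a younger row-buffer hit bypasses an older miss.
q = [Request(0, 3), Request(1, 7), Request(2, 7)]
pick = frfcfs_pick(q, open_row=7)
```

Note that this is exactly the behavior STFM reacts to: a thread with good row locality can keep bypassing an older thread's misses, which is where the fairness override comes in.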
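The STFM slowdown and unfairness metrics can be sketched as follows, using the definitions from the slides (S = Talone / Tshared over ROB-stall cycles, unfairness = Smax / Smin). This is a hedged illustration rather than the paper's implementation, and the 1.10 threshold is an arbitrary example value.

```python
def stfm_unfairness(stall_cycles, threshold=1.10):
    """Given per-thread (T_alone, T_shared) ROB-stall cycle counts,
    compute each thread's slowdown S = T_alone / T_shared and the
    system unfairness S_max / S_min, as defined in the lecture.
    Returns (unfairness, override), where override is True when
    unfairness exceeds the threshold, i.e. when thread priorities
    should override the usual FR-FCFS priorities."""
    slowdowns = [t_alone / t_shared for t_alone, t_shared in stall_cycles]
    unfairness = max(slowdowns) / min(slowdowns)
    return unfairness, unfairness > threshold

# Two threads: one barely affected by sharing, one stalled twice as long.
u, override = stfm_unfairness([(100, 110), (100, 200)])
```

In a real controller the Talone terms are not measured directly; as the slide notes, they must be estimated with book-keeping about which accesses delayed other threads' critical requests.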