Lecture 13: DRAM Innovations

• Today: energy efficiency, row buffer management, scheduling

Latency and Power Wall

• Power wall: 25-40% of datacenter power can be attributed to the DRAM system
• Latency and power can both be improved by employing smaller arrays; incurs a penalty in density and cost
• Latency and power can both be improved by increasing the row buffer hit rate; requires intelligent mapping of data to rows, clever scheduling of requests, etc.
• Power can be reduced by minimizing overfetch – either read fewer chips or read parts of a row; incurs penalties in area or bandwidth

Overfetch

• Overfetch is caused by multiple factors:
  – Each array is large (fewer peripherals → more density)
  – Involving more chips per access → more data transfer → pin bandwidth
  – More overfetch → more prefetch; helps apps with locality
  – Involving more chips per access → less data loss when a chip fails → lower overhead for reliability

Re-Designing Arrays (Udipi et al., ISCA’10)

Selective Bitline Activation

• Two papers in 2010: Udipi et al., ISCA’10, and Cooper-Balis and Jacob, IEEE Micro
• Additional logic per array so that only the relevant bitlines are read out
• Essentially results in finer-grain partitioning of the DRAM arrays

Rank Subsetting

• Instead of using all chips in a rank to read out 64-bit words every cycle, form smaller parallel ranks
• Increases data transfer time; reduces the size of the row buffer
• But, lower energy per row read and compatible with modern DRAM chips
• Increases the number of banks and hence promotes parallelism (reduces queuing delays)
• Mini-Rank, MICRO’08; MC-DIMM, SC’09

Row Buffer Management

• Open Page policy: maximizes row buffer hits, minimizes energy
• Close Page policy: helps performance when there is limited locality
• Hybrid policies: can close a row buffer after it has served its utility; lots of ways to predict utility: time, accesses, locality counters for a bank, etc.

Micro-Pages (Sudan et al., ASPLOS’10)

• Organize data across banks to maximize locality in a row buffer
• Key observation: most 
locality is restricted to a small portion of an OS page
• Such hot micro-pages are identified with hardware counters and co-located on the same row
• Requires hardware indirection to a page’s new location
• Works well only if most activity is confined to a few micro-pages

Scheduling Policies

• The memory controller must manage several timing constraints and issue a command when all resources are available
• It must also maximize row buffer hit rates, fairness, and throughput
• Reads are typically given priority over writes; the write buffer must be drained when it is close to full; changing the direction of the bus requires a 5-10 ns delay
• Basic policies: FCFS, First-Ready-FCFS (prioritize row buffer hits)

STFM (Mutlu and Moscibroda, MICRO’07)

• When multiple threads run together, threads with row buffer hits are prioritized by FR-FCFS
• Each thread has a slowdown: S = Talone / Tshared, where T is the number of cycles the ROB is stalled waiting for memory
• Unfairness is estimated as Smax / Smin
• If unfairness is higher than a threshold, thread priorities override other priorities (Stall Time Fair Memory scheduling)
• Estimation of Talone requires some book-keeping: does an access delay critical requests from other threads?

PAR-BS (Mutlu and Moscibroda, ISCA’08)

• A batch of requests (per bank) is formed: each thread can only contribute R requests to this batch; batch requests have priority over non-batch requests
• Within a batch, priority is first given to row buffer hits, then to threads with a higher “rank”, then to older requests
• Rank is computed based on the thread’s memory intensity; low-intensity threads are given higher priority; this policy improves batch completion time
• By using rank, requests from a thread are serviced in parallel; hence, parallelism-aware batch scheduling

TCM (Kim et al., MICRO 2010)

• Organize threads into latency-sensitive and bw-sensitive clusters based on memory intensity; the former gets higher priority
• Within the bw-sensitive cluster, priority is based on rank
• Rank is 
determined based on the “niceness” of a thread, and the rank is periodically shuffled with insertion shuffling or random shuffling (the former is used if there is a big gap in niceness)
• Threads with low row buffer hit rates and high bank-level parallelism are considered “nice” to others
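To make the basic scheduling policies concrete, here is a minimal sketch (not from the lecture) of First-Ready-FCFS: among pending requests to a bank, a request that hits the currently open row is served first; if there are no hits, the oldest request is served. The `Request` type and `frfcfs_pick` function are illustrative names, and real controllers must also honor the timing constraints mentioned above.

```python
from collections import namedtuple

# A pending memory request: arrival order and the DRAM row it targets.
Request = namedtuple("Request", ["arrival", "row"])

def frfcfs_pick(queue, open_row):
    """First-Ready-FCFS: prefer requests that hit the open row buffer;
    among the candidates, serve the oldest (lowest arrival time)."""
    hits = [r for r in queue if r.row == open_row]
    candidates = hits if hits else queue
    return min(candidates, key=lambda r: r.arrival)

# Example: row 7 is open, so a younger row-buffer hit bypasses an older miss.
q = [Request(0, 3), Request(1, 7), Request(2, 7)]
pick = frfcfs_pick(q, open_row=7)
```

Note that this is exactly the behavior STFM reacts to: a thread with good row locality can keep bypassing an older thread's misses, which is where the fairness override comes in.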
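The STFM slowdown and unfairness metrics can be sketched as follows, using the definitions from the slides (S = Talone / Tshared over ROB-stall cycles, unfairness = Smax / Smin). This is a hedged illustration rather than the paper's implementation, and the 1.10 threshold is an arbitrary example value.

```python
def stfm_unfairness(stall_cycles, threshold=1.10):
    """Given per-thread (T_alone, T_shared) ROB-stall cycle counts,
    compute each thread's slowdown S = T_alone / T_shared and the
    system unfairness S_max / S_min, as defined in the lecture.
    Returns (unfairness, override), where override is True when
    unfairness exceeds the threshold, i.e. when thread priorities
    should override the usual FR-FCFS priorities."""
    slowdowns = [t_alone / t_shared for t_alone, t_shared in stall_cycles]
    unfairness = max(slowdowns) / min(slowdowns)
    return unfairness, unfairness > threshold

# Two threads: one barely affected by sharing, one stalled twice as long.
u, override = stfm_unfairness([(100, 110), (100, 200)])
```

In a real controller the Talone terms are not measured directly; as the slide notes, they must be estimated with book-keeping about which accesses delayed other threads' critical requests.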