Active Disks: Programming Model, Algorithms and Evaluation
By Anurag Acharya, Mustafa Uysal, Joel Saltz
Published in ASPLOS VIII (1998)
Presented by Ryan Farrell

Overview
• Motivation
  – Large data sets are growing more quickly than processor speeds, but customers want to process their data sets in the same amount of time
    • Customers double their storage every 5 months
    • CPU speeds double every 18 months [Moore’s law]
  – While CPUs are “cheap” compared to storage ($100 vs. $1100-1750), a CPU can only keep a small number of high-performance drives busy
• Proposed Solution
  – Offload processing onto the disk drives, using each drive’s embedded processor

Active Disks Architecture
• Host-resident Component
  – Primarily for coordination, scheduling and combining results
• Disk-resident Component (“Disklets”)
  – Downloaded from the host; does the lion’s share of the processing
  – Stream processor:
    • Reads input streams (files) from disk
    • Writes output streams (files) to disk
    • Optional long-term scratch space

Disk/Host-level OS
• DiskOS
  – Memory Management: allocation and management of buffers
  – Stream Communication: possible to overlap data movement and communication
  – Disklet Scheduling: disklets are ready to run whenever data is available
• Host-level OS Support
  – Installation of disklets
  – Management of host-resident buffers (streams)

Disklet/Host Code

Algorithms (1/6)
• SQL SELECT
  – Filter tuples from a relation based on a user-specified predicate
  – Conventional Disk Algorithm
    • Read all tuples from disk, keep those that match the filter
  – Active Disk Algorithm
    • On disk, read tuples from the input stream (file) and write matching tuples to the output file
    • Send only the matching results when finished (or partial results when the output file fills up)

Algorithms (2/6)
• SQL GROUP-BY
  – Compute a vector of aggregates indexed by a list of attributes
  – Conventional Disk Algorithm
    • Read all tuples from disk, accumulating group-by results
  – Active Disk Algorithm
    • Perform local group-bys
    • When out of space, ship the partial result to the host

Algorithms (3/6)
• External Sort
  – Sort database tuples (details?)
  – Conventional Disk Algorithm (NOWsort)
    • Reader thread reads data from disk and moves tuple pointers to buckets
    • Writer thread sorts each bucket with a partial-radix sort
  – Active Disk Algorithm (NOWsort)
    • Partitioner divides records into buckets; when a bucket fills, it is sent to the host
    • When a host buffer fills, the host sends it to the Sorter, which sorts it and writes it to the output stream
    • Merging is done locally on disk; final results are sent to the host

Algorithms (4/6)
• Datacube
  – In effect, compute group-bys for all possible combinations of a list of attributes
  – Conventional Disk Algorithm (PipeHash)
    • Schedule group-bys as a sequence of pipelines, each computed from disk and written back to disk for the next pipeline
  – Active Disk Algorithm
    • Separate disklet for each pipeline
    • Perform local group-bys; the host accumulates partial results and stores them for use by later pipelines

Algorithms (5/6)
• Image Convolution (Batch Processing)
  – Image processing operations such as smoothing, sharpening and edge detection
  – Conventional Disk Algorithm
    • Read all images, concatenate them into a single file and stripe it across disks (then process?)
  – Active Disk Algorithm
    • Convolves individual images locally
    • Sends each processed image to the host

Algorithms (6/6)
• Compositing Satellite Images
  – Project individual satellite images onto a much larger composite image (image registration)
  – Conventional Disk Algorithm (from NASA)
    • Reads large chunks of individual satellite images
    • Composites with an accumulator for each output pixel
  – Active Disk Algorithm
    • Perform pre-processing and mapping at the disk
    • Perform most of the composition locally

Performance Comparison (Simulated)
• [Four slides of simulation result charts; figures not included in this text.]

Results/Conclusion
• Theoretical Performance
  – Active Disk architecture outperforms conventional disks by 1-3x for a 4-disk configuration and by 3-30x for a 32-disk configuration
• Today (2005)
  – Disks are relatively cheap ($500/TB; $0.15/GB on Black Friday) and have larger caches and more powerful controllers
  – Though it seems like a great idea, I couldn’t find evidence of mainstream usage

What are the Authors doing now?
• Who (at time of publication): currently
• Anurag Acharya (postdoc under Saltz): Professor at UC Santa Barbara
  – Current Research: “Rapidly Evolvable Network Services”
• Mustafa Uysal (PhD under Saltz): now at HP Labs
  – Current Research: “Design, implementation and analysis of large-scale storage systems and distributed, data-intensive computing”
• Joel Saltz (Professor): Chair of Ohio State Biomedical Informatics Department (joint appointments at JHU, UMIACS)
  – Current Research: “Data Intensive and Grid Computing, Distributed and Parallel Systems, High End Medical Applications”

Additional Work on Active Disks
• CMU (dates back to ’97)
  – http://www.pdl.cmu.edu/Active/
• HPCA 2000 (Best Student Paper)
  – http://www.hpl.hp.com/personal/Mustafa_Uysal/papers/2000-hpca/hpca.pdf
• “NAS (Network Attached Storage)” –
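
As a rough illustration of the active-disk SELECT and GROUP-BY algorithms from the Algorithms (1/6) and (2/6) slides, the sketch below models a disklet as a Python generator that ships a batch to the host whenever its local space fills up. This is not the authors' disklet code: the names `select_disklet`, `groupby_disklet` and `host_combine`, the `buffer_capacity` and `max_groups` limits, and the choice of SUM as the aggregate are all assumptions made for illustration.

```python
from collections import defaultdict


def select_disklet(tuples, predicate, buffer_capacity):
    """Filter tuples on the 'disk'; yield a batch each time the output buffer fills.

    buffer_capacity is a hypothetical stand-in for the size of the output file.
    """
    buffer = []
    for t in tuples:
        if predicate(t):
            buffer.append(t)
            if len(buffer) == buffer_capacity:
                yield buffer          # ship partial results to the host
                buffer = []
    if buffer:
        yield buffer                  # ship whatever is left when finished


def groupby_disklet(tuples, key, value, max_groups):
    """Accumulate local SUM aggregates; ship the partial result when out of space.

    max_groups is a hypothetical stand-in for the disk's limited memory.
    """
    partial = {}
    for t in tuples:
        k = key(t)
        if k not in partial and len(partial) == max_groups:
            yield partial             # out of space: ship partial result to host
            partial = {}
        partial[k] = partial.get(k, 0) + value(t)
    if partial:
        yield partial


def host_combine(partials):
    """Host-resident component: merge partial aggregates from the disklets."""
    totals = defaultdict(int)
    for partial in partials:
        for k, v in partial.items():
            totals[k] += v
    return dict(totals)


rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
batches = list(select_disklet(rows, lambda r: r[1] > 1, buffer_capacity=2))
totals = host_combine(
    groupby_disklet(rows, key=lambda r: r[0], value=lambda r: r[1], max_groups=2)
)
```

In both cases only filtered or aggregated data crosses the interconnect, which is the point of the architecture: the host combines small partial results instead of reading every tuple itself.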