Data Intensive Computing
B. Ramamurthy (4/13/2010)
This work is partially supported by NSF DUE Grants #0737243 and #0920335.

Topics for discussion
• Problem space: the explosion of data
• Solution space: the emergence of multi-core, virtualization, and cloud computing
• Inability of traditional file systems to handle the data deluge
• Emerging systems: the Google File System (GFS)
• Salient features of GFS
• The big-data computing model
• The MapReduce programming model (algorithm)
• The Hadoop Distributed File System (data structure)
• Cloud computing and its relevance to big-data and data-intensive computing (next class)

Data-Computation Continuum
• Compute intensive. Ex: computation of the digits of pi (http://bellard.org/pi/pi2700e9/pipcrecord.pdf)
• Data intensive. Ex: analyzing and indexing web pages
• In between: machine learning; financial engineering; the Human Genome Project; the Sloan Digital Sky Survey; ...

More dimensions
[Figure: applications plotted against data scale (K, M, G, T, P) and compute scale (MFLOPS, GFLOPS, TFLOPS, PFLOPS): payroll, digital signal processing, weblog mining, business analytics, realtime systems, massively multiplayer online games (MMOG). Other variables: communication bandwidth, ...]

Top Ten Largest Databases
[Figure: bar chart of the top ten largest databases (2007), in terabytes: LOC, CIA, Amazon, YouTube, ChoicePoint, Sprint, Google, AT&T, NERSC, Climate.]
Ref: http://www.businessintelligencelowdown.com/2007/02/top_10_largest_.html

Solution Processing Granularity
From small data sizes to large:
• Pipelined: instruction level
• Concurrent: thread level
• Service: object level
• Indexed: file level
• Mega: block level
• Virtual: system level

Traditional Storage Solutions
• Off-system/online storage: secondary memory
• File system abstraction
• Offline storage: tertiary memory
• RAID: Redundant Array of Inexpensive Disks
• NAS: Network-Attached Storage
• SAN: Storage Area Network

Database and Database Management System
• Data source: transactional
• Database server: a relational DB or similar foundation
• Tables, rows, result sets, SQL (see the sketch below)
• ODBC: Open Database Connectivity
• A very successful business model: Oracle, DB2, MySQL, and others
• Persistence models: EJB, DAO, ADO (look up the abbreviations in whatever enterprise-model documentation you are working with)
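To make the tables/rows/result-set model concrete, here is a minimal sketch of the traditional client/server access pattern in Java using JDBC, Java's counterpart to the ODBC interface named above. This is an illustration, not part of the original deck: the server URL, credentials, and the orders table are hypothetical placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class OrderLookup {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection URL and credentials; a real deployment
        // would point at an actual MySQL/Oracle/DB2 server.
        String url = "jdbc:mysql://dbserver.example.edu:3306/shop";
        try (Connection conn = DriverManager.getConnection(url, "appuser", "secret");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT id, total FROM orders WHERE customer = ?")) {
            stmt.setString(1, "alice");
            // The ResultSet is the row-at-a-time cursor abstraction that
            // ODBC/JDBC expose over a relational table.
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + "\t" + rs.getDouble("total"));
                }
            }
        }
    }
}
```

Note the model's assumptions: a single transactional server, random-access reads and updates, and result sets small enough to pull across one connection. These are exactly the assumptions that the peta-scale, write-once data discussed next breaks.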
Distributed File System (DFS)
• A dedicated server manages the files for a compute environment.
• For example, nickelback.cse.buffalo.edu is your file server, which is why we did not want you to run user applications on that machine.
• A DFS addresses various transparencies: location transparency, sharing, performance, etc.
• The single largest file is approximately a few terabytes, with a typical page size of 4-8 KB.
• What is the page table size for the largest file? Ex: 16 TB / 8 KB = 2^31, about 2G entries; at roughly 4 bytes per entry, that is on the order of 8 GB of page table for a single file.

Emerging Systems

On to the Google File System
• The Internet introduced a new challenge in the form of web logs and web crawler data: large, "peta-scale" data.
• Observe that this type of data has a uniquely different characteristic from transactional data or the "order" data on amazon.com: it is "write once". Other examples: HIPAA-protected healthcare and patient information, historical financial data, and any historical data.
• Google exploited this characteristic in its Google File System (S. Ghemawat et al.).

Data Characteristics
• Streaming data access: applications need streaming access to data.
• Batch processing rather than interactive user access.
• Large data sets and files: gigabytes to terabytes in size.
• High aggregate data bandwidth.
• Scales to hundreds of nodes in a cluster.
• Tens of millions of files in a single instance.
• Write-once-read-many (WORM): a file, once created, written, and closed, need not be changed. This assumption simplifies coherency.
• WORM inspired a new programming model called the MapReduce programming model (see the word-count sketch at the end of this section).

The Big-data Computing System

The Context: Big-data
• Man on the moon with 32 KB (1969); my laptop had 2 GB of RAM (2009).
• Google collected 270 PB of data in a month (2007), 20,000 PB a day (2008).
• The 2010 census data is expected to be a huge gold mine of information.
• Data mining the huge amounts of data collected in domains ranging from astronomy to healthcare has become essential for planning and performance.
• We are in a knowledge economy:
  – Data is an important asset to any organization.
  – Discovery of knowledge; enabling discovery; annotation of data.
  – Complex computational models.
  – No single environment is good enough: we need elastic, on-demand capacities.
• We are looking at newer
  – programming models, and
  – supporting algorithms and data structures.
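The deck develops a word-counter example for this model in the slides that follow. As a forward pointer, here is a minimal sketch of that classic word count written against Hadoop's Java MapReduce API. It is an illustrative sketch under standard Hadoop 2.x assumptions, not the deck's own listing; the input and output paths are whatever you pass on the command line.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for every word in the input split, emit the pair (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A typical (hypothetical) invocation on a cluster would be `hadoop jar wordcount.jar WordCount /user/you/input /user/you/output`. The map step emits (word, 1) pairs at the data's location and the reduce step sums them per word, which is exactly the divide-and-conquer pattern that WORM data permits.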