Big-data Computing
B. Ramamurthy
01/13/19 — Bina Ramamurthy 2011

Reference
• Apache Hadoop: http://hadoop.apache.org/ and http://wiki.apache.org/hadoop/
• White, T. Hadoop: The Definitive Guide, 2nd edition. O'Reilly, 2010.
• Dean, J. and Ghemawat, S. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107–113.

Background
• The problem space is experiencing an explosion of data.
• The solution space is seeing the emergence of multi-core, virtualization, and cloud computing.
• Traditional file systems are unable to handle this data deluge.
• The big-data computing model comprises:
  • the MapReduce programming model (algorithm),
  • the Google File System and the Hadoop Distributed File System (data structure),
  • Microsoft Dryad (a large-scale database processing model).

Data Deluge: smallest to largest
• Bioinformatics data: from the roughly 3.3 billion base pairs in a human genome to huge numbers of protein sequences and the analysis of their behavior.
• The internet: web logs, Facebook, Twitter, maps, blogs, and more to analyze.
• Financial applications that analyze volumes of data for trends and other deeper knowledge.
• Health care: huge amounts of patient, drug, and treatment data.
• The universe: Hubble ultra-deep-field images show hundreds of galaxies, each with billions of stars.

Examples
Computational models that focus on data: large-scale and/or complex data.

Example 1: a web log
fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] "GET /contacts.html HTTP/1.0" 200 4595 "-" "FAST-WebCrawler/2.1-pre2 ([email protected])"
fcrawler.looksmart.com - - [26/Apr/2000:00:17:19 -0400] "GET /news/news.html HTTP/1.0" 200 16716 "-" "FAST-WebCrawler/2.1-pre2 ([email protected])"
ppp931.on.bellglobal.com - - [26/Apr/2000:00:16:12 -0400] "GET /download/windows/asctab31.zip HTTP/1.0" 200 1540096 "http://www.htmlgoodies.com/downloads/freeware/webdevelopment/15.html" "Mozilla/4.7 [en]C-SYMPA (Win95; U)"
123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:47 -0400] "GET /asctortf/ HTTP/1.0" 200 8130 "http://search.netscape.com/Computers/Data_Formats/Document/Text/RTF" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:48 -0400]
"GET /pics/5star2000.gif HTTP/1.0" 200 4005 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:50 -0400] "GET /pics/5star.gif HTTP/1.0" 200 1031 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:51 -0400] "GET /pics/a2hlogo.jpg HTTP/1.0" 200 4282 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:51 -0400] "GET /cgi-bin/newcount?jafsof3&width=4&font=digital&noshow HTTP/1.0" 200 36 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"

Example 2: Climate/weather data modeling

Problem Space
[Figure: applications plotted by data scale (kilo through mega, giga, tera, peta, exa) against compute scale (MFLOPS, GFLOPS, TFLOPS, PFLOPS): payroll, digital signal processing, weblog mining, business analytics, realtime systems, and massively multiplayer online games (MMOG). Other variables: communication bandwidth.]

Top Ten Largest Databases
Ref: http://www.businessintelligencelowdown.com/2007/02/top_10_largest_.html

Processing Granularity
From small data sizes to large:
• Pipelined, instruction level
• Concurrent, thread level
• Service, object level
• Indexed, file level
• Mega, block level
• Virtual, system level

Traditional Storage Solutions
[Figure]

Solution Space
[Figure]

Google File System
• The internet introduced a new challenge in the form of web logs and web crawlers' data: large scale, "peta scale".
• Observe that this type of data has a uniquely different characteristic from transactional or "customer order" data: it is "write once read many (WORM)". Other examples: privacy-protected healthcare and patient information; historical financial data; other historical data.
• Google exploited this characteristic in its Google File System (GFS).

Data Characteristics
• Streaming data access: applications need streaming access to data, with batch
processing rather than interactive user access.
• Large data sets and files: gigabytes, terabytes, petabytes, even exabytes in size.
• High aggregate data bandwidth.
• Scales to hundreds of nodes in a cluster.
• Tens of millions of files in a single instance.
• Write-once-read-many: a file, once created, written, and closed, need not be changed; this assumption simplifies coherency.
• WORM inspired a new programming model, the MapReduce programming model.
• Multiple readers can work on the read-only data concurrently.

The Context: Big-data
• Data mining the huge amounts of data collected in a wide range of domains, from astronomy to healthcare, has become essential for planning and performance.
• We are in a knowledge economy: data is an important asset to any organization.
• Discovery of knowledge; enabling discovery; annotation of data.
• Complex computational models.
• No single environment is good enough: we need elastic, on-demand capacity.
• We are looking at newer ...
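The slides note that WORM data inspired the MapReduce programming model. As a minimal single-process sketch of that idea in plain Python (not the Hadoop API; `map_phase`, `reduce_phase`, and the regex are illustrative names of our own), here is the web-log example above reduced to counting requests per host:

```python
import re
from collections import defaultdict

# Illustrative map/reduce over Apache-style log lines: count GET
# requests per host. A one-process sketch of the idea, not Hadoop code.
LOG_PATTERN = re.compile(r'^(\S+) .*"GET (\S+) HTTP')

def map_phase(line):
    """Map: emit a (host, 1) pair for each well-formed log line."""
    m = LOG_PATTERN.match(line)
    if m:
        yield m.group(1), 1

def reduce_phase(pairs):
    """Reduce: sum the counts for each host key."""
    counts = defaultdict(int)
    for host, n in pairs:
        counts[host] += n
    return dict(counts)

log = [
    'fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] "GET /contacts.html HTTP/1.0" 200 4595',
    '123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248',
    '123.123.123.123 - - [26/Apr/2000:00:23:47 -0400] "GET /asctortf/ HTTP/1.0" 200 8130',
]

pairs = [kv for line in log for kv in map_phase(line)]
print(reduce_phase(pairs))  # {'fcrawler.looksmart.com': 1, '123.123.123.123': 2}
```

In Hadoop, the map and reduce functions run on many nodes and the framework shuffles the intermediate (key, value) pairs between them; the sketch keeps everything in one process only to show the data flow.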
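The streaming-access characteristic described above can be sketched in plain Python (this is not an HDFS client; the 64 MB figure was the default HDFS block size in early Hadoop releases): a reader consumes a write-once file as a forward-only sequence of fixed-size blocks rather than seeking randomly within it.

```python
import io

# Forward-only, block-oriented reading: the consumer sees the file as
# a stream of fixed-size blocks and never seeks backwards.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the early-Hadoop default block size

def stream_blocks(f, block_size=BLOCK_SIZE):
    """Yield successive blocks of the file until EOF."""
    while True:
        block = f.read(block_size)
        if not block:
            return
        yield block

# A 150-byte in-memory "file" read in 64-byte blocks, for demonstration.
data = io.BytesIO(b"x" * 150)
print([len(b) for b in stream_blocks(data, block_size=64)])  # [64, 64, 22]
```

The same access pattern is why HDFS can favor high aggregate bandwidth over low-latency random reads: every client touches each block at most once, in order.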