Big-data Computing
B. Ramamurthy
01/13/19 — Bina Ramamurthy 2011

Reference
• Apache Hadoop: http://hadoop.apache.org/ and http://wiki.apache.org/hadoop/
• White, T. Hadoop: The Definitive Guide, 2nd edition. O'Reilly, 2010.
• Dean, J. and Ghemawat, S. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107–113.

Background
• The problem space is experiencing an explosion of data.
• The solution space is seeing the emergence of multi-core, virtualization, and cloud computing.
• Traditional file systems are unable to handle this data deluge.
• The big-data computing model comprises:
  • the MapReduce programming model (algorithm),
  • the Google File System and the Hadoop Distributed File System (data structure),
  • Microsoft Dryad (a large-scale database processing model).

Data Deluge: smallest to largest
• Bioinformatics data: from the roughly 3.3 billion base pairs in a human genome to huge numbers of protein sequences and the analysis of their behavior.
• The internet: web logs, Facebook, Twitter, maps, blogs, and more to analyze.
• Financial applications that analyze volumes of data for trends and other deeper knowledge.
• Health care: huge amounts of patient, drug, and treatment data.
• The universe: Hubble ultra-deep-field images show hundreds of galaxies, each with billions of stars.

Examples
Computational models that focus on data: large-scale and/or complex data.

Example 1: a web log
fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] "GET /contacts.html HTTP/1.0" 200 4595 "-" "FAST-WebCrawler/2.1-pre2 ([email protected])"
fcrawler.looksmart.com - - [26/Apr/2000:00:17:19 -0400] "GET /news/news.html HTTP/1.0" 200 16716 "-" "FAST-WebCrawler/2.1-pre2 ([email protected])"
ppp931.on.bellglobal.com - - [26/Apr/2000:00:16:12 -0400] "GET /download/windows/asctab31.zip HTTP/1.0" 200 1540096 "http://www.htmlgoodies.com/downloads/freeware/webdevelopment/15.html" "Mozilla/4.7 [en]C-SYMPA (Win95; U)"
123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:47 -0400] "GET /asctortf/ HTTP/1.0" 200 8130 "http://search.netscape.com/Computers/Data_Formats/Document/Text/RTF" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:48 -0400]
"GET /pics/5star2000.gif HTTP/1.0" 200 4005 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:50 -0400] "GET /pics/5star.gif HTTP/1.0" 200 1031 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:51 -0400] "GET /pics/a2hlogo.jpg HTTP/1.0" 200 4282 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:51 -0400] "GET /cgi-bin/newcount?jafsof3&width=4&font=digital&noshow HTTP/1.0" 200 36 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"

Example 2: Climate/weather data modeling

Problem Space
[Figure: applications plotted by data scale (kilo through mega, giga, tera, peta, exa) against compute scale (MFLOPS, GFLOPS, TFLOPS, PFLOPS): payroll, digital signal processing, weblog mining, business analytics, realtime systems, and massively multiplayer online games (MMOG). Other variables: communication bandwidth.]

Top Ten Largest Databases
Ref: http://www.businessintelligencelowdown.com/2007/02/top_10_largest_.html

Processing Granularity
From small data sizes to large:
• Pipelined, instruction level
• Concurrent, thread level
• Service, object level
• Indexed, file level
• Mega, block level
• Virtual, system level

Traditional Storage Solutions
[Figure]

Solution Space
[Figure]

Google File System
• The internet introduced a new challenge in the form of web logs and web crawlers' data: large scale, "peta scale".
• Observe that this type of data has a uniquely different characteristic from transactional or "customer order" data: it is "write once read many (WORM)". Other examples: privacy-protected healthcare and patient information; historical financial data; other historical data.
• Google exploited this characteristic in its Google File System (GFS).

Data Characteristics
• Streaming data access: applications need streaming access to data, with batch
processing rather than interactive user access.
• Large data sets and files: gigabytes, terabytes, petabytes, even exabytes in size.
• High aggregate data bandwidth.
• Scales to hundreds of nodes in a cluster.
• Tens of millions of files in a single instance.
• Write-once-read-many: a file, once created, written, and closed, need not be changed; this assumption simplifies coherency.
• WORM inspired a new programming model, the MapReduce programming model.
• Multiple readers can work on the read-only data concurrently.

The Context: Big-data
• Data mining the huge amounts of data collected in a wide range of domains, from astronomy to healthcare, has become essential for planning and performance.
• We are in a knowledge economy: data is an important asset to any organization.
• Discovery of knowledge; enabling discovery; annotation of data.
• Complex computational models.
• No single environment is good enough: we need elastic, on-demand capacity.
• We are looking at newer ...
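The slides note that WORM data inspired the MapReduce programming model. As a minimal single-process sketch of that idea in plain Python (not the Hadoop API; `map_phase`, `reduce_phase`, and the regex are illustrative names of our own), here is the web-log example above reduced to counting requests per host:

```python
import re
from collections import defaultdict

# Illustrative map/reduce over Apache-style log lines: count GET
# requests per host. A one-process sketch of the idea, not Hadoop code.
LOG_PATTERN = re.compile(r'^(\S+) .*"GET (\S+) HTTP')

def map_phase(line):
    """Map: emit a (host, 1) pair for each well-formed log line."""
    m = LOG_PATTERN.match(line)
    if m:
        yield m.group(1), 1

def reduce_phase(pairs):
    """Reduce: sum the counts for each host key."""
    counts = defaultdict(int)
    for host, n in pairs:
        counts[host] += n
    return dict(counts)

log = [
    'fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] "GET /contacts.html HTTP/1.0" 200 4595',
    '123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248',
    '123.123.123.123 - - [26/Apr/2000:00:23:47 -0400] "GET /asctortf/ HTTP/1.0" 200 8130',
]

pairs = [kv for line in log for kv in map_phase(line)]
print(reduce_phase(pairs))  # {'fcrawler.looksmart.com': 1, '123.123.123.123': 2}
```

In Hadoop, the map and reduce functions run on many nodes and the framework shuffles the intermediate (key, value) pairs between them; the sketch keeps everything in one process only to show the data flow.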
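The streaming-access characteristic described above can be sketched in plain Python (this is not an HDFS client; the 64 MB figure was the default HDFS block size in early Hadoop releases): a reader consumes a write-once file as a forward-only sequence of fixed-size blocks rather than seeking randomly within it.

```python
import io

# Forward-only, block-oriented reading: the consumer sees the file as
# a stream of fixed-size blocks and never seeks backwards.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the early-Hadoop default block size

def stream_blocks(f, block_size=BLOCK_SIZE):
    """Yield successive blocks of the file until EOF."""
    while True:
        block = f.read(block_size)
        if not block:
            return
        yield block

# A 150-byte in-memory "file" read in 64-byte blocks, for demonstration.
data = io.BytesIO(b"x" * 150)
print([len(b) for b in stream_blocks(data, block_size=64)])  # [64, 64, 22]
```

The same access pattern is why HDFS can favor high aggregate bandwidth over low-latency random reads: every client touches each block at most once, in order.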