Bigtable, Hive, and Pig
Data-Intensive Information Processing Applications, Session #12
Jimmy Lin
University of Maryland
Tuesday, April 27, 2010

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Source: Wikipedia (Japanese rock garden)

Today’s Agenda
- Bigtable
- Hive
- Pig

Bigtable

Data Model
- A table in Bigtable is a sparse, distributed, persistent multidimensional sorted map
- Map indexed by a row key, column key, and a timestamp
  - (row:string, column:string, time:int64) → uninterpreted byte array
- Supports lookups, inserts, deletes
  - Single-row transactions only
Image Source: Chang et al., OSDI 2006

Rows and Columns
- Rows maintained in sorted lexicographic order
  - Applications can exploit this property for efficient row scans
  - Row ranges dynamically partitioned into tablets
- Columns grouped into column families
  - Column key = family:qualifier
  - Column families provide locality hints
  - Unbounded number of columns

Bigtable Building Blocks
- GFS
- Chubby
- SSTable

SSTable
- Basic building block of Bigtable
- Persistent, ordered immutable map from keys to values
  - Stored in GFS
- Sequence of blocks on disk plus an index for block lookup
  - Can be completely mapped into memory
- Supported operations:
  - Look up value associated with key
  - Iterate key/value pairs within a key range
[Figure: an SSTable drawn as an index plus a sequence of 64K blocks]
Source: Graphic from slides by Erik Paulson

Tablet
- Dynamically partitioned range of rows
- Built from multiple SSTables
[Figure: a tablet (Start: aardvark, End: apple) built from two SSTables, each an index plus 64K blocks]
Source: Graphic from slides by Erik Paulson

Table
- Multiple tablets make up the table
- SSTables can be shared
[Figure: four SSTables under two tablets, one spanning aardvark to apple and one spanning apple_two_E to boat]
Source: Graphic from slides by Erik Paulson

Architecture
- Client library
- Single master server
- Tablet servers

Bigtable Master
- Assigns tablets to tablet servers
- Detects addition and expiration of tablet servers
- Balances tablet server load
- Handles garbage collection
- Handles schema changes

Bigtable Tablet Servers
- Each tablet server manages a set of tablets
  - Typically between ten and a thousand tablets
  - Each 100-200 MB by default
- Handles read and write requests to the tablets
- Splits tablets that have grown too large

Tablet Location
- Upon discovery, clients cache tablet locations
Image Source: Chang et al., OSDI 2006

Tablet Assignment
- Master keeps track of:
  - Set of live tablet servers
  - Assignment of tablets to tablet servers
  - Unassigned tablets
- Each tablet is assigned to one tablet server at a time
  - Tablet server maintains an exclusive lock on a file in Chubby
  - Master monitors tablet servers and handles assignment
- Changes to tablet structure
  - Table creation/deletion (master initiated)
  - Tablet merging (master initiated)
  - Tablet splitting (tablet server initiated)

Tablet Serving
Image Source: Chang et al., OSDI 2006
“Log Structured Merge Trees”

Compactions
- Minor compaction
  - Converts the memtable into an SSTable
  - Reduces memory usage and log traffic on restart
- Merging compaction
  - Reads the contents of a few SSTables and the memtable, and writes out a new SSTable
  - Reduces number of SSTables
- Major compaction
  - Merging compaction that results in only one SSTable
  - No deletion records, only live data

Bigtable Applications
- Data source and data sink for MapReduce
- Google’s web crawl
- Google Earth
- Google Analytics

Lessons Learned
- Fault tolerance is hard
- Don’t add functionality before understanding its use
  - Single-row transactions appear to be sufficient
- Keep it simple!

HBase
- Open-source clone of Bigtable
- Implementation hampered by lack of file append in HDFS
Image Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

Hive and Pig

Need for High-Level Languages
- Hadoop is great for large-data processing!
  - But writing Java programs for everything is verbose and slow
  - Not everyone wants to (or can) write Java code
- Solution: develop higher-level data processing languages
  - Hive: HQL is like SQL
  - Pig: Pig Latin is a bit like Perl

Hive and Pig
- Hive: data warehousing application in Hadoop
  - Query language is HQL, variant of SQL
  - Tables stored on HDFS as flat files
  - Developed by Facebook, now open source
- Pig: large-scale data processing system
  - Scripts are written in Pig Latin, a dataflow language
  - Developed by Yahoo!, now open source
  - Roughly 1/3 of all Yahoo! internal jobs
- Common idea:
  - Provide higher-level language to facilitate large-data processing
  - Higher-level language “compiles down” to Hadoop jobs

Hive: Background
- Started at Facebook
- Data was collected by nightly cron jobs into Oracle DB
- “ETL” via hand-coded Python
- Grew from 10s of GBs (2006) to 1 TB/day new data (2007), now 10x that
Source: cc-licensed slide by Cloudera

Hive Components
- Shell: allows interactive queries
- Driver: session handles, fetch, execute
- Compiler: parse, plan, optimize
- Execution engine: DAG of stages (MR, HDFS, metadata)
- Metastore: schema, location in HDFS, SerDe
Source: cc-licensed slide by Cloudera

Data Model
- Tables
  - Typed columns (int, float, string, boolean)
  - Also, list: map (for JSON-like data)
- Partitions
  - For example, range-partition tables by date
- Buckets
  - Hash partitions within ranges (useful for sampling, join optimization)
Source: cc-licensed slide by Cloudera

Metastore
- Database: namespace containing a set of tables
- Holds table definitions (column types, physical layout)
- Holds partitioning information
- Can be stored in Derby, MySQL, and many other relational databases
Source: cc-licensed slide by Cloudera

Physical Layout
- Warehouse directory in HDFS
  - E.g., /user/hive/warehouse
- Tables stored in subdirectories of warehouse
  - Partitions form subdirectories of tables
- Actual
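
The warehouse layout on the Physical Layout slide (warehouse directory, then table subdirectories, then partition subdirectories) can be illustrated with a small path-building sketch. This is only an illustration: the table name `page_views` and the partition columns `ds` and `country` are hypothetical, and the `column=value` directory naming is assumed here as Hive's usual convention rather than taken from the slide.

```python
from posixpath import join  # HDFS paths use forward slashes on every platform

# Example warehouse path quoted on the slide.
WAREHOUSE_DIR = "/user/hive/warehouse"


def table_dir(table, warehouse=WAREHOUSE_DIR):
    """Tables are stored in subdirectories of the warehouse directory."""
    return join(warehouse, table)


def partition_dir(table, partition_spec, warehouse=WAREHOUSE_DIR):
    """Partitions form subdirectories of the table directory.

    partition_spec is an ordered list of (column, value) pairs. The
    `column=value` directory naming is an assumption of this sketch.
    """
    parts = ["{}={}".format(col, val) for col, val in partition_spec]
    return join(table_dir(table, warehouse), *parts)


if __name__ == "__main__":
    # Hypothetical table and partition columns, for illustration only.
    print(table_dir("page_views"))
    print(partition_dir("page_views", [("ds", "2010-04-27"), ("country", "US")]))
```

Under these assumptions, a query restricted to one date only needs to read the files under that date's subdirectory, which is one reason to range-partition tables by date.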