Bigtable, Hive, and Pig

Data-Intensive Information Processing Applications - Session #12
Jimmy Lin
University of Maryland
Tuesday, April 27, 2010

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license.
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Source: Wikipedia (Japanese rock garden)

Today’s Agenda
- Bigtable
- Hive
- Pig

BIGTABLE

Data Model
- A table in Bigtable is a sparse, distributed, persistent multidimensional sorted map
- Map indexed by a row key, column key, and a timestamp
  - (row:string, column:string, time:int64) → uninterpreted byte array
- Supports lookups, inserts, deletes
  - Single row transactions only
Image Source: Chang et al., OSDI 2006

Rows and Columns
- Rows maintained in sorted lexicographic order
  - Applications can exploit this property for efficient row scans
  - Row ranges dynamically partitioned into tablets
- Columns grouped into column families
  - Column key = family:qualifier
  - Column families provide locality hints
  - Unbounded number of columns

Bigtable Building Blocks
- GFS
- Chubby
- SSTable

SSTable
- Basic building block of Bigtable
- Persistent, ordered immutable map from keys to values
  - Stored in GFS
- Sequence of blocks on disk plus an index for block lookup
  - Can be completely mapped into memory
- Supported operations:
  - Look up value associated with key
  - Iterate key/value pairs within a key range
[Figure: an SSTable consisting of an index and three 64K blocks]
Source: Graphic from slides by Erik Paulson

Tablet
- Dynamically partitioned range of rows
- Built from multiple SSTables
[Figure: a tablet (Start: aardvark, End: apple) built from two SSTables, each an index plus 64K blocks]
Source: Graphic from slides by Erik Paulson

Table
- Multiple tablets make up the table
- SSTables can be shared
[Figure: two tablets (aardvark to apple, apple_two_E to boat) over four SSTables]
Source: Graphic from slides by Erik Paulson

Architecture
- Client library
- Single master server
- Tablet servers

Bigtable Master
- Assigns tablets to tablet servers
- Detects addition and expiration of tablet servers
- Balances tablet server load
- Handles garbage collection
- Handles schema changes

Bigtable Tablet Servers
- Each tablet server manages a set of tablets
  - Typically between ten to a thousand tablets
  - Each 100-200 MB by default
- Handles read and write requests to the tablets
- Splits tablets that have grown too large

Tablet Location
- Upon discovery, clients cache tablet locations
Image Source: Chang et al., OSDI 2006

Tablet Assignment
- Master keeps track of:
  - Set of live tablet servers
  - Assignment of tablets to tablet servers
  - Unassigned tablets
- Each tablet is assigned to one tablet server at a time
- Tablet server maintains an exclusive lock on a file in Chubby
- Master monitors tablet servers and handles assignment
- Changes to tablet structure
  - Table creation/deletion (master initiated)
  - Tablet merging (master initiated)
  - Tablet splitting (tablet server initiated)

Tablet Serving
- “Log Structured Merge Trees”
Image Source: Chang et al., OSDI 2006

Compactions
- Minor compaction
  - Converts the memtable into an SSTable
  - Reduces memory usage and log traffic on restart
- Merging compaction
  - Reads the contents of a few SSTables and the memtable, and writes out a new SSTable
  - Reduces number of SSTables
- Major compaction
  - Merging compaction that results in only one SSTable
  - No deletion records, only live data

Bigtable Applications
- Data source and data sink for MapReduce
- Google’s web crawl
- Google Earth
- Google Analytics

Lessons Learned
- Fault tolerance is hard
- Don’t add functionality before understanding its use
  - Single-row transactions appear to be sufficient
- Keep it simple!

HBase
- Open-source clone of Bigtable
- Implementation hampered by lack of file append in HDFS
Image Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

HIVE AND PIG

Need for High-Level Languages
- Hadoop is great for large-data processing!
  - But writing Java programs for everything is verbose and slow
  - Not everyone wants to (or can) write Java code
- Solution: develop higher-level data processing languages
  - Hive: HQL is like SQL
  - Pig: Pig Latin is a bit like Perl

Hive and Pig
- Hive: data warehousing application in Hadoop
  - Query language is HQL, variant of SQL
  - Tables stored on HDFS as flat files
  - Developed by Facebook, now open source
- Pig: large-scale data processing system
  - Scripts are written in Pig Latin, a dataflow language
  - Developed by Yahoo!, now open source
  - Roughly 1/3 of all Yahoo! internal jobs
- Common idea:
  - Provide higher-level language to facilitate large-data processing
  - Higher-level language “compiles down” to Hadoop jobs

Hive: Background
- Started at Facebook
- Data was collected by nightly cron jobs into Oracle DB
- “ETL” via hand-coded Python
- Grew from 10s of GBs (2006) to 1 TB/day new data (2007), now 10x that
Source: cc-licensed slide by Cloudera

Hive Components
- Shell: allows interactive queries
- Driver: session handles, fetch, execute
- Compiler: parse, plan, optimize
- Execution engine: DAG of stages (MR, HDFS, metadata)
- Metastore: schema, location in HDFS, SerDe
Source: cc-licensed slide by Cloudera

Data Model
- Tables
  - Typed columns (int, float, string, boolean)
  - Also, list: map (for JSON-like data)
- Partitions
  - For example, range-partition tables by date
- Buckets
  - Hash partitions within ranges (useful for sampling, join optimization)
Source: cc-licensed slide by Cloudera

Metastore
- Database: namespace containing a set of tables
- Holds table definitions (column types, physical layout)
- Holds partitioning information
- Can be stored in Derby, MySQL, and many other relational databases
Source: cc-licensed slide by Cloudera

Physical Layout
- Warehouse directory in HDFS
  - E.g., /user/hive/warehouse
- Tables stored in subdirectories of warehouse
  - Partitions form subdirectories of tables
- Actual
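
Illustrative sketch (not part of the original slides): the Bigtable slides on the data model, SSTables, and compactions can be tied together with a toy, in-memory model of a single tablet. Everything here is an assumption made for illustration: the names Tablet, SSTable, TOMBSTONE, and scan_rows are hypothetical and are not Bigtable or HBase APIs, timestamps are assumed to be non-negative integers, and the newest timestamp for a cell wins. Merging compaction is left out; only minor and major compaction are shown.

```python
# Toy model of one Bigtable-style tablet: a memtable plus immutable SSTables.
# All names are hypothetical; this is a sketch, not a real storage engine.

TOMBSTONE = object()  # stand-in for a deletion record


class SSTable:
    """Immutable map from (row, column, timestamp) keys to values, kept sorted."""

    def __init__(self, cells):
        # Sort by key only, so values never need to be comparable.
        self._items = tuple(sorted(cells.items(), key=lambda kv: kv[0]))

    def items(self):
        return iter(self._items)


class Tablet:
    def __init__(self):
        self.memtable = {}   # recent writes, mutable
        self.sstables = []   # older data, immutable, newest first

    def write(self, row, column, ts, value):
        self.memtable[(row, column, ts)] = value

    def delete(self, row, column, ts):
        self.memtable[(row, column, ts)] = TOMBSTONE

    def _all_cells(self):
        yield from self.memtable.items()
        for table in self.sstables:
            yield from table.items()

    def read(self, row, column):
        """Return the newest value for a cell, or None if absent or deleted."""
        versions = [(k[2], v) for k, v in self._all_cells() if k[:2] == (row, column)]
        if not versions:
            return None
        _, value = max(versions, key=lambda pair: pair[0])
        return None if value is TOMBSTONE else value

    def scan_rows(self, start, end):
        """Sorted row keys in [start, end) whose newest cell version is live."""
        latest = {}
        for (row, col, ts), value in self._all_cells():
            if start <= row < end:
                current = latest.get((row, col))
                if current is None or ts > current[0]:
                    latest[(row, col)] = (ts, value)
        return sorted({row for (row, _), (_, v) in latest.items() if v is not TOMBSTONE})

    def minor_compaction(self):
        """Convert the memtable into a new SSTable and start a fresh memtable."""
        if self.memtable:
            self.sstables.insert(0, SSTable(self.memtable))
            self.memtable = {}

    def major_compaction(self):
        """Merge everything into one SSTable with no deletion records."""
        cells = dict(self._all_cells())
        deleted_at = {}
        for (row, col, ts), value in cells.items():
            if value is TOMBSTONE:
                deleted_at[(row, col)] = max(ts, deleted_at.get((row, col), ts))
        live = {
            key: value
            for key, value in cells.items()
            if value is not TOMBSTONE and key[2] > deleted_at.get(key[:2], -1)
        }
        self.memtable = {}
        self.sstables = [SSTable(live)]
```

A possible usage: write two versions of a cell, call minor_compaction() to freeze them into an SSTable, delete another cell, and then call major_compaction(); the single resulting SSTable holds only live data, which mirrors the "no deletion records, only live data" point on the Compactions slide.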

