Lineage Tracing for General Data Warehouse Transformations
Yingwei Cui and Jennifer Widom
Computer Science Department, Stanford University
Presentation by Aaron St.Clair

Outline
• What is lineage tracing?
• Why is tracing lineage data important?
• How can we find lineage data?
• Performance results

Data Warehouses
• Integrate data from multiple sources
• Data undergoes a series of transformations
• Transformations vary in complexity
[Figure: Data Source 1, Data Source 2, …, Data Source N → Transformation → Summarized Data]

Lineage Tracing
• Identifying the specific data items in the sources that derive a given data item in the warehouse
• Allows
– In-depth data analysis
– Data mining
– Authorization management
– View update
– Efficient warehouse recovery

An Example
• Selects items whose last quarter sales are more than twice the average of the last three quarters' sales

An Example
[figure]

Lineage Granularity
• Coarse-grained
– Schema-level, attribute mapping
• Fine-grained
– Set of source data items

Existing Work
• Mostly coarse-grained lineage
• Existing methods for fine-grained lineage
– Extra annotation
– Developer-defined weak inverses
– Statistical estimation
• Can't handle complex procedural transformations

Tracing Lineage - Definitions
• Data set: a set of data items without duplicates
• Transformation: any procedure that takes data sets as input and produces data sets as output
– Stable (no spurious output)
– Deterministic (under some conditions)
• Lineage of a data item: the set of input data items that contribute to that item

Determining Contributions
• Need to find relevant data items
– Easy for simple relational operators
– Difficult for procedural transformations
• Select positives vs. aggregation and sum

Lineage Tracing
• Use of hierarchical model
– Transformation classes
– Schema mappings
– Defined inverses

Transformation Classes
• A transformation's class determines the procedure used to find lineage
• For a dispatcher:
– Apply the transformation to each input item i in turn
– If T({i}) contains the output item, add i to that item's lineage

Schema Mappings
• Defined schema for the input and output of a transformation
• Backward key-maps
– A ←key g(B)
– T1
• Forward key-maps
– f(A) →key B
– T4
• Backward total-maps
– A ← g(B)
– T5

Provided Inverses/Tracing Procedures
• Best case: someone has defined a function mapping output items to the input items they derive from
• Nothing is known about the efficiency of that function

Property Hierarchy
[figure]

Finding Lineage
• Recursively apply algorithms based on the transformation type until we reach the top level

Optimizations
• Indexing the input data set improves performance
• A functional index using the schema optimizes queries of the form F(i) = v
• Store auxiliary or intermediate views in the warehouse
• Reduce the number of transformations by composing them

Transformation Graphs
• Create a tracing sequence for each path from input to output in the graph
• Combine the results of each sequence

Performance
• 1GB warehouse
• Schema mappings perform better than transformation-class-specific algorithms
• Indexing helps
• Combining attributes reduces trace
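
The dispatcher rule from the Transformation Classes slide and the functional-index idea from the Optimizations slide can be illustrated with a small Python sketch. This is only one reading of the slides, not the authors' implementation. The names select_positive, trace_dispatcher, build_functional_index and trace_with_backward_key_map are hypothetical, as are the sample data and the assumption that the input key A is the first field of each tuple.

```python
from collections import defaultdict

# Hypothetical dispatcher: each input item independently yields zero or
# one output item. Inputs are (order_id, amount) tuples.
def select_positive(items):
    return {(oid, "positive") for (oid, amount) in items if amount > 0}

def trace_dispatcher(transform, inputs, output_item):
    """Brute-force tracing for a dispatcher: apply the transformation to
    each input item alone and keep the items that produce output_item."""
    return {i for i in inputs if output_item in transform({i})}

def build_functional_index(inputs, key_of):
    """Index input items by key_of(i), so that a query of the form
    F(i) = v becomes a dictionary lookup instead of a scan."""
    index = defaultdict(set)
    for i in inputs:
        index[key_of(i)].add(i)
    return index

def trace_with_backward_key_map(index, g, output_item):
    """Assumed reading of a backward key-map: the lineage of an output
    item is the set of input items whose key equals g(output_item)."""
    return set(index.get(g(output_item), set()))

if __name__ == "__main__":
    source = {(1, 50), (2, -10), (3, 7)}
    target = (1, "positive")

    # Scan every input item, re-running the transformation each time.
    print(trace_dispatcher(select_positive, source, target))

    # Build the index once, then answer each trace with one lookup.
    idx = build_functional_index(source, key_of=lambda i: i[0])
    print(trace_with_backward_key_map(idx, lambda o: o[0], target))
```

Both calls return the same lineage for this data. The brute-force version re-runs the transformation once per input item, while the indexed version pays the cost of building the index once and then answers each trace with a single lookup, which is consistent with the slides' claim that indexing helps.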