Chico CSCI 693 - Optimizing ETL Processes in Data Warehouses


Optimizing ETL Processes in Data Warehouses

Alkis Simitsis, Nat. Tech. Univ. of Athens, [email protected]
Panos Vassiliadis, University of Ioannina, [email protected]
Timos Sellis, Nat. Tech. Univ. of Athens, [email protected]

Abstract

Extraction-Transformation-Loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, and for their cleansing, customization, and insertion into a data warehouse. Usually, these processes must be completed within a certain time window; thus, it is necessary to optimize their execution time. In this paper, we delve into the logical optimization of ETL processes, modeling it as a state-space search problem. We consider each ETL workflow as a state and construct the state space through a set of correct state transitions. Moreover, we provide algorithms for minimizing the execution cost of an ETL workflow.

1. Introduction

For quite a long time, research has treated data warehouses as collections of materialized views. Although this abstraction is elegant and possibly sufficient for examining alternative strategies for view maintenance, it does not capture the mechanisms employed in real-world settings. Indeed, in real-world data warehouse environments, instead of automated mechanisms for the refreshment of materialized views, operational processes are executed in order to export data from operational data sources, transform them into the format of the target tables, and finally load them into the data warehouse. The category of tools responsible for this task is generally called Extraction-Transformation-Loading (ETL) tools.
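The state-space formulation mentioned in the abstract can be illustrated with a minimal sketch: each state is an ordering of workflow activities, a transition swaps two adjacent activities when a legality predicate allows it, and the search keeps the cheapest state found. All names, the legality predicate, and the cost model below are illustrative assumptions, not the paper's actual algorithm.

```python
# Minimal sketch of a state-space search over ETL workflows (hypothetical).
# A "state" is a tuple of activity names; a transition swaps two adjacent
# activities when the (assumed) legality predicate can_swap allows it.

def neighbors(state, can_swap):
    """Generate all states reachable by one legal adjacent swap."""
    for i in range(len(state) - 1):
        a, b = state[i], state[i + 1]
        if can_swap(a, b):
            yield state[:i] + (b, a) + state[i + 2:]

def search_min_cost(start, can_swap, cost):
    """Exhaustively explore the state space, keeping the cheapest state."""
    seen = {start}
    frontier = [start]
    best = start
    while frontier:
        state = frontier.pop()
        if cost(state) < cost(best):
            best = state
        for nxt in neighbors(state, can_swap):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return best

# Toy usage: a filter is cheaper when executed early, and any pair may swap.
activities = ("extract", "convert", "filter")
cost = lambda s: sum(i for i, a in enumerate(s) if a == "filter")
best = search_min_cost(activities, lambda a, b: True, cost)
```

With this toy cost function, the search moves the filter to the front of the workflow, mirroring the classic "push selections early" heuristic discussed later in the paper.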
The functionality of these tools can be coarsely summarized in the following prominent tasks: (a) the identification of relevant information at the source side; (b) the extraction of this information; (c) the customization and integration of the information coming from multiple sources into a common format; (d) the cleaning of the resulting data set on the basis of database and business rules; and (e) the propagation of the data to the data warehouse and/or data marts.

So far, research has only partially dealt with the problem of designing and managing ETL workflows. Typically, research approaches concern (a) the optimization of stand-alone problems (e.g., the problem of duplicate detection [16]) in an isolated setting and (b) problems mostly related to web data (e.g., [7]). Recently, research on data streams [1], [2] has raised the possibility of taking an alternative look at the problem of ETL. Nevertheless, for the moment, research in data streaming has focused on different topics, such as on-the-fly computation of queries [1], [2]. To our knowledge, there is no systematic treatment of the problem of designing an optimal ETL workflow. On the other hand, leading commercial tools [9], [10], [13], [14] allow the design of ETL workflows but do not apply any optimization technique. The designed workflows are propagated to the DBMS for execution; thus, the DBMS undertakes the task of optimization. Clearly, we can do better than this, because an ETL process cannot be considered a "big" query. Instead, it is more realistic to treat an ETL process as a complex transaction. In addition, in an ETL workflow there are processes that run in separate environments, usually not simultaneously, and under time constraints. One could argue that we could express all ETL operations in terms of relational algebra and then optimize the resulting expression as usual.
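The five tasks (a)-(e) can be pictured as stages composed into a workflow. The sketch below is illustrative only: the sources, the business rule, and the field names are invented for the example, and no real extraction or loading takes place.

```python
# Hypothetical sketch: the five ETL tasks (a)-(e) as composed functions.

def identify(sources):                      # (a) find relevant information
    return [s for s in sources if s["relevant"]]

def extract(sources):                       # (b) pull the rows out
    return [row for s in sources for row in s["rows"]]

def integrate(rows):                        # (c) customize to a common format
    return [{"pkey": r["pkey"], "cost": float(r["cost"])} for r in rows]

def clean(rows):                            # (d) apply an assumed business rule
    return [r for r in rows if r["cost"] >= 0]

def load(rows, warehouse):                  # (e) propagate to the warehouse
    warehouse.extend(rows)
    return warehouse

warehouse = []
sources = [
    {"relevant": True, "rows": [{"pkey": 1, "cost": "10.5"},
                                {"pkey": 2, "cost": "-3"}]},
    {"relevant": False, "rows": [{"pkey": 9, "cost": "7"}]},
]
load(clean(integrate(extract(identify(sources)))), warehouse)
```

The point of the composition is that each stage is an independent activity with its own cost, which is exactly what makes the ordering of stages an optimization problem rather than a single query.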
In this paper we demonstrate that traditional logic-based algebraic query optimization can be blocked, basically due to the existence of data manipulation functions. Consider the example of Fig. 1, which describes the population of a table of a data warehouse DW from two source databases S1 and S2. In particular, it involves the propagation of data from the recordset PARTS1(PKEY,SOURCE,DATE,COST) of source S1, which stores monthly information, as well as from the recordset PARTS2(PKEY,SOURCE,DATE,DEPT,COST) of source S2, which stores daily information. In the DW, PARTS(PKEY,SOURCE,DATE,COST) stores monthly information for the cost in Euros (COST) of parts (PKEY) per source (SOURCE). We assume that both the first supplier and the data warehouse are European and the second supplier is American; thus, the data coming from the second source need to be converted to European values and formats.

Proceedings of the 21st International Conference on Data Engineering (ICDE 2005), 1084-4627/05 $20.00 © 2005 IEEE.

Figure 1. A simple ETL workflow

In Fig. 1, activities are numbered with their execution priority and tagged with a description of their functionality. The flow for source S1 is: (3) a check for Not Null values is performed on attribute COST. The flow for source S2 is: (4) Dollar costs ($COST) are converted to Euros (€COST); (5) dates (DATE) are converted from American to European format; (6) an aggregation for monthly supplies is performed, and the unnecessary attribute DEPT (for department) is discarded from the flow. The two flows are then unified (7) and, before being loaded to the warehouse, a final check is performed on the €COST attribute (8), ensuring that only values above a certain threshold are propagated to the warehouse. There are several interesting problems and optimization opportunities in the example of Fig. 1:
- Traditional query optimization techniques should be directly applicable.
For example, it is desirable to push selections all the way to the sources, in order to avoid processing unnecessary rows.
- Is it possible to push the selection for values above a certain threshold early enough in the workflow? As far as the flow for source PARTS1 is concerned, this is straightforward (exactly as in the relational case). As far as the second flow is concerned, however, the selection must be performed after the conversion of dollars to Euros. In other words, the activity performing the selection cannot be pushed before the activity applying the conversion function.
- Is it possible to perform the aggregation before the transformation of American values to European ones? In principle, this should be allowed, since the dates are kept in the resulting data and can be transformed later. In this case, the aggregation operations can be pushed before the function (as opposed
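The two flows of Fig. 1 can be re-created as a small runnable sketch. The exchange rate, the date formats, the threshold, and the sample rows are assumptions made for the example; the point is the ordering constraint discussed above, namely that the €COST selection (8) on the S2 flow must run after the $-to-€ conversion (4).

```python
from datetime import datetime

# Hypothetical re-creation of the Fig. 1 workflow; RATE, THRESHOLD, date
# formats, and sample rows are assumed values, not taken from the paper.
RATE = 0.9          # assumed $ -> € exchange rate
THRESHOLD = 5.0     # assumed minimum €COST for activity (8)

parts1 = [{"PKEY": 1, "SOURCE": "S1", "DATE": "2004-01", "COST": 12.0},
          {"PKEY": 2, "SOURCE": "S1", "DATE": "2004-01", "COST": None}]
parts2 = [{"PKEY": 3, "SOURCE": "S2", "DATE": "01/15/2004", "DEPT": "A", "COST": 10.0},
          {"PKEY": 3, "SOURCE": "S2", "DATE": "01/20/2004", "DEPT": "B", "COST": 4.0}]

# (3) Not-null check on COST for the S1 flow.
flow1 = [r for r in parts1 if r["COST"] is not None]

# (4) $ -> € conversion and (5) American -> European date format on S2;
# the selection on €COST cannot be applied before this conversion.
for r in parts2:
    r["COST"] = r["COST"] * RATE
    r["DATE"] = datetime.strptime(r["DATE"], "%m/%d/%Y").strftime("%Y-%m")

# (6) Monthly aggregation, discarding the unnecessary attribute DEPT.
monthly = {}
for r in parts2:
    key = (r["PKEY"], r["SOURCE"], r["DATE"])
    monthly[key] = monthly.get(key, 0.0) + r["COST"]
flow2 = [{"PKEY": p, "SOURCE": s, "DATE": d, "COST": c}
         for (p, s, d), c in monthly.items()]

# (7) Union of the two flows, then (8) selection on €COST before loading.
parts_dw = [r for r in flow1 + flow2 if r["COST"] > THRESHOLD]
```

Note that if the threshold filter were applied to parts2 before the conversion, the $4.00 row would pass the dollar-valued test yet fail it in Euros, which is precisely why the selection cannot be pushed past the conversion function on that flow.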

