Data Warehousing Why Data Warehousing Architectures Two tier vs Three tier What s a Data Mart Data Characteristics ETL steps and issues Star Schema Size does matter 1 Data Warehousing Information Gap Drowning in data but starving for information IS primarily supporting Operations There is a need to support strategic decision making 2 Data Warehousing So what s happening now Business needs an integrated company wide view of high quality information Usual operational data does not suffice Need informational type data that crosses organizational boundaries with time histories Extract Information from Data take action 3 Data Warehousing Information Knowledge Competitive Advantage Extractible knowledge Crypted knowledge Visible Information The full Data Warehouse Superficial Data 4 Data Warehousing Data Warehouse a subject oriented integrated time variant nonvolatile collection of data Organized around key subjects customers sales products of the enterprise Stored data has consistent naming conventions formats structures etc Stored data contains a time dimension history Stored data is continuously loaded refreshed but cannot be updated by users read only 5 Data Warehousing Operational Data from Various Sources Clean Combine Data Warehouse informational data Decision Makers 6 Data Warehousing Architectures Two level Small to medium sized firms Three level Large firms and or those with heterogeneous data Data warehouse large difficult to use directly 2 level Operationa l Data from Various Sources Select aggregate Clean Combine Data Warehouse informational data 3 level Data Mart 7 Data Warehousing To build a data warehouse must do data reconciliation commonly known in the field as ETL Extract Transform Load 1 Data are extracted from various source databases could be hundreds 2 Data is transformed cleaned and integrated 3 Then loaded into a data warehouse 8 Steps in ETL Static extract capturing a Static extract snapshot of the source data at a point in time Incremental extract Incremental extract capturing changes that have occurred since the last static extract 9 Steps in ETL Scrub cleanse uses pattern recognition and AI techniques to upgrade data quality Fixing errors misspellings Fixing errors erroneous dates incorrect field usage mismatched addresses missing data duplicate data inconsistencies Also Also decoding reformatting time stamping conversion key generation merging error detection logging locating missing data 10 Steps in ETL Transform convert data from format of operational system to format of data warehouse Record level Record level Selection data partitioning Joining data combining Aggregation data summarization Field level Field level single field from one field to one field multi field from many fields to one or one field to many 11 Steps in ETL Load Index place transformed data into the warehouse and create indexes Refresh mode bulk rewriting of Refresh mode target data at periodic intervals Update mode only changes in Update mode source data are written to data warehouse 12 ETL Issues Extract Finance Oracle DB Sales Mktg HP Compaq Data Warehouse Project Eng IBM Mainframe 13 ETL Issues Data Scrubbing hardest step Misspelled names addresses abbreviations Impossible or erroneous data negative weights Useless data 50 marked other Data in the wrong fields Mismatched data addresses ZIP codes Missing data Data errors in operational data common Data gatherers usually not data users Thus commitment to good process unlikely 14 Data Warehousing example USPS Air Irregularities Data Warehouse Air Irregularity Commercial air carriers carry US Mail They make mistakes Miss flights delayed flights wrong flights damage mail Data Warehouse seeking systemic errors Air carrier performance US Air in Boston USPS airport facility Chicago O Hare facility Data Warehouse Collected data on all scheduled flights Collected data on actual mail delivery performance On time performance actual mail volume vs planned Result Data Warehouse useless Too much missing incorrect data to make decisions 15 ETL Issues Transformation combining data Field transformation Converting fields to consistent format before combining Format integer vs floating algorithm F vs C Currency exchange rates Selection subsetting Extract what you need from raw operational data Joining Aggregation Combining data from various sources rolling up or summarizing detailed data 16 ETL Issues Load Index Think about the rate of arrival of data Every transaction from all cash registers at all Macy s stores Visa applications searches from all embassies offices for State Dept around the world 17 Data Marts Derived Data extracts for Data Marts Reduce population of data to improve response E g aggregation country region state city Customize for a particular target group sales not interested in finance info Aggregate by time only interested in last 2 years of data or only want quarterly summaries Move to a machine closer to end users to improve response time 18 Data Warehousing Structure Star Schema Many Dimension Tables contain key business information usually about one subject Feed one Fact Table joined Can be very large multiplicative Variable grain usually time daily weekly monthly store Fact Table product period vendors 19 Data Warehousing Table Size Fact Table Size Join dimension tables Example Stores 1000 100 bytes record Products 10 000 100 bytes record Periods 24 months 20 bytes record Vendors 4 vendors per product average 100 bytes record Fact Table 1 000 x 10 000 x 24 x 4 960 000 000 records Assume we use store id from Stores product id from Products vendor id from Products period id from Periods plus sales amt new Each field is 10 bytes 5 columns x 10 bytes per column 50 bytes row Total size rows x bytes row 50 x 960 000 000 48 000 000 000 48 gigabytes Size Does Matter 20 Star schema example 21
View Full Document