NYU CSCI-GA 3033 - Data Quality


Data Quality
Class 2
David Loshin

Goals
• Cost of low data quality
• Mapping the information chain
• Data quality impacts
• Economic measures
• Impact domains
• Building the Data Quality ROI Model

Goals 2
• Data Cleansing Project
  – Goal of the application
  – Components of the application

Cost of Low Data Quality
• Data quality is measured using anecdotes
• "Hazy" feeling of wrongness
• Desire to gauge the true cost of poor data quality

5 Steps
• Map the information chain
• Categorize costs associated with low data quality
• Identify and estimate actual effect
• Determine cost of fixing problem
• Calculate Return on Investment (ROI)

Evidence of Economic Impact
• Frequent service interruptions and system failures
• Drop in productivity vs. volume
• High employee turnover
• High new business/continued business ratio
• Increased customer service requirements
• Customer attrition

The Information Chain
• Data flow model
• Processing stages
• Communication/data transfer

The Information Chain 2
• Data Supply
• Data Acquisition
• Data Creation
• Data Processing
• Data Packaging
• Decision Making
• Decision Implementation
• Data Delivery
• Data Consumption

Information Chain 3
• Information chain = data flow graph
• Processing stages are vertices in the graph
• Directed message-passing channels = directed edges
• Examples

Impacts of Low Data Quality
• Hard impacts: can be estimated and/or measured
• Soft impacts: hard to measure, but definitely evident

Hard Impacts
• Customer attrition
• Costs attributed to error detection
• Costs attributed to error rework
• Costs attributed to prevention of errors
• Costs associated with customer service
• Costs associated with fixing customer problems
• Costs associated with enterprise-wide data inconsistency
• Costs attributable to delays in processing

Soft Impacts
• Difficulty in decision making
• Time delays in operation
• Organizational mistrust
• Lowered ability to effectively compete
• Data ownership conflicts
• Lowered employee satisfaction

Economic Measures
• Cost Increase
• Revenue Decrease
• Cost Decrease
• Revenue Increase
• Delay
• Speedup
• Increased Satisfaction
• Decreased Satisfaction

Impact Domains
• Operational
• Tactical/Strategic

Operational Impacts
• Detection
• Correction
• Rollback
• Rework
• Prevention
• Warranty
• Reduction
• Attrition
• Blockading

Tactical/Strategic Impacts
• Delays
• Preemption
• Idling
• Increased difficulty
• Lost opportunities
• Organizational mistrust
• Alignment
• Acquisition overhead
• Decay
• Infrastructure

Putting it Together
• Map the information chain
• Conduct interviews to locate data quality problems
• Annotate the information chain with the location of data quality problems
• Identify impact domains for each problem
• Characterize economic impact (= cost!)
• Aggregate totals

ROI Model
• Create a spreadsheet with assigned costs
• Add in costs of improvements
• Determine best return on investment

Data Cleansing Project
• Write an application to cleanse data
  – Record parsing
  – Metadata cleansing
  – Data standardization
  – Data correction
  – Data enhancement

Record Parsing
• Data element types
  – first names
  – last names
  – honorifics
  – titles
  – street names
  – directions
  – business words
  – etc.

Data Domains
• Data types
• Subclassed data types = domains
• Mappings between domains

Data Domains 2
• Data type = char(2)
  – 676 possible non-punctuation members
• Data domain: US state abbreviations
  – 62 possible members
• Subclassed data domain: "New England"
  – {"ME", "NH", "VT", "MA", "CT", "RI"}

Data Domains 3
• Enumerated domains
  – All values are explicit
• Rule-based domains
  – Domain definition is generative

Record Parsing
• Tokenizing data elements within an attribute
• Assign meaning to tokens
  – Domain membership
  – Patterns
  – Context

Tokenizing
• Straightforward
  – White-space separated
  – Punctuation: important or not?
  – Result: stream of tokens

Domain Membership
• Can each token be assigned to a domain?
  – Based strictly on token value
  – Based on patterns
  – Based on context

Domain Membership 2
• Domains can be maintained in memory using hash tables
• Searching for domain membership is the same as a hash table lookup
• What if a token belongs to more than one domain?

Patterns
• Certain kinds of data attributes are organized around token patterns
• Example: names can appear using these kinds of patterns:
  – (title) (first) (middle) (last)
  – (title) (first) (initial) (last)
  – (first) (middle) (last)
  – (last) (comma) (first) (middle)
  – etc.

Context
• What happens when a token belongs to more than one domain?
• We can use context to infer the decision
• Build weights based on frequency = training

Next Week
• Dimensions of Data Quality
• Project
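
The "ROI Model" slide describes a spreadsheet of assigned costs and improvement costs. A minimal Python sketch of that calculation might look like the following. The problem names, dollar figures, the `fraction_fixed` field, and the ROI formula (net savings divided by fix cost) are all illustrative assumptions, not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    """One row of a hypothetical ROI spreadsheet."""
    name: str
    annual_impact: float   # estimated yearly cost of the problem (assumed figure)
    fix_cost: float        # estimated cost of the improvement (assumed figure)
    fraction_fixed: float  # share of the impact the fix is expected to remove

    def roi(self) -> float:
        # Assumed definition: (savings - cost of fix) / cost of fix
        savings = self.annual_impact * self.fraction_fixed
        return (savings - self.fix_cost) / self.fix_cost


def rank_by_roi(problems):
    """Order problems from best to worst return on investment."""
    return sorted(problems, key=lambda p: p.roi(), reverse=True)


# Hypothetical rows; none of these numbers come from the lecture.
problems = [
    Problem("duplicate customer records", 120_000.0, 40_000.0, 0.75),
    Problem("invalid postal codes", 30_000.0, 5_000.0, 0.5),
    Problem("late order feeds", 80_000.0, 100_000.0, 0.5),
]

for p in rank_by_roi(problems):
    print(f"{p.name}: ROI = {p.roi():.2f}")
```

A negative ROI in this sketch means the fix would cost more than the impact it is expected to remove in the period considered.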
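
The "Tokenizing", "Data Domains", and "Domain Membership" slides can be sketched together. In the sketch below, enumerated domains are Python sets (hash-based, so membership is a hash lookup) and a rule-based domain is a regular expression. Only the New England set comes from the slides; the `street_suffix` and `zip_code` domains, the function names, and the sample address are assumptions made for illustration.

```python
import re

# Enumerated domains: every value is listed explicitly.
ENUMERATED = {
    "new_england_state": {"ME", "NH", "VT", "MA", "CT", "RI"},  # from the slides
    "street_suffix": {"ST", "AVE", "RD", "CT"},                 # hypothetical
}

# Rule-based domains: membership is defined by a rule, not a list.
RULE_BASED = {
    "zip_code": re.compile(r"\d{5}(-\d{4})?"),                  # hypothetical
}


def tokenize(text, keep_punctuation=False):
    """Split an attribute value into a stream of tokens."""
    pattern = r"\w+|[^\w\s]" if keep_punctuation else r"\w+"
    return re.findall(pattern, text)


def domains_of(token):
    """Return every domain the token could belong to, based on value alone."""
    value = token.upper()
    found = [name for name, members in ENUMERATED.items() if value in members]
    found += [name for name, rule in RULE_BASED.items() if rule.fullmatch(value)]
    return sorted(found)


for tok in tokenize("12 Elm Ct., Hartford CT 06103"):
    print(tok, domains_of(tok))
```

Here "CT" falls into two domains at once (a state abbreviation and a street suffix), which is the ambiguity the "Domain Membership 2" slide asks about; value-based lookup alone cannot settle it.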
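
The "Patterns" and "Context" slides suggest resolving ambiguous tokens with frequency-based weights. One possible reading, sketched below, counts how often each domain follows the previous one in labeled examples and then picks the most frequent option, left to right. The greedy strategy, the `train` and `resolve` helpers, and the training records are assumptions for illustration; the slides do not specify an algorithm.

```python
from collections import Counter

# Name patterns listed on the "Patterns" slide.
NAME_PATTERNS = [
    ("title", "first", "middle", "last"),
    ("title", "first", "initial", "last"),
    ("first", "middle", "last"),
    ("last", "comma", "first", "middle"),
]


def train(labeled_records):
    """Count how often each domain follows each preceding domain."""
    weights = Counter()
    for labels in labeled_records:
        prev = "START"
        for label in labels:
            weights[(prev, label)] += 1
            prev = label
    return weights


def resolve(candidates, weights):
    """Greedily pick one domain per token using the trained weights."""
    chosen, prev = [], "START"
    for options in candidates:
        # Ties go to the first option listed for the token.
        best = max(options, key=lambda d: weights[(prev, d)])
        chosen.append(best)
        prev = best
    return chosen


# Hypothetical labeled training data (one label sequence per record).
training = [
    ("title", "first", "middle", "last"),
    ("first", "middle", "last"),
    ("first", "middle", "last"),
    ("last", "comma", "first", "middle"),
]
weights = train(training)

# Candidate domains per token for a four-token name; only the first is unambiguous.
candidates = [["title"], ["first", "last"], ["first", "middle", "last"], ["first", "last"]]
labels = resolve(candidates, weights)
print(labels, tuple(labels) in NAME_PATTERNS)
```

A greedy left-to-right choice is the simplest option and can commit to a wrong label early; scoring whole candidate sequences against the pattern list would be a natural refinement.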

