ICS 278: Data Mining
Lecture 2: Measurement and Data

Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine

Slides:
• Today’s lecture
• Measurement
• Nominal variable
• Measurements, cont.
• Hierarchy of Measurements
• Scales
• Mixed data
• Other Kinds of Measurements
• Distance Measures
• Vector data and distance matrices
• Distance
• Standardization
• Weighted Euclidean distance
• Other Distance Metrics
• Additive Distances
• Dependence among Variables
• Sample correlation coefficient
• Sample Correlation Matrix
• Mahalanobis distance
• What about…
• Binary Vectors
• Other distance metrics
• Transforming Data
• Data Quality
• Next Lecture


Today’s lecture

• Feedback on quiz
  – Supplementary reading material: package being prepared
• Update on projects
• Office hours tomorrow: 9 to 10am
• Outline of today’s lecture:
  – Finish material from Lecture 1
  – Chapter 2: Measurement and Data
    • Types of measurement
    • Distance measures
    • Data quality issues


Measurement

[Figure: mapping domain entities to symbolic representations. Labels: Real world, Data, Relationship in real world, Relationship in data]


Nominal variable

http://trochim.human.cornell.edu/kb/measlevl.htm

Here, numerical values just "name" the attribute uniquely. No ordering is implied. E.g., jersey numbers in basketball: a player with number 30 is not more of anything than a player with number 15, and is certainly not twice whatever number 15 is.


Measurements, cont.

ordinal measurement: attributes can be rank-ordered. Distances between attributes do not have any meaning. E.g., on a survey you might code Educational Attainment as 0 = less than H.S.; 1 = some H.S.; 2 = H.S. degree; 3 = some college; 4 = college degree; 5 = post college. In this measure, higher numbers mean more education. But is the distance from 0 to 1 the same as from 3 to 4? No. The interval between values is not interpretable in an ordinal measure.

interval measurement: distance between attributes does have meaning. E.g., when we measure temperature (in Fahrenheit), the distance from 30 to 40 is the same as the distance from 70 to 80. The interval between values is interpretable. Averages make sense, but ratios don't: 80 degrees is not twice as hot as 40 degrees.


Measurements, cont.

ratio measurement: there is an absolute zero that is meaningful. This means that you can construct a meaningful fraction (or ratio) with a ratio variable. Weight is a ratio variable. In applied social research most "count" variables are ratio, for example, the number of clients in the past six months. Why? Because you can have zero clients and because it is meaningful to say that "...we had twice as many clients in the past six months as we did in the previous six months."


Hierarchy of Measurements


Scales

scale      Legal transforms                       example
nominal    Any one-one mapping                    Hair color, employment
ordinal    Any order preserving transform         Severity, preference
interval   Multiply by constant, add a constant   Temperature, calendar time
ratio      Multiply by constant                   Weight, income


Why is this important?

• As we will see….
  – Many models require data to be represented in a specific form
  – e.g., real-valued vectors
    • Linear regression, neural networks, support vector machines, etc
    • These models implicitly assume interval-scale data (at least)
  – What do we do with non-real valued inputs?
    • Nominal with M values:
      – Not appropriate to “map” to 1 to M (maps to an interval scale)
      – Why? w_1 × employment_type + w_2 × city_name
      – Could use M binary “indicator” variables
        » But what if M is very large? (e.g., cluster into groups of values)
    • Ordinal?


Mixed data

• Many real-world data sets have multiple types of variables,
  – e.g., demographic data sets for marketing
  – Nominal: employment type, ethnic group
  – Ordinal: education level
  – Interval: income, age
• Unfortunately, many data analysis algorithms are suited to only one type of data (e.g., interval)
• Exception: decision trees
  – Trees operate by subgrouping variable values at internal nodes
  – Can operate effectively on binary, nominal, ordinal, interval
  – We will see more details later…..


Other Kinds of Measurements

• “Derived variables”
  – An operational or non-representational measurement: both defines the property and assigns a number to it.
  – Examples: quality of life in medicine, effort in software engineering
      a = # of unique operators in program
      b = # of unique operands
      n = total # of operator occurrences
      m = total # of operand occurrences
      Programming effort: e = a m (n + m) log(a + b) / (2b)
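
The programming-effort formula on the last slide can be evaluated directly from the four counts. The sketch below is only an illustration: the function name `programming_effort`, the example counts, and the choice of a base-2 logarithm are assumptions, since the slide writes "log" without a base.

```python
import math


def programming_effort(a, b, n, m, log=math.log2):
    """Derived 'effort' variable: e = a*m*(n+m)*log(a+b) / (2*b).

    a: number of unique operators in the program
    b: number of unique operands
    n: total number of operator occurrences
    m: total number of operand occurrences
    log: logarithm function; base 2 is an assumption, the slide gives no base
    """
    if a <= 0 or b <= 0:
        # log(a+b) and the division by 2*b need positive unique counts
        raise ValueError("a and b must be positive")
    return a * m * (n + m) * log(a + b) / (2 * b)


# Hypothetical counts for a small program (not taken from the lecture)
effort = programming_effort(a=10, b=6, n=30, m=20)
print(round(effort, 2))
```

Switching the logarithm base only multiplies every effort value by the same constant, so comparisons between programs are unaffected by that choice.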

