5/12/10

Data Mining (Knowledge Discovery)
CISC437/637, Lecture
Ben Carterette
Copyright © Ben Carterette

Introduction to Data Mining
• Data mining is the automatic exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and understandable patterns in data
  – Valid: the patterns hold in general
  – Novel: something we didn't know before
  – Useful: we can use the patterns to accomplish something
  – Understandable: we can comprehend why the patterns hold
• Example: if degree=MS and income>75000, then credit=Excellent (with 82% probability)

Two Classes of Patterns
• Predictions about new data items based on knowledge about existing data items
  – Classification algorithms predict categorical variables
  – Regression models predict numeric variables
• Associations between two or more field values across records
  – Clustering discovers groupings of records according to field values

Variable Types
• Variable type depends on domain:
  – Numerical: domain is numeric (integer, real, etc.)
  – Nominal or categorical: domain is a finite set with no natural ordering (e.g. occupation, gender, etc.)
  – Ordinal: domain is ordered (e.g. preference scales, severity of injury, etc.)
• The types of variables determine what kinds of data mining techniques might be used

Classification
• The goal of classification is to predict the value of a categorical variable given one or more numeric, ordinal, or categorical variables
• Definitions:
  – C = categorical variable/class labels
  – X1, …, Xk = variables/features/attributes
  – F: X1 × X2 × … × Xk → C is the classification function
• Given training data with features X1…Xk and known class C, find the best F

Classification
• We do not expect that the classification will be exact
  – Errors will occur; our goal is to minimize them
• The function F should:
  – Be able to classify items with high accuracy
  – Produce information (the class) that we don't have and that is useful for something
  – Be interpretable by those using it (i.e. it should make sense)

Decision Trees
• Decision trees are a popular algorithm for classification
• A decision tree is basically a hierarchy of if/then/else (or switch/case) statements
  – Each node corresponds to a switch/case on a variable
  – The case determines which node to go to next
  – The leaves of the tree correspond to a predicted class label for the record
• But instead of being hand-coded, the cases are automatically learned from data

Learning Decision Trees
• Start with a training set: data records from which we can compute attributes X1,…,Xk and class label C
• Recursive algorithm outline: LearnDecisionTree(R)
  – Choose a partitioning attribute Xi
    • If no partitioning is possible, return
  – Split records in R into groups R1, R2, …, Rj by values of Xi
  – For each group Ri,
    • LearnDecisionTree(Ri)

Choosing Attributes
• The best attribute is the one that most cleanly separates records according to their class
• Calculate proportions of records remaining in each class
  – pi = (# records in class i) / (total # of records)
• Use the idea of entropy from information theory
  – Entropy = H(Y) = number of bits needed to encode Y
• Entropy of class: H(C) = −Σi pi log2 pi

Choosing Attributes
• Calculate conditional entropy of C given Xi (denoted H(C|Xi)) as follows:
  – For each attribute Xi,
    • Let H(C|Xi) = 0
    • For each unique value xj of Xi,
      – Calculate H(C) for records with Xi=xj
      – Calculate pij = proportion of records with Xi=xj
      – H(C|Xi) += pij * H(C)
• Finally, calculate information gain for each attribute:
  – IG(C; Xi) = H(C) − H(C|Xi)
• Split on the attribute with greatest IG
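The entropy and conditional-entropy pseudocode on the "Choosing Attributes" slides can be sketched directly in Python. This is a minimal illustration, not code from the lecture; `records` and `labels` are hypothetical toy data echoing the degree/credit example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(C) = -sum_i p_i * log2(p_i), where p_i is the proportion of
    records in class i."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(records, attr, labels):
    """IG(C; X) = H(C) - H(C|X), following the slide's loop:
    H(C|X) accumulates the entropy of each value-group of X,
    weighted by the proportion of records in that group."""
    n = len(records)
    cond = 0.0
    for value in set(r[attr] for r in records):
        group = [l for r, l in zip(records, labels) if r[attr] == value]
        cond += (len(group) / n) * entropy(group)
    return entropy(labels) - cond

# Toy data (hypothetical): degree perfectly separates the two classes,
# so its information gain equals the full class entropy H(C) = 1 bit.
records = [{"degree": "MS"}, {"degree": "MS"},
           {"degree": "BS"}, {"degree": "BS"}]
labels = ["Excellent", "Excellent", "Poor", "Poor"]
```

An attribute that perfectly separates the classes drives H(C|X) to 0, so its gain equals H(C); an attribute whose value-groups keep the same class mixture as the whole set gains nothing.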
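Putting the pieces together, the recursive LearnDecisionTree outline can be sketched as a self-contained function that splits on the attribute with greatest information gain. The nested-dict tree representation, the majority-label base case, and the toy data are my own illustration, not the lecture's implementation.

```python
from collections import Counter
from math import log2

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def learn_decision_tree(records, labels, attrs):
    """Recursive outline from the slide: if no partitioning is possible,
    return a leaf (majority class label); otherwise split records into
    groups by the best attribute's values and recurse on each group.
    Returns a leaf label or a nested dict {attr: {value: subtree}}."""
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]

    def gain(a):
        # IG(C; a) = H(C) - H(C|a)
        n, cond = len(records), 0.0
        for v in set(r[a] for r in records):
            grp = [l for r, l in zip(records, labels) if r[a] == v]
            cond += (len(grp) / n) * _entropy(grp)
        return _entropy(labels) - cond

    best = max(attrs, key=gain)
    subtree = {}
    for v in set(r[best] for r in records):
        sub_r = [r for r in records if r[best] == v]
        sub_l = [l for r, l in zip(records, labels) if r[best] == v]
        subtree[v] = learn_decision_tree(
            sub_r, sub_l, [a for a in attrs if a != best])
    return {best: subtree}

# Hypothetical toy data: one attribute that cleanly splits the classes.
records = [{"degree": "MS"}, {"degree": "MS"},
           {"degree": "BS"}, {"degree": "BS"}]
labels = ["Excellent", "Excellent", "Poor", "Poor"]
tree = learn_decision_tree(records, labels, ["degree"])
```

Here the learned tree is a single switch on `degree` with a leaf per value, which is exactly the hand-coded if/then/else structure the slides describe, except induced from the data.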
Evaluating Performance
• We use the trained classifier to predict the class of new data records
• When we can obtain the actual class, we can compare it to our prediction to see how we did
• This is evaluation of the accuracy of the classifier
• Based on a contingency table:

             Predicted C=0   Predicted C=1
Actual C=0         A               B
Actual C=1         C               D

• Accuracy = (A+D) / (A+B+C+D)

Regression
• The goal of regression is to predict the value of a numerical variable given one or more numeric, ordinal, or categorical variables
• Definitions:
  – Y = numerical variable/response/dependent variable
  – X1, …, Xk = independent variables/features/attributes
  – F: X1 × X2 × … × Xk → Y is the regression function
• Given training data with features X1…Xk and known response Y, find the best F

Linear Regression
• In linear regression, the function F is a linear function of variables X1…Xk
  – Y = F(X1…Xk) = β0 + β1X1 + β2X2 + … + βkXk
• This time, learn the coefficients β0…βk from data
  – We want to find coefficients that minimize residual errors Y − (β0 + β1X1 + β2X2 + … + βkXk)
• Define total error over the n training records as
  err = Σi=1..n (Yi − (β0 + β1Xi1 + … + βkXik))²
  and use calculus to find the values of βs that minimize total error

Association Rules
• Find values of a field that frequently go together when linked to some other field
• "Market basket analysis":
  – A shopping cart contains several items
  – Who is making the most purchases of an item?
  – What items do customers tend to buy together?
  – Is there an order in which purchases are made?

Association Rules
• Look for the following patterns:
  – Co-occurrences
    • A% of customers purchase items X, Y, Z together
  – Associations
    • B% of customers who purchase X and Y also purchase Z
  – Sequential patterns
    • C% of customers who buy X will purchase Y within 3 weeks

Confidence and Support
• The confidence in a pattern and the support for that pattern are two measures of how interesting it is
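The Accuracy = (A+D)/(A+B+C+D) formula from the Evaluating Performance slide can be computed by filling the 2x2 contingency table from paired actual/predicted labels. The helper name is my own; this assumes binary 0/1 class labels as in the slide's table.

```python
def accuracy(actual, predicted):
    """Fill the slide's 2x2 contingency table and return
    Accuracy = (A + D) / (A + B + C + D): the fraction of records
    whose predicted class matches the actual class."""
    a = sum(1 for y, p in zip(actual, predicted) if y == 0 and p == 0)
    b = sum(1 for y, p in zip(actual, predicted) if y == 0 and p == 1)
    c = sum(1 for y, p in zip(actual, predicted) if y == 1 and p == 0)
    d = sum(1 for y, p in zip(actual, predicted) if y == 1 and p == 1)
    return (a + d) / (a + b + c + d)
```

A and D count the correct predictions (the table's diagonal), so accuracy is simply their share of all records.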
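For the single-feature case, "use calculus" on the Linear Regression slide means setting the derivatives of err with respect to β0 and β1 to zero, which yields a closed-form solution. This sketch assumes one X variable rather than the general X1…Xk case, and the function name is my own.

```python
def fit_line(xs, ys):
    """Least-squares fit of Y = b0 + b1*X: setting d(err)/d(b0) and
    d(err)/d(b1) to zero, where err = sum (y - (b0 + b1*x))^2, gives
    b1 = cov(x, y) / var(x) and b0 = mean(y) - b1 * mean(x)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1
```

With k features the same idea leads to a k+1-dimensional linear system (the normal equations) rather than two scalar formulas.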
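The excerpt ends before the Confidence and Support slide defines its two measures, so the following sketch uses the standard definitions (not necessarily the slide's exact wording): support of an itemset is the fraction of baskets containing it, and confidence of a rule lhs → rhs is the support of the combined itemset divided by the support of lhs. The `baskets` data is hypothetical.

```python
def support(baskets, itemset):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for b in baskets if itemset <= set(b)) / len(baskets)

def confidence(baskets, lhs, rhs):
    """Of the baskets containing lhs, the fraction that also contain
    rhs: confidence(lhs -> rhs) = support(lhs | rhs) / support(lhs)."""
    return support(baskets, set(lhs) | set(rhs)) / support(baskets, lhs)

# Hypothetical market baskets: "B% of customers who purchase X and Y
# also purchase Z" is confidence({X, Y} -> {Z}).
baskets = [{"X", "Y", "Z"}, {"X", "Y"}, {"X"}, {"Z"}]
```

Rules are typically kept only when both support (the pattern occurs often enough to matter) and confidence (the implication is reliable) clear a threshold.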