5/12/10

Data Mining (Knowledge Discovery)
CISC437/637, Lecture
Ben Carterette
Copyright © Ben Carterette

Introduction to Data Mining
• Data mining is the automatic exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and understandable patterns in data
  – Valid: the patterns hold in general
  – Novel: something we didn't know before
  – Useful: we can use the patterns to accomplish something
  – Understandable: we can comprehend why the patterns hold
• Example: if degree=MS and income>75000, then credit=Excellent (with 82% probability)

Two Classes of Patterns
• Predictions about new data items based on knowledge about existing data items
  – Classification algorithms predict categorical variables
  – Regression models predict numeric variables
• Associations between two or more field values across records
  – Clustering discovers groupings of records according to field values

Variable Types
• Variable type depends on domain:
  – Numerical: domain is numeric (integer, real, etc.)
  – Nominal or categorical: domain is a finite set with no natural ordering (e.g. occupation, gender, etc.)
  – Ordinal: domain is ordered (e.g. preference scales, severity of injury, etc.)
• The types of variables determine what kinds of data mining techniques might be used

Classification
• The goal of classification is to predict the value of a categorical variable given one or more numeric, ordinal, or categorical variables
• Definitions:
  – C = categorical variable/class labels
  – X1, …, Xk = variables/features/attributes
  – F: X1 × X2 × … × Xk → C is the classification function
• Given training data with features X1…Xk and known class C, find the best F

Classification
• We do not expect that the classification will be exact
  – Errors will occur; our goal is to minimize them
• The function F should:
  – Be able to classify items with high accuracy
  – Produce information (the class) that we don't have and that is useful for something
  – Be interpretable by those using it (i.e. it should make sense)

Decision Trees
• Decision trees are a popular algorithm for classification
• A decision tree is basically a hierarchy of if/then/else (or switch/case) statements
  – Each node corresponds to a switch/case on a variable
  – The case determines which node to go to next
  – The leaves of the tree correspond to a predicted class label for the record
• But instead of being hand-coded, the cases are automatically learned from data

Learning Decision Trees
• Start with a training set: data records from which we can compute attributes X1,…,Xk and class label C
• Recursive algorithm outline: LearnDecisionTree(R)
  – Choose a partitioning attribute Xi
    • If no partitioning is possible, return
  – Split records in R into groups R1, R2, …, Rj by values of Xi
  – For each group Ri,
    • LearnDecisionTree(Ri)

Choosing Attributes
• The best attribute is the one that most cleanly separates records according to their class
• Calculate proportions of records remaining in each class
  – pi = (# records in class i) / (total # of records)
• Use the idea of entropy from information theory
  – Entropy = H(Y) = number of bits needed to encode Y
• Entropy of class: H(C) = −Σi pi log2 pi

Choosing Attributes
• Calculate conditional entropy of C given Xi (denoted H(C|Xi)) as follows:
  – For each attribute Xi,
    • Let H(C|Xi) = 0
    • For each unique value xj of Xi,
      – Calculate H(C) for records with Xi=xj
      – Calculate pij = proportion of records with Xi=xj
      – H(C|Xi) += pij * H(C)
• Finally, calculate information gain for each attribute:
  – IG(C; Xi) = H(C) − H(C|Xi)
• Split on the attribute with greatest IG
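The entropy and conditional-entropy pseudocode on the "Choosing Attributes" slides can be sketched directly in Python. This is a minimal illustration, not code from the lecture; `records` and `labels` are hypothetical toy data echoing the degree/credit example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(C) = -sum_i p_i * log2(p_i), where p_i is the proportion of
    records in class i."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(records, attr, labels):
    """IG(C; X) = H(C) - H(C|X), following the slide's loop:
    H(C|X) accumulates the entropy of each value-group of X,
    weighted by the proportion of records in that group."""
    n = len(records)
    cond = 0.0
    for value in set(r[attr] for r in records):
        group = [l for r, l in zip(records, labels) if r[attr] == value]
        cond += (len(group) / n) * entropy(group)
    return entropy(labels) - cond

# Toy data (hypothetical): degree perfectly separates the two classes,
# so its information gain equals the full class entropy H(C) = 1 bit.
records = [{"degree": "MS"}, {"degree": "MS"},
           {"degree": "BS"}, {"degree": "BS"}]
labels = ["Excellent", "Excellent", "Poor", "Poor"]
```

An attribute that perfectly separates the classes drives H(C|X) to 0, so its gain equals H(C); an attribute whose value-groups keep the same class mixture as the whole set gains nothing.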
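Putting the pieces together, the recursive LearnDecisionTree outline can be sketched as a self-contained function that splits on the attribute with greatest information gain. The nested-dict tree representation, the majority-label base case, and the toy data are my own illustration, not the lecture's implementation.

```python
from collections import Counter
from math import log2

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def learn_decision_tree(records, labels, attrs):
    """Recursive outline from the slide: if no partitioning is possible,
    return a leaf (majority class label); otherwise split records into
    groups by the best attribute's values and recurse on each group.
    Returns a leaf label or a nested dict {attr: {value: subtree}}."""
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]

    def gain(a):
        # IG(C; a) = H(C) - H(C|a)
        n, cond = len(records), 0.0
        for v in set(r[a] for r in records):
            grp = [l for r, l in zip(records, labels) if r[a] == v]
            cond += (len(grp) / n) * _entropy(grp)
        return _entropy(labels) - cond

    best = max(attrs, key=gain)
    subtree = {}
    for v in set(r[best] for r in records):
        sub_r = [r for r in records if r[best] == v]
        sub_l = [l for r, l in zip(records, labels) if r[best] == v]
        subtree[v] = learn_decision_tree(
            sub_r, sub_l, [a for a in attrs if a != best])
    return {best: subtree}

# Hypothetical toy data: one attribute that cleanly splits the classes.
records = [{"degree": "MS"}, {"degree": "MS"},
           {"degree": "BS"}, {"degree": "BS"}]
labels = ["Excellent", "Excellent", "Poor", "Poor"]
tree = learn_decision_tree(records, labels, ["degree"])
```

Here the learned tree is a single switch on `degree` with a leaf per value, which is exactly the hand-coded if/then/else structure the slides describe, except induced from the data.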
Evaluating Performance
• We use the trained classifier to predict the class of new data records
• When we can obtain the actual class, we can compare it to our prediction to see how we did
• This is evaluation of the accuracy of the classifier
• Based on a contingency table:

             Predicted C=0   Predicted C=1
Actual C=0         A               B
Actual C=1         C               D

• Accuracy = (A+D) / (A+B+C+D)

Regression
• The goal of regression is to predict the value of a numerical variable given one or more numeric, ordinal, or categorical variables
• Definitions:
  – Y = numerical variable/response/dependent variable
  – X1, …, Xk = independent variables/features/attributes
  – F: X1 × X2 × … × Xk → Y is the regression function
• Given training data with features X1…Xk and known response Y, find the best F

Linear Regression
• In linear regression, the function F is a linear function of variables X1…Xk
  – Y = F(X1…Xk) = β0 + β1X1 + β2X2 + … + βkXk
• This time, learn the coefficients β0…βk from data
  – We want to find coefficients that minimize residual errors Y − (β0 + β1X1 + β2X2 + … + βkXk)
• Define total error over the n training records as
  err = Σi=1..n (Yi − (β0 + β1Xi1 + … + βkXik))²
  and use calculus to find the values of βs that minimize total error

Association Rules
• Find values of a field that frequently go together when linked to some other field
• "Market basket analysis":
  – A shopping cart contains several items
  – Who is making the most purchases of an item?
  – What items do customers tend to buy together?
  – Is there an order in which purchases are made?

Association Rules
• Look for the following patterns:
  – Co-occurrences
    • A% of customers purchase items X, Y, Z together
  – Associations
    • B% of customers who purchase X and Y also purchase Z
  – Sequential patterns
    • C% of customers who buy X will purchase Y within 3 weeks

Confidence and Support
• The confidence in a pattern and the support for that pattern are two measures of how interesting it is
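The Accuracy = (A+D)/(A+B+C+D) formula from the Evaluating Performance slide can be computed by filling the 2x2 contingency table from paired actual/predicted labels. The helper name is my own; this assumes binary 0/1 class labels as in the slide's table.

```python
def accuracy(actual, predicted):
    """Fill the slide's 2x2 contingency table and return
    Accuracy = (A + D) / (A + B + C + D): the fraction of records
    whose predicted class matches the actual class."""
    a = sum(1 for y, p in zip(actual, predicted) if y == 0 and p == 0)
    b = sum(1 for y, p in zip(actual, predicted) if y == 0 and p == 1)
    c = sum(1 for y, p in zip(actual, predicted) if y == 1 and p == 0)
    d = sum(1 for y, p in zip(actual, predicted) if y == 1 and p == 1)
    return (a + d) / (a + b + c + d)
```

A and D count the correct predictions (the table's diagonal), so accuracy is simply their share of all records.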
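For the single-feature case, "use calculus" on the Linear Regression slide means setting the derivatives of err with respect to β0 and β1 to zero, which yields a closed-form solution. This sketch assumes one X variable rather than the general X1…Xk case, and the function name is my own.

```python
def fit_line(xs, ys):
    """Least-squares fit of Y = b0 + b1*X: setting d(err)/d(b0) and
    d(err)/d(b1) to zero, where err = sum (y - (b0 + b1*x))^2, gives
    b1 = cov(x, y) / var(x) and b0 = mean(y) - b1 * mean(x)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1
```

With k features the same idea leads to a k+1-dimensional linear system (the normal equations) rather than two scalar formulas.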
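The excerpt ends before the Confidence and Support slide defines its two measures, so the following sketch uses the standard definitions (not necessarily the slide's exact wording): support of an itemset is the fraction of baskets containing it, and confidence of a rule lhs → rhs is the support of the combined itemset divided by the support of lhs. The `baskets` data is hypothetical.

```python
def support(baskets, itemset):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for b in baskets if itemset <= set(b)) / len(baskets)

def confidence(baskets, lhs, rhs):
    """Of the baskets containing lhs, the fraction that also contain
    rhs: confidence(lhs -> rhs) = support(lhs | rhs) / support(lhs)."""
    return support(baskets, set(lhs) | set(rhs)) / support(baskets, lhs)

# Hypothetical market baskets: "B% of customers who purchase X and Y
# also purchase Z" is confidence({X, Y} -> {Z}).
baskets = [{"X", "Y", "Z"}, {"X", "Y"}, {"X"}, {"Z"}]
```

Rules are typically kept only when both support (the pattern occurs often enough to matter) and confidence (the implication is reliable) clear a threshold.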