
Data Mining
CS 341, Spring 2007
Lecture 7: Distance-based Algorithms & Decision Tree-based Algorithms
© Prentice Hall

Review
- Classification issues
- Techniques
  - Regression
    - Linear regression
    - Logistic regression
  - Bayesian classification
    - Naïve Bayesian classification
    - Six steps
Think of some classification applications for which you might want to use regression or Bayesian classification.

Classification Outline
Goal: Provide an overview of the classification problem and introduce some of the basic algorithms.
- Classification problem overview
- Classification techniques
  - Regression
  - Bayesian classification
  - Distance
  - Decision trees

Classification Using Distance
- Place items in the class to which they are "closest".
- An item is more similar to the other items in its own class than it is to the items in other classes.
- Must determine the distance (or similarity) between an item and a class.
- An example in IR:
  - An IR query serves as the definition of a class.
  - Determine the similarity between each tuple (or document) and the query.

Classification Using Distance: Simple Approach
- Given a database D = {t1, t2, ..., tn} of tuples, where each tuple ti is a vector of numeric values,
- and a set of classes C = {C1, C2, ..., Cm}, where each class Cj is a vector of numeric values,
- assign each ti to the class Cj such that sim(ti, Cj) ≥ sim(ti, Ck) for any Ck where Cj ≠ Ck.

Classification Using Distance: Simple Approach (cont.)
- Representing each class:
  - Centroid: the center or central value
  - Predefined pattern
  - Representative points
- Similarity measures:
  - Dice
  - Jaccard
  - Cosine
  - Overlap

K Nearest Neighbors (KNN)
- The training set includes the data and their desired classes.
- Examine the K items nearest to the item to be classified.
- The new item is placed in the class containing the largest number of these close items.
- Complexity: O(q) for each tuple to be classified, where q is the size of the training set.

KNN
[figure: nearest neighbors of a query point]

KNN Algorithm
Explain how this algorithm works.

Example
Using the sample data from Table 4.1 and the Output1 classification as the training-set output value, we classify the tuple (Pat, F, 1.6) using KNN where K = 5.

Height Example Data

Name       Gender  Height  Output1  Output2
Kristina   F       1.6m    Short    Medium
Jim        M       2m      Tall     Medium
Maggie     F       1.9m    Medium   Tall
Martha     F       1.88m   Medium   Tall
Stephanie  F       1.7m    Short    Medium
Bob        M       1.85m   Medium   Medium
Kathy      F       1.6m    Short    Medium
Dave       M       1.7m    Short    Medium
Worth      M       2.2m    Tall     Tall
Steven     M       2.1m    Tall     Tall
Debbie     F       1.8m    Medium   Medium
Todd       M       1.95m   Medium   Medium
Kim        F       1.9m    Medium   Tall
Amy        F       1.8m    Medium   Medium
Wynette    F       1.75m   Medium   Medium

KNN: Pros and Cons
- Simple: no definition of classes is required in advance.
- Sensitive to the value of K.
  - Rule of thumb: let K² ≤ the number of training items.
  - Commercial algorithms use a default value of 10.
Give an example to show how the value of K may affect the classification.

Classification Outline
Goal: Provide an overview of the classification problem and introduce some of the basic algorithms.
- Classification problem overview
- Classification techniques
  - Regression
  - Distance
  - Decision trees
  - Rules
  - Neural networks

Classification Using Decision Trees
- Partitioning based: divide the search space into rectangular regions.
- A tuple is placed into a class based on the region within which it falls.
- DT approaches differ in how the tree is built: DT induction.
- Internal nodes are associated with attributes, and arcs with values of those attributes.
- Algorithms: ID3, C4.5, CART

Decision Tree
Given:
- D = {t1, ..., tn} where ti = <ti1, ..., tih>
- Database schema
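As an illustrative sketch of the simple distance-based approach above, the following assigns each tuple to the class whose representative vector (centroid) it is most similar to, using cosine similarity as the measure. The centroids and tuples here are made-up values for illustration, not data from the slides.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def classify(tuples, centroids):
    """Assign each tuple ti to the class Cj maximizing sim(ti, Cj)."""
    assignments = {}
    for name, t in tuples.items():
        assignments[name] = max(centroids, key=lambda c: cosine_sim(t, centroids[c]))
    return assignments

# Hypothetical class centroids and tuples, chosen for illustration only.
centroids = {"C1": [1.0, 0.0], "C2": [0.0, 1.0]}
tuples = {"t1": [0.9, 0.2], "t2": [0.1, 0.8]}
print(classify(tuples, centroids))  # t1 -> C1, t2 -> C2
```

Any of the other similarity measures on the slide (Dice, Jaccard, Overlap) could be substituted for `cosine_sim` without changing the assignment rule.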

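The KNN example above, classifying the tuple (Pat, F, 1.6) with K = 5 against the height data, can be worked through in a short sketch. Following the textbook example it measures distance on the height attribute alone; ignoring the gender attribute is an assumption of this sketch.

```python
from collections import Counter

# Height example data (Table 4.1): (name, gender, height in m, Output1)
training = [
    ("Kristina", "F", 1.6, "Short"), ("Jim", "M", 2.0, "Tall"),
    ("Maggie", "F", 1.9, "Medium"), ("Martha", "F", 1.88, "Medium"),
    ("Stephanie", "F", 1.7, "Short"), ("Bob", "M", 1.85, "Medium"),
    ("Kathy", "F", 1.6, "Short"), ("Dave", "M", 1.7, "Short"),
    ("Worth", "M", 2.2, "Tall"), ("Steven", "M", 2.1, "Tall"),
    ("Debbie", "F", 1.8, "Medium"), ("Todd", "M", 1.95, "Medium"),
    ("Kim", "F", 1.9, "Medium"), ("Amy", "F", 1.8, "Medium"),
    ("Wynette", "F", 1.75, "Medium"),
]

def knn_classify(height, k=5):
    """Take the k training items nearest in height; return the majority class."""
    nearest = sorted(training, key=lambda row: abs(row[2] - height))[:k]
    votes = Counter(row[3] for row in nearest)
    return votes.most_common(1)[0][0]

# Classify the tuple (Pat, F, 1.6) with K = 5.
# The 5 nearest are Kristina, Kathy, Stephanie, Dave (all Short) and
# Wynette (Medium), so the majority vote is Short.
print(knn_classify(1.6))  # Short
```

Note the O(q) cost per query is visible here: every classification scans the whole training set of q items.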

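To preview DT induction: a decision tree's internal nodes test attributes and its arcs carry values of those attributes. A hand-built tree over the Height attribute that reproduces the Output1 labels might look like the sketch below; the split points 1.7m and 1.95m are inferred from the table, not stated in the slides, and an induction algorithm such as ID3 or C4.5 would learn such splits from the data.

```python
# A tiny decision tree: the node names the attribute it tests, and each
# arc (predicate, checked in order) leads to a class label.
tree = {
    "attribute": "Height",
    "arcs": [
        (lambda h: h <= 1.7, "Short"),    # arc: Height <= 1.7m
        (lambda h: h <= 1.95, "Medium"),  # arc: 1.7m < Height <= 1.95m
        (lambda h: True, "Tall"),         # arc: Height > 1.95m
    ],
}

def tree_classify(tree, value):
    """Walk the arcs of the node; return the label of the first matching arc."""
    for predicate, label in tree["arcs"]:
        if predicate(value):
            return label

for h in (1.6, 1.88, 2.1):
    print(h, tree_classify(tree, h))  # Short, Medium, Tall
```

Each arc corresponds to one of the rectangular regions into which the tree partitions the search space.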
Mt Holyoke CS 341 - Data Mining
