A Data Mining Course for Computer Science: Primary Sources and Implementations
Dave Musicant
Saturday, March 4, 2006

Slide outline
- Overview
- What is data mining?
- Why offer a course in data mining?
- Why research papers? Can it be done?
- Topics, Papers, Assignments
- Topic 0: What is Data Mining?
- Topic 1: Classification and Regression
- Technique: Nearest Neighbor
- Topic 2: Clustering
- Topic 3: Association Rules
- Topic 4: Web Mining
- Topic 5: Collaborative Filtering
- Topic 6: Ethical Issues in Data Mining
- Final Project
- Conclusions

Overview
- What is data mining?
- Why offer a course in data mining?
- Why focus on research papers in an undergraduate class?
- What topics do I cover?
- What research papers do I use in class?
- What assignments do I use?
- Does it work?

What is data mining?
- “The non-trivial discovery of novel, valid, comprehensible and potentially useful patterns from data” (Fayyad et al.)
- Data Mining and Machine Learning are two sides of the same coin
  - Data mining focuses more on larger datasets
  - Machine learning focuses more on connections with artificial intelligence
  - ... but there is much overlap in the two areas.
- My course is titled “Machine Learning and Data Mining”
  - boosts student enthusiasm

Why offer a course in data mining?
- Interesting applied area of CS that uses theoretical techniques
- Reinforces and introduces data structures and algorithms
  - heaps, R-trees, graphs
- Privacy and ethics
- Personal ownership in assignments
  - Students choose datasets in areas that interest them
- New field, yet accessible
  - Can be done with only Data Structures as a prereq
- It’s my research area

Why research papers?
Can it be done?
- One approach to course is to use data mining software
  - Lopez & Ludwig, University of Minnesota-Morris
- I wanted students to implement data mining algorithms
- Textbook support w/ computer science focus is limited
  - (I use Margaret Dunham’s text as a side reference)
- Primary sources provide a rich experience
- With proper selection, papers are accessible to undergraduates
- Papers must be supplemented in classroom
  - e.g. specific topics in linear algebra, statistics
  - directs classroom activity toward filling gaps and interpreting papers instead of parroting reading

Topics, Papers, Assignments
- Each topic consists of one or more papers that are assigned to the students to read before class discussion.
- Students post to Caucus (electronic message board):
  - something they didn’t understand, or something they found interesting
  - potential exam question
- Assignment follows class discussion
- Detailed references for all papers and datasets can be found in paper

Topic 0: What is Data Mining?
- Paper: J. Friedman. “Data Mining and Statistics: What’s the Connection?”
  - Entertaining and controversial
  - Pokes fun at flaws on all sides
  - Helps to ensure buy-in from computer science students (they haven’t been tricked into taking a stats course)
- Assignment: For the “census-income” dataset, determine:
  - Number of records and features
  - How many features are continuous, how many are nominal
  - For continuous features: average, median, minimum, maximum, standard deviation
  - 2-dimensional scatter plots of two features at a time
  - Interesting patterns

Topic 1: Classification and Regression
- Example: First Trimester Screening
- Use this training set to learn how to classify patients where diagnosis is not known:
- The input data is often easily obtained, whereas the classification is not.
[Figure labels: Input Data, Classification, Training Set, Testing Set]

Technique: Nearest Neighbor
- Envision each example as a point in n-dimensional space
- Classify test point same as nearest training point
[Figure label: What am I?]

Topic 1: Classification and Regression
- Focus on scalable nearest neighbor
  algorithms
- Paper: Roussopoulos et al., “Nearest Neighbor Queries”
  - How to do NN efficiently when data doesn’t fit in core
  - Requires R-trees (I cover in class)
- Assignment: Code up the traditional k-nearest neighbor algorithm, apply to census-income data
  - Experiment with different distance metrics (1-norm, 2-norm, cosine)
  - Experiment with different values of k
  - Produce plots showing training and test set accuracies
  - Interpret results

Topic 2: Clustering
- Sometimes referred to as unsupervised learning
- Goal: find clusters of similar data
- Less accurate than supervised learning, but quite useful when no training set is available
- Where are the clusters below? How many are there?
[Figure axis labels: chemical 1 / tissue (cm); chemical 2 / tissue (cm)]

Topic 2: Clustering
- Assignment: Find dataset of interest from UCI Repository
  - iris plant, letter recognition, liver disorders, Pima Indians diabetes, Congressional voting records, wine recognition, zoo
  - this dataset is used for most remaining assignments
  - if dataset has a class label, discard it for this assignment
- Implement basic clustering algorithm (k-means)
  - Try varying number of clusters
  - Try two different techniques for initializing clusters
  - Report and interpret results found

Topic 2: Clustering
- Paper: Bradley et al., “Scaling Clustering Algorithms to Large Databases”
  - Describes “Scalable K-means” algorithm
  - Class discussion around “data mining desiderata”
- Paper: Guha et al., “CURE: An Efficient Clustering Algorithm for Large Databases”
  - Agglomerative clustering algorithm
  - completely different approach
  - Requires use of a heap (as I pose the assignment)
- Assignment: Implement stripped-down version of CURE
  - Run on dataset, interpret results

Topic 3: Association Rules
- “Supermarket basket analysis”
  - What items do people tend to buy together at the same time?
- Paper: Agrawal et al., “Fast Algorithms for Mining Association Rules”
  - presents classic Apriori algorithm (skim other portions of paper)
- Assignment: Implement Apriori algorithm and apply it to own dataset

Topic 4: Web Mining
- How does Google rank importance of web
  pages?
- Every page has a PageRank
- PageRank of a page is determined by the PageRank of the pages that link to it
  - manifests itself as an eigenvalue problem
- Paper: Page et al., “The PageRank Citation Ranking: Bringing Order to the Web”
  - describes basic version of Google PageRank algorithm
  - cover eigenvalues in class
  - exposure to linear algebra, numerical analysis

Topic 4: Web Mining
- Paper: Chakrabarti et al., “Mining the Link Structure of the World Wide Web”
  - describes HITS algorithm for ranking
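
The PageRank idea from the Topic 4 slides (a page's rank is determined by the ranks of the pages that link to it) can be illustrated with a small power iteration, which approximates the dominant eigenvector of the link matrix. This is only a minimal sketch under stated assumptions, not the course's assignment code: the names `pagerank`, `links`, `damping`, `tol`, and `max_iter` are hypothetical, the damping value 0.85 is a commonly used default chosen here for illustration, and spreading the rank of pages without outgoing links uniformly is one common convention among several.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=1000):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to
    (a hypothetical input format chosen for this sketch).
    """
    # Collect every page that appears as a source or a target.
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution

    for _ in range(max_iter):
        # Assumption: rank held by pages with no outgoing links
        # is redistributed uniformly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}

        # Each page passes an equal share of its rank to every page it links to.
        for p in pages:
            outs = links.get(p, [])
            for q in outs:
                new[q] += damping * rank[p] / len(outs)

        change = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if change < tol:  # stop once the ranks have stabilized
            break
    return rank


if __name__ == "__main__":
    # Tiny illustrative graph (made up for this example).
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 4))
```

The ranks form a probability distribution, so they sum to 1. In the example graph, page C receives links from both A and B and therefore ends up ranked above B.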