A Data Mining Course for Computer Science

Primary Sources and Implementations
Dave Musicant
Saturday, March 4, 2006

Slide titles:
- Overview
- What is data mining?
- Why offer a course in data mining?
- Why research papers? Can it be done?
- Topics, Papers, Assignments
- Topic 0: What is Data Mining?
- Topic 1: Classification and Regression
- Technique: Nearest Neighbor
- Topic 2: Clustering
- Topic 3: Association Rules
- Topic 4: Web Mining
- Topic 5: Collaborative Filtering
- Topic 6: Ethical Issues in Data Mining
- Final Project
- Conclusions

Overview
- What is data mining?
- Why offer a course in data mining?
- Why focus on research papers in an undergraduate class?
- What topics do I cover?
- What research papers do I use in class?
- What assignments do I use?
- Does it work?

What is data mining?
- “The non-trivial discovery of novel, valid, comprehensible and potentially useful patterns from data” (Fayyad et al.)
- Data Mining and Machine Learning are two sides of the same coin
  - Data mining focuses more on larger datasets
  - Machine learning focuses more on connections with artificial intelligence
  - ... but there is much overlap in the two areas.
- My course is titled “Machine Learning and Data Mining”
  - boosts student enthusiasm

Why offer a course in data mining?
- Interesting applied area of CS that uses theoretical techniques
- Reinforces and introduces data structures and algorithms
  - heaps, R-trees, graphs
- Privacy and ethics
- Personal ownership in assignments
  - Students choose datasets in areas that interest them
- New field, yet accessible
  - Can be done with only Data Structures as a prereq
- It’s my research area

Why research papers? Can it be done?
- One approach to the course is to use data mining software
  - Lopez & Ludwig, University of Minnesota-Morris
- I wanted students to implement data mining algorithms
- Textbook support with a computer science focus is limited
  - (I use Margaret Dunham’s text as a side reference)
- Primary sources provide a rich experience
- With proper selection, papers are accessible to undergraduates
- Papers must be supplemented in the classroom
  - e.g. specific topics in linear algebra, statistics
  - directs classroom activity toward filling gaps and interpreting papers instead of parroting the reading

Topics, Papers, Assignments
- Each topic consists of one or more papers that are assigned to the students to read before class discussion.
- Students post to Caucus (electronic message board):
  - something they didn’t understand, or something they found interesting
  - a potential exam question
- Assignment follows class discussion
- Detailed references for all papers and datasets can be found in the paper

Topic 0: What is Data Mining?
- Paper: J. Friedman, “Data Mining and Statistics: What’s the Connection?”
  - Entertaining and controversial
  - Pokes fun at flaws on all sides
  - Helps to ensure buy-in from computer science students (they haven’t been tricked into taking a stats course)
- Assignment: For the “census-income” dataset, determine:
  - Number of records and features
  - How many features are continuous, how many are nominal
  - For continuous features: average, median, minimum, maximum, standard deviation
  - 2-dimensional scatter plots of two features at a time
  - Interesting patterns

Topic 1: Classification and Regression
- Example: First Trimester Screening
- Use this training set to learn how to classify patients where diagnosis is not known
- The input data is often easily obtained, whereas the classification is not.
[Figure labels: Input Data, Classification, Training Set, Testing Set]

Technique: Nearest Neighbor
- Envision each example as a point in n-dimensional space
- Classify a test point the same as the nearest training point
[Figure: unlabeled test point marked “What am I?”]

Topic 1: Classification and Regression
- Focus on scalable nearest neighbor algorithms
- Paper: Roussopoulos et al., “Nearest Neighbor Queries”
  - How to do NN efficiently when data doesn’t fit in core
  - Requires R-trees (I cover in class)
- Assignment: Code up the traditional k-nearest neighbor algorithm, apply to census-income data
  - Experiment with different distance metrics (1-norm, 2-norm, cosine)
  - Experiment with different values of k
  - Produce plots showing training and test set accuracies
  - Interpret results

Topic 2: Clustering
- Sometimes referred to as unsupervised learning
- Goal: find clusters of similar data
- Less accurate than supervised learning, but quite useful when no training set is available
- Where are the clusters below? How many are there?
[Figure: two scatter plots; axis labels: chemical 1, chemical 2, tissue (cm)]

Topic 2: Clustering
- Assignment: Find a dataset of interest from the UCI Repository
  - iris plant, letter recognition, liver disorders, Pima Indians diabetes, Congressional voting records, wine recognition, zoo
  - this dataset is used for most remaining assignments
  - if the dataset has a class label, discard it for this assignment
- Implement a basic clustering algorithm (k-means)
  - Try varying the number of clusters
  - Try two different techniques for initializing clusters
  - Report and interpret results found

Topic 2: Clustering
- Paper: Bradley et al., “Scaling Clustering Algorithms to Large Databases”
  - Describes “Scalable K-means” algorithm
  - Class discussion around “data mining desiderata”
- Paper: Guha et al., “CURE: An Efficient Clustering Algorithm for Large Databases”
  - Agglomerative clustering algorithm
  - completely different approach
  - Requires use of a heap (as I pose the assignment)
- Assignment: Implement a stripped-down version of CURE
  - Run on dataset, interpret results

Topic 3: Association Rules
- “Supermarket basket analysis”
- What items do people tend to buy together at the same time?
- Paper: Agrawal et al., “Fast Algorithms for Mining Association Rules”
  - presents classic Apriori algorithm (skim other portions of paper)
- Assignment: Implement the Apriori algorithm and run it on own dataset

Topic 4: Web Mining
- How does Google rank importance of web pages?
- Every page has a PageRank
- PageRank of a page is determined by the PageRank of the pages that link to it
  - manifests itself as an eigenvalue problem
- Paper: Page et al., “The PageRank Citation Ranking: Bringing Order to the Web”
  - describes basic version of Google PageRank algorithm
  - cover eigenvalues in class
  - exposure to linear algebra, numerical analysis

Topic 4: Web Mining
- Paper: Chakrabarti et al., “Mining the Link Structure of the World Wide Web”
  - describes HITS algorithm for ranking
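
The Topic 1 assignment (traditional k-nearest neighbor with 1-norm, 2-norm, and cosine distance, tried at different values of k) might look roughly like the sketch below. This is only an illustration, not the course's reference solution: the function names, the majority-vote tie handling, and the choice to treat a zero vector as maximally dissimilar under cosine distance are all assumptions made here.

```python
import math
from collections import Counter


def dist_1norm(a, b):
    """Manhattan (1-norm) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dist_2norm(a, b):
    """Euclidean (2-norm) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dist_cosine(a, b):
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: a zero vector is maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def knn_classify(train, query, k=3, dist=dist_2norm):
    """Label `query` by majority vote among its k nearest training examples.

    `train` is a list of (feature_vector, label) pairs.
    """
    neighbors = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3, dist=dist_2norm):
    """Fraction of examples in `test` that are classified correctly."""
    correct = sum(1 for x, y in test if knn_classify(train, x, k, dist) == y)
    return correct / len(test)


# Tiny made-up dataset, for illustration only.
toy_train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]
toy_label = knn_classify(toy_train, (1.1, 1.0), k=3, dist=dist_1norm)
```

Passing the metric as a function keeps the experiment loop short: calling `accuracy` for each combination of `k` and `dist` on both the training and test sets gives the numbers needed for the accuracy plots the assignment asks for.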
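
For Topic 3, the frequent-itemset part of Apriori can be sketched as below. This is a simplified illustration rather than the algorithm exactly as presented by Agrawal et al.: it recounts support with a plain scan of all transactions at each level, takes `min_support` as an absolute count, and stops at frequent itemsets without generating rules. The basket data is made up.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets.

    `min_support` is an absolute count (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in current
                   for sub in combinations(union, k - 1)):
                candidates.add(union)

        # Count support for the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1

    return frequent


# Made-up supermarket baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
frequent_sets = apriori(baskets, min_support=3)
```

With these baskets and a minimum support of 3, the only frequent pair is bread and milk (bought together in three baskets); every other pair appears in just two baskets and is discarded.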
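
The Topic 4 statement that a page's PageRank is determined by the PageRank of the pages linking to it can be illustrated with power iteration, one standard way to approximate the dominant eigenvector. The sketch below is a minimal version under assumptions not stated in the slides: a damping factor of 0.85, rank from pages with no outgoing links spread evenly over all pages, and a hypothetical four-page link graph.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Approximate PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Returns a dict mapping page -> rank, with ranks summing to 1.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(max_iter):
        # Rank held by pages without outgoing links is shared by everyone.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in pages}

        # Each page passes an equal share of its rank along every link.
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                new_rank[q] += damping * rank[p] / len(targets)

        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break

    return rank


# Hypothetical link graph, for illustration only.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
toy_ranks = pagerank(toy_web)
```

In this toy graph, page C is linked to by three pages and ends up with the highest rank, while page D has no incoming links and keeps only the baseline share of (1 - damping) / n.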

