BU CS 565 - Course logistics


CAS CS 565, Data Mining

Slide titles:
• Course logistics
• Topics to be covered (tentative)
• Course workload
• Textbooks
• Prerequisites
• Above all
• Introduction to data mining
• Why do we need data analysis
• The data is also very complex
• Example: transaction data
• Example: document data
• Example: network data
• Example: genomic sequences
• Example: environmental data
• We have large datasets…so what?
• What can data-mining methods do?
• What can data-mining methods do?
• Goal of this course
• Data mining and related areas
• Data mining vs machine learning
• Data mining vs. statistics
• Data mining and databases
• Data mining and algorithms
• Some simple data-analysis tasks
• Finding the majority element
• Finding the majority element (solution)
• Finding the majority element (solution and correctness proof)
• Finding a number in the top half
• Finding a number in the top half efficiently

Course logistics
• Course webpage:
  – http://www.cs.bu.edu/~evimaria/cs565-12.html
• Schedule: Mon – Wed, 4:00-5:30
• Instructor: Evimaria Terzi, [email protected]
• Office hours: Tues 9:00am-10:30pm, Wed 1:00pm-2:30pm (or by appointment)
• Mailing list: [email protected]

Topics to be covered (tentative)
• Introduction to data mining and prototype problems
• Frequent pattern mining
  – Frequent itemsets and association rules
• Clustering
• Dimensionality reduction
• Classification
• Link analysis ranking
• Recommendation systems
• Time-series data
• Privacy-preserving data mining

Course workload
• Three programming assignments (30%)
• Three problem sets (20%)
• Midterm exam (20%)
• Final exam (30%)
• Late assignment policy: 10% penalty per day for up to three days; no credit will be given after that
• Incompletes will not be given

Textbooks
• D. Hand, H. Mannila and P. Smyth: Principles of Data Mining. MIT Press, 2001
• Jiawei Han and Micheline Kamber: Data Mining: Concepts and Techniques. Second Edition. Morgan Kaufmann Publishers, March 2006
• Toby Segaran: Programming Collective Intelligence: Building Smart Web 2.0 Applications. O’Reilly
• Research papers (pointers will be provided)

Prerequisites
• Basic algorithms: sorting, set manipulation, hashing
• Analysis of algorithms: O-notation and its variants, perhaps some recursion equations, NP-hardness
• Programming: some programming language, ability to do small experiments reasonably quickly
• Probability: concepts of probability and conditional probability, expectations, binomial and other simple distributions
• Some linear algebra: e.g., eigenvector and eigenvalue computations

Above all
• The goal of the course is to learn and enjoy
• The basic principle is to ask questions when you don’t understand
• Say when things are unclear; not everything can be clear from the beginning
• Participate in the class as much as possible

Introduction to data mining
• Why do we need data analysis?
• What is data mining?
• Examples where data mining has been useful
• Data mining and other areas of computer science and statistics
• Some (basic) data-mining tasks

Why do we need data analysis
• Really, really lots of raw data!!
  – Moore’s law: more efficient processors, larger memories
  – Communications have improved too
  – Measurement technologies have improved dramatically
  – It is possible to store and collect lots of raw data
  – The data-analysis methods are lagging behind
• Need to analyze the raw data to extract knowledge

The data is also very complex
• Multiple types of data: tables, time series, images, graphs, etc.
• Spatial and temporal aspects
• Large number of different variables
• Lots of observations → large datasets

Example: transaction data
• Billions of real-life customers: e.g., Walmart, Safeway customers, etc.
• Billions of online customers: e.g., Amazon, Expedia, etc.

Example: document data
• Web as a document repository: billions of web pages
• Wikipedia: 4 million articles (and counting)
• Online collections of scientific articles

Example: network data
• Web: 50 billion pages linked via hyperlinks
• Facebook: 400 million users
• MySpace: 300 million users
• Instant messenger: ~1 billion users
• Blogs: 250 million blogs worldwide; presidential candidates run blogs

Example: genomic sequences
• http://www.1000genomes.org/page.php
• Full sequence of 1000 individuals
• 3×10^9 nucleotides per person → 3×10^12 nucleotides
• Lots more data in fact: medical history of the persons, gene expression data

Example: environmental data
• Climate data (just an example)
  – http://www.ncdc.gov/oa/climate/ghcn-monthly/index.php
• “a database of temperature, precipitation and pressure records managed by the National Climatic Data Center, Arizona State University and the Carbon Dioxide Information Analysis Center”
• “6000 temperature stations, 7500 precipitation stations, 2000 pressure stations”

We have large datasets…so what?
• Goal: obtain useful knowledge from large masses of data
• “Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data analyst”
• Tell me something interesting about the data; describe the data
• Exploratory analysis on large datasets

What can data-mining methods do?
• Extract frequent patterns
  – There are lots of documents that contain the phrases “association rules”, “data mining” and “efficient algorithm”
• Extract association rules
  – 80% of the Walmart customers that buy beer and sausage also buy mustard
• Extract rules
  – If occupation = PhD student then income < 20K

What can data-mining methods do?
• Rank web-query results
  – What are the most relevant web pages to the query: “Student housing BU”?
• Find good recommendations for users
  – Recommend Amazon customers new books
  – Recommend Facebook users new friends/groups
• Find groups of entities that are similar (clustering)
  – Find groups of Facebook users that have similar friends/interests
  – Find groups of Amazon users that buy similar products
  – Find groups of Walmart customers that buy similar products

Goal of this course
• Describe some problems that can be solved using data-mining methods
• Discuss the intuition behind data-mining methods that solve these problems
• Illustrate the theoretical underpinnings of these methods
• Show how these methods can be useful in practice

Data mining and related areas
• How does data mining relate to machine learning?
• How does data mining relate to statistics?
• Other related areas?

Data mining vs machine learning
• Machine learning methods are used for data
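
The association-rule example given earlier (80% of the customers that buy beer and sausage also buy mustard) is a statement about the confidence of the rule {beer, sausage} → {mustard}. A minimal sketch of how support and confidence could be computed is shown below. The function names `support` and `confidence` and the toy `baskets` list are assumptions made for illustration; they are not taken from the course material, and the baskets were invented so that the rule comes out at 80%.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(
    transactions: List[FrozenSet[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """Among transactions containing `antecedent`, the fraction that also
    contain `consequent`. Returns 0.0 if the antecedent never occurs."""
    lhs = frozenset(antecedent)
    both = lhs | frozenset(consequent)
    n_lhs = sum(1 for t in transactions if lhs <= t)
    if n_lhs == 0:
        return 0.0
    n_both = sum(1 for t in transactions if both <= t)
    return n_both / n_lhs


# Toy data (hypothetical, invented for this sketch): 5 baskets contain
# beer and sausage, and 4 of those 5 also contain mustard.
baskets = [
    frozenset({"beer", "sausage", "mustard"}),
    frozenset({"beer", "sausage", "mustard", "bread"}),
    frozenset({"beer", "sausage", "mustard"}),
    frozenset({"beer", "sausage", "mustard", "chips"}),
    frozenset({"beer", "sausage"}),
    frozenset({"milk", "bread"}),
    frozenset({"beer", "chips"}),
    frozenset({"mustard", "bread"}),
]

rule_confidence = confidence(baskets, {"beer", "sausage"}, {"mustard"})
rule_support = support(baskets, {"beer", "sausage", "mustard"})
print(f"confidence = {rule_confidence:.2f}, support = {rule_support:.2f}")
```

On this toy data the rule has confidence 4/5 and support 4/8. Each call scans all transactions once, which is fine for a small experiment but does not scale to many candidate itemsets.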

